# kubernetes
h
On the cluster that is not connected, what is the output of:
kubectl get po -n cattle-system
Then look at the logs of those pods:
kubectl logs -n cattle-system <pod name>
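For example, if the agent pods carry the usual labels (that is an assumption on my part), something like:
```sh
# List everything in cattle-system, then tail the cluster-agent logs
kubectl get pods -n cattle-system -o wide
kubectl logs -n cattle-system -l app=cattle-cluster-agent --tail=100
```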
a
Thanks. So part of my problem right now is that I can't authenticate against or query those nodes or the cluster. It just hangs trying to reach the Rancher URL in the kubeconfig, or the direct IP of the control-plane node. Is there a way to access a cluster in this state that maybe I don't know of?
It just hangs with no output - almost as if the token in the downloaded kubeconfig is no longer able to authenticate because of the cluster's status.
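Even forcing a short timeout just errors out instead of connecting, e.g. (the kubeconfig filename here is just what I saved the download as):
```sh
# Times out rather than returning nodes
kubectl --kubeconfig ./downstream-cluster.yml get nodes --request-timeout=10s
```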
Hi @hundreds-evening-84071 - a bit more to update. This is the exact issue I am hitting: https://github.com/rancher/rancher/issues/41292. I don't know if/how to upgrade via the rke CLI manually - is that something you have some experience with, perhaps? I cross-posted to the vsphere channel as well in case there is anyone there with experience - no hits thus far. Thanks for any help!
h
I do not have any downstream RKE clusters... so I do not have that experience (of upgrading RKE from the Rancher UI). My Rancher cluster, however, is RKE, and this is the method I have used (over the last 4 years) to upgrade RKE via the CLI: https://rke.docs.rancher.com/upgrades#listing-supported-kubernetes-versions
Run:
rke config --list-version --all
Then update
rancher-cluster.yml
Add
kubernetes_version: <version from --list-version above>
run
rke up --config ./rancher-cluster.yml
I download the latest rke binary from here: https://github.com/rancher/rke/releases/
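Roughly like this (the version and arch in the URL are just examples - grab whichever release supports your target kubernetes version):
```sh
# Fetch an rke release binary and put it on the PATH (version here is an example)
curl -LO https://github.com/rancher/rke/releases/download/v1.3.24/rke_linux-amd64
chmod +x rke_linux-amd64
sudo mv rke_linux-amd64 /usr/local/bin/rke
rke --version
```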
a
Thank you. That is the page I am landing on at the moment, and I downloaded rke CLI 1.3.24 since it lists both my existing unsupported version and the one I want to get to (1.23) as supported with Rancher 2.7.5. For the rancher-cluster.yml file - is that located on each node, or is it something you built? I am not finding it, so that is where I am stuck at the moment.
h
that file I built when I deployed the RKE cluster...
this is what I have in it:
```yaml
nodes:
  - address: rancher1
    user: rancher
    role: [controlplane,worker,etcd]
  - address: rancher2
    user: rancher
    role: [controlplane,worker,etcd]
  - address: rancher3
    user: rancher
    role: [controlplane,worker,etcd]
 
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
 
kubernetes_version: "v1.15.9-rancher1-2"
```
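When it comes time to upgrade, the only edit is that version line, then rerun rke up - something like this (the 1.23 version string below is hypothetical; use the exact value from --list-version):
```sh
# After editing rancher-cluster.yml, e.g.:
#   kubernetes_version: "v1.23.16-rancher2-3"   # hypothetical - take it from --list-version
rke up --config ./rancher-cluster.yml
```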
a
Ok, I built through the UI initially, so I am guessing that is why I don't have it anywhere. Is it OK to assume I can take the example above, or the one on their site, and use that?
saw an example on their site earlier
h
So I have a 3-node RKE cluster in HA that runs Rancher... ignore the kubernetes version in there - that was from when I wrote my documentation many years ago.
possibly - but again I have not done what you are doing so just guessing
a
ok... I am in a bit of a pickle so I will try anything
h
good luck
a
You have confirmed some of what I was looking for. I think I will try that process above and hopefully that PSP error won't bite me. Even the UI seems to think it can update, but it complains about the PSP in that GitHub issue.
h
At least create the file and see if a backup works:
rke etcd snapshot-save --name <filename> --config /root/rancher-cluster.yml
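And if anything goes sideways mid-upgrade, the same tooling can restore that snapshot (same config path assumed):
```sh
# Restore a previously saved snapshot (same filename/config as the save command)
rke etcd snapshot-restore --name <filename> --config /root/rancher-cluster.yml
```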
a
ok, thanks, I will try that - I will let you know how I make out. I appreciate you getting back to me - this is helpful. I'll try not to be a pest; I'll dig into this and try a few things now. Thank you!
h
you are welcome - good luck and happy friday
a
thanks! right back at ya - let's hope it is a happy friday...haha
Hi @hundreds-evening-84071 - thanks for your help last week. We have gotten further with the "rke up" command. The backup command you gave me now works; however, when we try the actual upgrade, we get a TLS cert error. Initially we had to update/copy the SSH keys for the docker user to all the nodes. I was curious if you might have seen this before or have any thoughts on where to fix it. I know you haven't done this upgrade with a cluster that was created in the UI, but figured you might have some thoughts I haven't found/looked at yet:
```
INFO[0034] [etcd] Successfully started [rke-log-linker] container on host [10.227.227.71]
INFO[0034] Removing container [rke-log-linker] on host [10.227.227.71], try #1
INFO[0034] [remove/rke-log-linker] Successfully removed container on host [10.227.227.71]
INFO[0034] Image [rancher/rke-tools:v0.1.88] exists on host [10.227.227.71]
INFO[0035] Starting container [rke-log-linker] on host [10.227.227.71], try #1
INFO[0035] [etcd] Successfully started [rke-log-linker] container on host [10.227.227.71]
INFO[0035] Removing container [rke-log-linker] on host [10.227.227.71], try #1
INFO[0036] [remove/rke-log-linker] Successfully removed container on host [10.227.227.71]
INFO[0036] [etcd] Successfully started etcd plane.. Checking etcd cluster health
WARN[0139] [etcd] host [10.227.227.71] failed to check etcd health: failed to get /health for host [10.227.227.71]: Get "https://10.227.227.71:2379/health": remote error: tls: bad certificate
FATA[0139] [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.227.227.71] failed to report healthy. Check etcd container logs on each host for more information
```
a
Hi, I didn't run across that one yet. I have tried a few others... thank you for this. Let me review it, run through it, and see what it does. I'll let you know. Thank you!
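I will probably start by checking the cert etcd is actually serving on that node, plus the container logs - something like this (the cert path assumes RKE's default /etc/kubernetes/ssl layout, so treat it as an assumption):
```sh
# On the failing node: check the etcd cert's expiry and subject
# (path/naming assume RKE defaults; adjust to what is actually on disk)
openssl x509 -in /etc/kubernetes/ssl/kube-etcd-10-227-227-71.pem -noout -dates -subject
# RKE runs etcd as a container named "etcd"
docker logs etcd --tail 50
```
I also see rke has a cert rotate subcommand, which might be relevant if these turn out to be expired.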