
dazzling-businessperson-64789

05/23/2023, 3:40 PM
Greetings, I have a few questions about upgrading Rancher and Kubernetes. We have a Rancher cluster consisting of 3 servers and an imported RKE cluster consisting of 9 machines. These machines were installed some time ago with CentOS 7.9 as the operating system, running Rancher 2.6.2 and Kubernetes 1.21. Access to the Rancher web UI was configured to use AD credentials (local logins were also available). Now the OS on the servers has been upgraded to Rocky Linux 8.7, and suddenly the AD logins no longer work, and logging in with local credentials is not possible either. In the logs of the NGINX containers I see connection attempts, but they time out.
1. Is it possible that the OS upgrade broke things for Rancher, and if so, how would one proceed to regain access to the cluster?
2. How should one upgrade Rancher to the latest available stable version? I'm not sure how it was installed here, as there's no documentation available.
3. Which tools need to be installed on the Rancher cluster and the RKE cluster to perform the upgrade?

hundreds-evening-84071

05/23/2023, 3:57 PM
So it was an in-place OS upgrade from CentOS 7.9 to Rocky 8.7? Did you take a cluster backup before the upgrade?
1) What Kubernetes distribution is Rancher running on? (`kubectl get nodes`)
2) I would have created a new cluster on the newer OS and used the Rancher backup operator: https://ranchermanager.docs.rancher.com/reference-guides/backup-restore-configuration/examples#backup
3) Follow the upgrade documentation: https://rke.docs.rancher.com/upgrades

red-waitress-37932

05/23/2023, 3:59 PM
Oww, I would have just created new nodes and phased out the old ones completely instead of upgrading the clusters: less downtime, easier to go back if something fails along the way, and less baggage from the old OS. EDIT: essentially what dc.901 said 🙂
Do you still have a non-LDAP/AD login for the Rancher web UI?

dazzling-businessperson-64789

05/24/2023, 5:38 AM
@hundreds-evening-84071, yes, the upgrade was in place. I did a backup from the UI before I first attempted to upgrade Rancher. And, of course, there's a nightly backup running for the whole machine.
1. The kubectl command did not return (ServerTimeout).
2. We will create a new cluster for sure, but for now my task is to make this cluster available again 😞
@red-waitress-37932, I have credentials for the local admin, but as I said, they don't work either. In the logs of the NGINX containers I see messages that indicate a timeout: `upstream timed out (110: Operation timed out) while connecting to upstream`. So my guess is that one or more containers are in an error state, or that not all the containers that are needed are up and running. Is there a way to see which containers SHOULD be running?
BTW, I just saw that none of the running containers has an IP address 🤔
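On the question of which containers should be running: a minimal sketch, assuming the nodes were provisioned with RKE1 and that it names its core service containers `etcd`, `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, `kubelet`, `kube-proxy`, and (on nodes without the controlplane role) `nginx-proxy`. These names and the role mapping are assumptions, not taken from this thread, so verify them against a known-good node. The helper names are hypothetical.

```python
# Assumed RKE1 service container names per node role (verify on a healthy node).
EXPECTED_BY_ROLE = {
    "etcd": {"etcd"},
    "controlplane": {"kube-apiserver", "kube-controller-manager", "kube-scheduler"},
    "worker": set(),
}

def expected_containers(roles):
    """Return the set of service containers assumed to run on a node with these roles."""
    expected = set()
    for role in roles:
        expected |= EXPECTED_BY_ROLE[role]
    # Assumption: kubelet and kube-proxy run on every node, whatever its role.
    expected |= {"kubelet", "kube-proxy"}
    # Assumption: nodes without the controlplane role run nginx-proxy
    # to reach the API servers.
    if "controlplane" not in roles:
        expected.add("nginx-proxy")
    return expected

def missing_containers(docker_ps_names, roles):
    """Compare container names (one per line) against the expected set."""
    running = {line.strip() for line in docker_ps_names.splitlines() if line.strip()}
    return sorted(expected_containers(roles) - running)
```

The input would be the output of `docker ps --format '{{.Names}}'` on the node, and the roles would come from the cluster's `cluster.yml`, if that file can still be found.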

hundreds-evening-84071

05/24/2023, 12:16 PM
I am a bit confused, so please help me understand. You are not able to log in to the Rancher UI, and the kubectl command returns ServerTimeout? Then how can you see that some containers are running and some are not, and whether or not they have IPs?

dazzling-businessperson-64789

05/24/2023, 1:14 PM
Yes, the UI is returning 504 and the kubectl command returns ServerTimeout. On the command line, with `docker ps` I see the running containers (but I have no idea whether those are all the containers that should be running or whether more are needed). And with `docker inspect <CONTAINER_ID>` I can see that there's no IP address set in the output.
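A missing IP address in `docker inspect` is not necessarily a fault. Containers started with host networking, or attached to another container's network namespace (as Kubernetes pod containers typically are under Docker), report no address of their own. The sketch below, with hypothetical helper names, reads the JSON that `docker inspect` prints and shows each container's network mode next to any addresses it has, so the two cases can be told apart.

```python
import json

def summarize_network(inspect_json):
    """Map container name -> (network mode, list of IPs) from `docker inspect` JSON."""
    summary = {}
    for container in json.loads(inspect_json):
        name = container.get("Name", "").lstrip("/")
        mode = container.get("HostConfig", {}).get("NetworkMode", "")
        networks = (container.get("NetworkSettings") or {}).get("Networks") or {}
        ips = [net.get("IPAddress") for net in networks.values() if net.get("IPAddress")]
        summary[name] = (mode, ips)
    return summary

def ip_expected(network_mode):
    """Return False for modes where Docker itself assigns no container IP."""
    # "host": shares the node's interfaces.
    # "container:<id>": shares another container's network namespace.
    # "none": no Docker-managed network (a CNI plugin may still configure one).
    if network_mode in ("host", "none"):
        return False
    return not network_mode.startswith("container:")
```

Feeding it the output of `docker inspect $(docker ps -q)` and flagging only the entries where `ip_expected(mode)` is true but the IP list is empty should narrow things down to containers that really lack an address.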

hundreds-evening-84071

05/24/2023, 5:49 PM
Oh I see, you are looking at Docker, so that means it's an RKE cluster. One thing you can do is spin up another RKE cluster, deploy Rancher there, and compare the two setups. Sorry, I do not have anything better to tell you at this point.

dazzling-businessperson-64789

05/24/2023, 6:21 PM
A friend of mine just gave me the hint that some certificates might have changed due to the OS upgrade, and that this might be causing the issues. Could that be a possible cause?

hundreds-evening-84071

05/24/2023, 10:27 PM
It is possible, although I do not know what the easiest path to find (and resolve) the issue would be. Personally, I would spin up a fresh environment and compare which Docker containers are up and which are not. Then look at `docker logs <container-name>`; that may hopefully shed some light.
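The comparison suggested above can be sketched as a set difference over container names from the fresh node and the broken one. One assumption to flag: Kubernetes-managed containers under Docker are usually named `k8s_<container>_<pod>_<namespace>_<uid>_<attempt>`, and the pod name and UID differ between clusters, so this sketch reduces such names to `<container>@<namespace>` before comparing. The function names are hypothetical.

```python
def normalize(name):
    """Reduce a k8s-style Docker container name to '<container>@<namespace>'."""
    parts = name.split("_")
    # Assumed layout: k8s_<container>_<pod>_<namespace>_<uid>_<attempt>
    if parts[0] == "k8s" and len(parts) >= 4:
        return f"{parts[1]}@{parts[3]}"
    return name

def parse_names(docker_ps_output):
    """Turn one-name-per-line output into a set of normalized names."""
    return {normalize(line.strip()) for line in docker_ps_output.splitlines() if line.strip()}

def compare_nodes(reference_output, broken_output):
    """Report containers present on only one of the two nodes."""
    reference = parse_names(reference_output)
    broken = parse_names(broken_output)
    return {
        "missing": sorted(reference - broken),      # on the fresh node only
        "unexpected": sorted(broken - reference),   # on the broken node only
    }
```

Both inputs would be the output of `docker ps --format '{{.Names}}'`, captured on nodes with the same role. Anything listed under `missing` is a candidate for `docker ps -a` and `docker logs <container-name>`.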

dazzling-businessperson-64789

05/25/2023, 5:06 AM
thanks, will give it a try

rough-farmer-49135

05/25/2023, 2:53 PM
Something I'd try is to go through the prerequisite setup instructions for Rancher and the downstream cluster to make sure that you still have firewalld turned off, still have the NetworkManager exceptions, SELinux is still as you expect, the kernel settings (sysctl) are still as expected, and all that jazz.
If none of that works, I believe there's an #rke channel, and they might know more. I've used RKE2 and K3s (mostly through k3d), but not RKE, so I don't know much about it.
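The prerequisite checks mentioned above could be scripted roughly as follows. This is a sketch under stated assumptions: it takes the text printed by `systemctl is-active firewalld`, `getenforce`, and `sysctl -a` as plain strings, and the required sysctl value shown (`net.bridge.bridge-nf-call-iptables = 1`) is an assumption based on the usual RKE node requirements rather than something stated in this thread. The expected SELinux mode has to be supplied by whoever knows how the nodes were originally set up.

```python
# Assumed requirement; adjust to whatever the original node setup used.
REQUIRED_SYSCTL = {"net.bridge.bridge-nf-call-iptables": "1"}

def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by `sysctl -a`, into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values

def check_node(firewalld_state, selinux_mode, sysctl_text, expected_selinux,
               required_sysctl=REQUIRED_SYSCTL):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if firewalld_state.strip() == "active":
        problems.append("firewalld is active")
    if selinux_mode.strip() != expected_selinux:
        problems.append(f"SELinux is {selinux_mode.strip()}, expected {expected_selinux}")
    actual = parse_sysctl(sysctl_text)
    for key, wanted in required_sysctl.items():
        if actual.get(key) != wanted:
            problems.append(f"{key} is {actual.get(key)}, expected {wanted}")
    return problems
```

An in-place jump from CentOS 7 to Rocky 8 can reset any of these, so running the check on every node, not just one, seems worthwhile.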