#rke2

shy-zebra-53074

02/28/2023, 4:20 PM
Good morning all! I’ve been struggling with this for the past couple of weeks and would love any insight. I’m deploying RKE2 on AWS using AMIs whose build and deployment I automate with Packer. The RHEL 8.5 OS configuration is fully automated and minimized from a minimal ISO install (setting up the etcd user, etc.), and I’m deploying a 3-master HA configuration. What I’m seeing is that sometimes (not every time) one of the master nodes (could be A, B, or C) hangs. I can’t SSH in, I can’t run any commands against the node, and it never recovers. I have to log in to the AWS console and manually stop and start the node, at which point it joins and I can add workers. I’m trying to make sure I’m compliant with the supported runtime and requirements exactly as they should be, since this will be a large production cluster.
Has anyone else had this issue or seen it before? My process, entirely automated with Ansible, is:
1) Start Master A, wait 5m
2) Add the join token to B/C
3) Start Master B, wait 5m
4) Start Master C, wait 5m
As I said, this doesn’t happen every time. One thing I was doing is SSHing into Master A and running watch -n1 "kubectl get nodes", so I’m not sure whether querying the API server while things are coming up could have this effect. I also provided all of the logs to AWS support to review for OOM errors, but nothing looked out of the ordinary.
RKE2 Version: 1.26.0+rke2r2
OS: RHEL 8.5 - minimal install
Should I maybe add more time (more than 5m) between starting each master node?
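For reference, the join step I’m automating boils down to roughly this (a minimal sketch using the standard RKE2 config path and supervisor port 9345; the token and IP are placeholders):

# On Master A (first server)
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
token: <shared-join-token>
EOF
systemctl enable --now rke2-server

# On Masters B and C, after the wait
mkdir -p /etc/rancher/rke2
cat <<'EOF' > /etc/rancher/rke2/config.yaml
server: https://<master-a-ip>:9345
token: <shared-join-token>
EOF
systemctl enable --now rke2-server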

creamy-pencil-82913

02/28/2023, 4:58 PM
If you can’t even access the node, that sounds like something much lower level than RKE2. Have you tried getting into the node via the rescue console to see what it’s doing?

shy-zebra-53074

02/28/2023, 5:01 PM
Yes, the rescue console is also non-responsive… Even if I’m already SSH’d into the node when it happens, I still have a prompt, but typing ls does nothing - it just hangs.
I think I have a worker node in that state right now.
Yeah, one of my worker nodes is now in that state:
[root@ip-192-168-0-10 ~]# kubectl get nodes
NAME                                                STATUS     ROLES                       AGE     VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready      control-plane,etcd,master   33m     v1.26.0+rke2r2
ip-192-168-112-162.us-gov-east-1.compute.internal   Ready      worker                      9m14s   v1.26.0+rke2r2
ip-192-168-134-112.us-gov-east-1.compute.internal   Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   27m     v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   20m     v1.26.0+rke2r2
ip-192-168-53-32.us-gov-east-1.compute.internal     Ready      worker                      9m24s   v1.26.0+rke2r2
ip-192-168-76-222.us-gov-east-1.compute.internal    Ready      worker                      9m32s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal    NotReady   <none>                      9m12s   v1.26.0+rke2r2
ip-192-168-98-61.us-gov-east-1.compute.internal     Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal is the one that’s stuck
and when trying to SSH into it
[root@ip-192-168-0-10 ~]# ssh admin@192.168.85.181 -i /home/admin/.ssh/aws.us-gov-east-1.dev
it just hangs
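About all I can capture from a healthy master is the apiserver’s view of the stuck node before I recycle it - something like:

kubectl describe node ip-192-168-85-181.us-gov-east-1.compute.internal
kubectl get events -A --field-selector involvedObject.name=ip-192-168-85-181.us-gov-east-1.compute.internal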
The only thing I do on these nodes is install RKE2, and this happens during cluster startup… agreed, it must be something lower-level?
[screenshot: the EC2 serial console, hung]
The console just hangs….
I can try manually restarting the node in AWS; it should join, and then I can pull logs.

creamy-pencil-82913

02/28/2023, 5:25 PM
If you can’t even use the serial console, that sure sounds like an issue with the kernel or something.

shy-zebra-53074

02/28/2023, 5:26 PM
Yes, agreed. What’s concerning is that there seems to be some interaction between RKE2 and the kernel that may be causing this?
One caveat is that the RHEL 8.5 instances have STIG controls applied to them… I’ve done this with previous versions of RKE2 for 2+ years now, and this is the first time I’m seeing it… I’m working on restarting the node now so I can pull the containerd / kubelet logs.
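For reference, once it’s back these are the standard RKE2 locations I’d pull from (rke2-agent instead of rke2-server on workers):

journalctl -u rke2-server --no-pager
less /var/lib/rancher/rke2/agent/containerd/containerd.log
less /var/lib/rancher/rke2/agent/logs/kubelet.log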
OK, well, AWS couldn’t stop the node, so it looks like it just terminated it.
Do you know if this is something I can engage Rancher professional services on? This behavior is concerning, and I think it’s important to dig in and understand why it’s happening.

creamy-pencil-82913

02/28/2023, 5:47 PM
If you have a support contract then yeah, you should be able to open a case about it. If the OS itself is hanging they may punt and suggest you open an issue with RHEL and/or AWS though, as what you’re experiencing is not really an issue with our product - especially since we can’t even see that it’s getting as far as running RKE2 before it hangs. And even if RKE2 is running when it hangs, that would still probably be on the OS, as it shouldn’t be possible to crash it that hard just by running a userspace process.

shy-zebra-53074

02/28/2023, 5:49 PM
Is this something you’ve ever seen before - what look to be random node crashes / hangs?
Should I open an issue or discussion on GitHub?

creamy-pencil-82913

02/28/2023, 5:52 PM
You’re welcome to, but our support staff don’t engage with that - just me, a couple of other engineers, and our PM.
Whenever I’ve seen a node hard-lock it’s been a kernel issue
The only thing I can think of that sounds similar is https://github.com/rancher/rke2/issues/3892
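If you can pull anything off it after the forced power cycle, the previous boot’s kernel log is worth a look - roughly this, assuming the journal is (or is made) persistent:

mkdir -p /var/log/journal && systemctl restart systemd-journald
# after the node comes back from the power cycle:
journalctl -k -b -1 --no-pager | tail -n 200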

shy-zebra-53074

02/28/2023, 5:57 PM
Thank you, that’s really helpful - anything that points me in a direction helps, because I’m pretty lost about this right now.
If anything else comes to mind, please feel free to point me at it, and I’ll start working on support arrangements.

creamy-pencil-82913

02/28/2023, 5:57 PM
did your hardening by any chance set /proc/sys/vm/panic_on_oom to 1?

shy-zebra-53074

02/28/2023, 5:57 PM
checking

creamy-pencil-82913

02/28/2023, 5:58 PM
or 2, I guess - anything other than 0
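Something like this shows that knob and the related ones in one shot (standard sysctl names; vm.panic_on_oom and kernel.panic_on_oops are on/off-style switches, kernel.panic is a reboot delay in seconds):

sysctl vm.panic_on_oom kernel.panic kernel.panic_on_oops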

shy-zebra-53074

02/28/2023, 5:58 PM
[root@ip-192-168-0-10 ~]# cat /proc/sys/vm/panic_on_oom
0

creamy-pencil-82913

02/28/2023, 5:58 PM
k so that’s not it

shy-zebra-53074

02/28/2023, 5:59 PM
[root@ip-192-168-0-10 ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1023996 0       -2
I’m also seeing swap enabled, per the above.
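(Kubernetes generally expects swap to be off on nodes; if that turns out to matter here, disabling it is roughly:)

swapoff -a
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab   # comment out the swap entry so it stays off after reboot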
I’m performing another rebuild now to see if I can get a node to hard-lock, and then maybe restart it and try to log in.

creamy-pencil-82913

02/28/2023, 6:00 PM
There is absolutely nothing on the AWS serial console when it hangs? Nothing in EC2 under “get system log” or “get instance screenshot”? And you said the serial console was also non-responsive?
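(The CLI equivalents, if it’s easier to script the capture - the instance ID is a placeholder, and the screenshot comes back base64-encoded:)

aws ec2 get-console-output --instance-id i-0123456789abcdef0 --latest --output text
aws ec2 get-console-screenshot --instance-id i-0123456789abcdef0 --wake-up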

shy-zebra-53074

02/28/2023, 6:01 PM
Correct - when I hit enter, nothing.

creamy-pencil-82913

02/28/2023, 6:01 PM
and both the log and screenshot are blank?

shy-zebra-53074

02/28/2023, 6:01 PM
Yes, and I’ll confirm that as well when I get another occurrence.
It seems to happen about 30-40% of the time, but this is the first time I’ve seen it on a worker.
Would interacting with the API via kubectl while things are standing up have any effect? I know I’m probably reaching here.

creamy-pencil-82913

02/28/2023, 6:05 PM
On a functioning system there shouldn’t be anything you can do, in RKE2 or otherwise, to hard-lock it the way you’re describing.

shy-zebra-53074

02/28/2023, 6:06 PM
Yeah, agreed, it’s very odd. I’ll be digging into this and will update you for your awareness; I’ll go ahead and create an issue to start tracking it and hopefully help others.
Hey @creamy-pencil-82913, appreciate the discussion! Just for your awareness: with absolutely no change (just re-running the one-line command we use to launch RKE2), the cluster stood up using the same configs as when the worker hard-locked:
Every 3.0s: kubectl get nodes                                                                                                        ip-192-168-0-10.us-gov-east-1.compute.internal: Tue Feb 28 13:32:54 2023

NAME                                                STATUS   ROLES                       AGE   VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready    control-plane,etcd,master   44m   v1.26.0+rke2r2
ip-192-168-105-44.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-118-161.us-gov-east-1.compute.internal   Ready    worker                      19m   v1.26.0+rke2r2
ip-192-168-134-49.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   37m   v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   31m   v1.26.0+rke2r2
ip-192-168-56-154.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-64-223.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-94-18.us-gov-east-1.compute.internal     Ready    worker                      21m   v1.26.0+rke2r2

hundreds-airport-66196

03/02/2023, 1:41 AM
I’m not sure if this will help you. Years ago I was setting up a Linux box (Debian) over a VPN. I could SSH in, but running "ls -l" would hang randomly. It turned out to be an MTU mismatch.
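(If you want to rule that out here, checking the interface MTU and probing the path with don’t-fragment pings is quick - the interface name and peer IP below are just examples:)

ip link show eth0 | grep mtu
# don't-fragment ping sized for a 1500-byte MTU path (1472 payload + 28 bytes of headers);
# AWS instances often run 9001, in which case try -s 8973
ping -M do -s 1472 -c 3 192.168.16.10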