# rke2
c
If you can’t even access the node, that sounds like something much lower level than RKE2. Have you tried getting into the node via the rescue console to see what it’s doing?
s
yes, rescue console also non-responsive… Even if I’m SSH’d into the node when it happens I still have a prompt, but typing `ls` does nothing
just hangs
I think I have a worker node in that state right now
yah one of my worker nodes is now in that state
```
[root@ip-192-168-0-10 ~]# kubectl get nodes
NAME                                                STATUS     ROLES                       AGE     VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready      control-plane,etcd,master   33m     v1.26.0+rke2r2
ip-192-168-112-162.us-gov-east-1.compute.internal   Ready      worker                      9m14s   v1.26.0+rke2r2
ip-192-168-134-112.us-gov-east-1.compute.internal   Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   27m     v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   20m     v1.26.0+rke2r2
ip-192-168-53-32.us-gov-east-1.compute.internal     Ready      worker                      9m24s   v1.26.0+rke2r2
ip-192-168-76-222.us-gov-east-1.compute.internal    Ready      worker                      9m32s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal    NotReady   <none>                      9m12s   v1.26.0+rke2r2
ip-192-168-98-61.us-gov-east-1.compute.internal     Ready      worker                      9m15s   v1.26.0+rke2r2
```
`ip-192-168-85-181.us-gov-east-1.compute.internal`
and when trying to SSH into it
```
[root@ip-192-168-0-10 ~]# ssh admin@192.168.85.181 -i /home/admin/.ssh/aws.us-gov-east-1.dev
```
it just hangs
the only thing I do on these nodes is install RKE2, and this happens during cluster startup… agreed, it must be something lower-level?
The console just hangs…
I can try to manually restart the node in AWS; it should rejoin and then I can pull logs
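(roughly what I’d run from the AWS CLI; the instance ID below is a placeholder for the hung worker)
```
# Reboot the hung worker so it can rejoin the cluster (instance ID is a placeholder)
aws ec2 reboot-instances --region us-gov-east-1 --instance-ids i-0123456789abcdef0

# If the reboot doesn't take, try a forced stop followed by a start instead
aws ec2 stop-instances --region us-gov-east-1 --instance-ids i-0123456789abcdef0 --force
aws ec2 start-instances --region us-gov-east-1 --instance-ids i-0123456789abcdef0
```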
c
If you can’t even use the serial console, that sure sounds like an issue with the kernel or something.
s
yes agreed, that’s what’s concerning: there seems to be some interaction between RKE2 and the kernel that may be causing this?
one caveat is that the RHEL 8.5 instances have STIG controls applied to them… I have done this with previous versions of RKE2 for 2+ years now, and this is the first time I’m seeing this… I’m working to restart the node now so I can pull containerd / kubelet logs
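(for reference, this is roughly where I plan to look once the node is back up; paths assume the default RKE2 data dir)
```
# RKE2 agent service logs (journald)
journalctl -u rke2-agent --no-pager --since "1 hour ago"

# containerd log under the default RKE2 data dir
less /var/lib/rancher/rke2/agent/containerd/containerd.log

# kubelet log under the default RKE2 data dir
less /var/lib/rancher/rke2/agent/logs/kubelet.log
```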
ok well AWS couldn’t stop the node so it looks like it just terminated it
Do you know if this is something I can engage Rancher professional services on? This behavior is concerning, and I think it’s important to dig in and understand why this is happening
c
if you have a support contract then yeah, you should be able to open a case about it. If the OS itself is hanging they may punt and suggest you open an issue with RHEL and/or AWS though, as what you’re experiencing is not really an issue with our product, especially since we can’t even see that it’s getting as far as running RKE2 before it hangs. And even if RKE2 is running when it hangs, that would still probably be on the OS, as it shouldn’t be possible to crash it that hard just by running a userspace process.
s
is this something that you have ever seen before? what looks to be random node crashes / hanging?
should I open an issue or discussion on github?
c
you’re welcome to, but our support staff don’t engage with that, just me and a couple other engineers and our PM
Whenever I’ve seen a node hard-lock it’s been a kernel issue
The only thing I can think of that sounds similar is https://github.com/rancher/rke2/issues/3892
s
Thank you, that’s really helpful; anything that can point me in a direction helps, because I’m pretty lost right now about this
if anything else comes to mind please feel free to point me that way, and I’ll start working on some support arrangements
c
did your hardening by any chance set /proc/sys/vm/panic_on_oom to 1?
s
checking
c
or also 2 I guess. anything other than 0
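for reference, something like this should show the panic-related knobs at once (standard sysctl names; whether your STIG baseline touches them is an assumption):
```
# Dump the panic-related sysctls a hardening baseline might have changed
sysctl vm.panic_on_oom kernel.panic kernel.panic_on_oops
```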
s
```
[root@ip-192-168-0-10 ~]# cat /proc/sys/vm/panic_on_oom
0
```
c
k so that’s not it
s
```
[root@ip-192-168-0-10 ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1023996 0       -2
```
also seeing this for swap
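(if it turns out swap should be off on these nodes, this is roughly how I’d disable it; a sketch that assumes GNU sed and a standard swap entry in /etc/fstab)
```
# Turn off all active swap devices immediately
swapoff -a

# Keep swap off across reboots by commenting out the fstab entry (backup written to /etc/fstab.bak)
sed -i.bak '/\sswap\s/s/^/#/' /etc/fstab
```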
I’m performing another rebuild now to see if I can get a node to hard-lock, and maybe restart it and try to log in
c
There is absolutely nothing on the AWS serial console when it hangs? Nothing in EC2 under “get system log” or “get instance screenshot”? And you said the serial console was also non-responsive?
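(if it’s easier than clicking through the EC2 console, something like this should pull the same data from the CLI; the instance ID is a placeholder)
```
# Fetch the buffered system/console log for the hung instance
aws ec2 get-console-output --region us-gov-east-1 --instance-id i-0123456789abcdef0 \
  --latest --query Output --output text

# Grab a point-in-time screenshot of the instance console (returned as base64-encoded JPEG)
aws ec2 get-console-screenshot --region us-gov-east-1 --instance-id i-0123456789abcdef0 \
  --query ImageData --output text | base64 -d > console.jpg
```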
s
correct, when I hit enter nothing happens
c
and both the log and screenshot are blank?
s
yes and I will confirm as well when I get another occurrence
it seems to happen about 30-40% of the time
but this is the first time I’ve seen it on a worker
would interacting w/ the API via `kubectl` while things are standing up have any effect? I know I’m probably reaching here
c
on a functioning system, there shouldn’t be anything you can do, in rke2 or otherwise, to hard-lock it as you’re describing.
s
yah agreed, it is very odd, so I’ll be digging into this and will update you for your awareness. I’ll go ahead and create an issue to start tracking it and hopefully help others
hey @creamy-pencil-82913 appreciate the discussion! just for your awareness, with absolutely no change (just re-running the one-line command we use to launch RKE2), the cluster stood up using the same configs as when I had the worker hard-lock:
```
Every 3.0s: kubectl get nodes                                                                                                        ip-192-168-0-10.us-gov-east-1.compute.internal: Tue Feb 28 13:32:54 2023

NAME                                                STATUS   ROLES                       AGE   VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready    control-plane,etcd,master   44m   v1.26.0+rke2r2
ip-192-168-105-44.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-118-161.us-gov-east-1.compute.internal   Ready    worker                      19m   v1.26.0+rke2r2
ip-192-168-134-49.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   37m   v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   31m   v1.26.0+rke2r2
ip-192-168-56-154.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-64-223.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-94-18.us-gov-east-1.compute.internal     Ready    worker                      21m   v1.26.0+rke2r2
```
h
I’m not sure if this will help you. Years ago I was setting up a Linux box (Debian) via a VPN. I could SSH in, but running `ls -l` would hang randomly. It turned out to be an MTU mismatch.
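(a quick way to rule that out, if it’s useful; the interface name and target address are placeholders)
```
# Show the configured MTU on the node's primary interface (interface name is a placeholder)
ip link show dev eth0

# Probe the path MTU: 1472 bytes of ICMP payload + 28 bytes of headers = 1500;
# with "don't fragment" set, the ping fails if anything on the path has a smaller MTU
ping -M do -s 1472 -c 3 192.168.0.10
```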