This message was deleted Rancher Users #rke2

Join Slack

This message was deleted.

# rke2

adamant-kite-43734

02/28/2023, 4:20 PM

This message was deleted.

creamy-pencil-82913

02/28/2023, 4:58 PM

If you can’t even access the node, that sounds like something much lower level than RKE2. Have you tried getting into the node via the rescue console to see what it’s doing?

shy-zebra-53074

02/28/2023, 5:01 PM

yes, rescue console also non-responsive… Even if I’m SSH’d into the node when it happens I still have a prompt but typing

ls

does nothing

shy-zebra-53074

02/28/2023, 5:01 PM

just hangs

shy-zebra-53074

02/28/2023, 5:02 PM

I think I have a worker node in that state right now

shy-zebra-53074

02/28/2023, 5:02 PM

yah one of my worker nodes is now in that state

shy-zebra-53074

02/28/2023, 5:02 PM

Copy code

[root@ip-192-168-0-10 ~]# kubectl get nodes
NAME                                                STATUS     ROLES                       AGE     VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready      control-plane,etcd,master   33m     v1.26.0+rke2r2
ip-192-168-112-162.us-gov-east-1.compute.internal   Ready      worker                      9m14s   v1.26.0+rke2r2
ip-192-168-134-112.us-gov-east-1.compute.internal   Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   27m     v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   20m     v1.26.0+rke2r2
ip-192-168-53-32.us-gov-east-1.compute.internal     Ready      worker                      9m24s   v1.26.0+rke2r2
ip-192-168-76-222.us-gov-east-1.compute.internal    Ready      worker                      9m32s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal    NotReady   <none>                      9m12s   v1.26.0+rke2r2
ip-192-168-98-61.us-gov-east-1.compute.internal     Ready      worker                      9m15s   v1.26.0+rke2r2

shy-zebra-53074

02/28/2023, 5:03 PM

ip-192-168-85-181.us-gov-east-1.compute.internal

shy-zebra-53074

02/28/2023, 5:03 PM

and when trying to SSH into it

Copy code

[root@ip-192-168-0-10 ~]# ssh admin@192.168.85.181 -i /home/admin/.ssh/aws.us-gov-east-1.dev

shy-zebra-53074

02/28/2023, 5:03 PM

it just hangs

shy-zebra-53074

02/28/2023, 5:04 PM

the only thing I do on these nodes is install RKE2 and this happens during the cluster startup… agreed it must be something lower-level?

shy-zebra-53074

02/28/2023, 5:06 PM

The console just hangs….

shy-zebra-53074

02/28/2023, 5:10 PM

I can try and manually restart the node in AWS, it should join and get logs

creamy-pencil-82913

02/28/2023, 5:25 PM

If you can’t even use the serial console, that sure sounds like an issue with the kernel or something.

shy-zebra-53074

02/28/2023, 5:26 PM

yes agreed, that’s what’s concerning is there seems to be some interaction w/ RKE2 and the kernel that may be causing this?

shy-zebra-53074

02/28/2023, 5:27 PM

one caveat is that the RHEL 8.5 instances run STIG controls against them… I have done this with previous version of RKE2 for 2+ years now, this is the first time I’m seeing this… I’m working to restart the node now so I can pull containerd / kubelet logs

shy-zebra-53074

02/28/2023, 5:28 PM

ok well AWS couldn’t stop the node so it looks like it just terminated it

shy-zebra-53074

02/28/2023, 5:30 PM

Do you know is this something I can engage w/ Rancher professional services on? This behavior is concerning and I think it’s important to dig in and understand why this is happening

creamy-pencil-82913

02/28/2023, 5:47 PM

if you have a support contract then yeah you should be able to open a case about it. If the OS itself is hanging they may punt and suggest you open an issue with RHEL and/or AWS though, as what you’re experiencing not really an issue with our product - especially since we can’t even see that it’s getting as far as running RKE2 before it hangs. And even if RKE2 is running when it hangs, that would still probably be on the OS as it shouldn’t be possible to crash it that hard just running a userspace process.

shy-zebra-53074

02/28/2023, 5:49 PM

is this something that you have ever seen before? what looks to be random node crashes / hanging?

shy-zebra-53074

02/28/2023, 5:49 PM

should I open an issue or discussion on github?

creamy-pencil-82913

02/28/2023, 5:52 PM

you’re welcome to, but it our support staff don’t engage with that, just me and a couple other engineers and our PM

creamy-pencil-82913

02/28/2023, 5:52 PM

Whenever I’ve seen a node hard-lock it’s been a kernel issue

creamy-pencil-82913

02/28/2023, 5:53 PM

The only thing I can think of that sounds similar is https://github.com/rancher/rke2/issues/3892

shy-zebra-53074

02/28/2023, 5:57 PM

Thank you that’s really helpful, at least anything that can point me in any direction because I’m pretty lost right now about this

shy-zebra-53074

02/28/2023, 5:57 PM

if anything else comes to mind please feel free to point me in that direction and I’ll start working some support arrangements

creamy-pencil-82913

02/28/2023, 5:57 PM

did your hardening by any chance set /proc/sys/vm/panic_on_oom to 1?

shy-zebra-53074

02/28/2023, 5:57 PM

checking

creamy-pencil-82913

02/28/2023, 5:58 PM

or also 2 I guess. anything other than 0

shy-zebra-53074

02/28/2023, 5:58 PM

Copy code

[root@ip-192-168-0-10 ~]# cat /proc/sys/vm/panic_on_oom
0

creamy-pencil-82913

02/28/2023, 5:58 PM

k so that’s not it

shy-zebra-53074

02/28/2023, 5:59 PM

Copy code

[root@ip-192-168-0-10 ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       1023996 0       -2

shy-zebra-53074

02/28/2023, 5:59 PM

also seeing this for swap

shy-zebra-53074

02/28/2023, 6:00 PM

I’m performing another rebuild now to see if i can get a node to hard-lock and maybe restart it and try and login

creamy-pencil-82913

02/28/2023, 6:00 PM

There is absolutely nothing on the AWS serial console when it hangs? Nothing in EC2 under “get system log” or “get instance screenshot”? And you said the serial console was also non-responsive?

shy-zebra-53074

02/28/2023, 6:01 PM

correct, when I hit enter nothing

creamy-pencil-82913

02/28/2023, 6:01 PM

and both the log and screenshot are blank?

shy-zebra-53074

02/28/2023, 6:01 PM

yes and I will confirm as well when I get another occurrence

shy-zebra-53074

02/28/2023, 6:02 PM

it seems to happen about 30-40% of the time

shy-zebra-53074

02/28/2023, 6:02 PM

but this is the first time I’ve seen it on a worker

shy-zebra-53074

02/28/2023, 6:03 PM

would interacting w/ the API via

kubectl

while things are standing up have any effect? I know I’m probably reaching here

creamy-pencil-82913

02/28/2023, 6:05 PM

on a functioning system, there shouldn’t be anything you can do, in rke2 or otherwise, to hard-lock it as you’re describing.

shy-zebra-53074

02/28/2023, 6:06 PM

yah agreed, it is very odd so I’ll be digging into this and will update you for your awareness, I’ll go ahead and create an issue to start tracking and hopefully help others

shy-zebra-53074

02/28/2023, 6:34 PM

hey @creamy-pencil-82913 appreciate the discussion! just for your awareness with absolutely no change (just re-running the one-line command we use to launch RKE2) the cluster stood up using same configs as when I had the worker hard-lock:

Copy code

Every 3.0s: kubectl get nodes                                                                                                        ip-192-168-0-10.us-gov-east-1.compute.internal: Tue Feb 28 13:32:54 2023

NAME                                                STATUS   ROLES                       AGE   VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready    control-plane,etcd,master   44m   v1.26.0+rke2r2
ip-192-168-105-44.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-118-161.us-gov-east-1.compute.internal   Ready    worker                      19m   v1.26.0+rke2r2
ip-192-168-134-49.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   37m   v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   31m   v1.26.0+rke2r2
ip-192-168-56-154.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-64-223.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-94-18.us-gov-east-1.compute.internal     Ready    worker                      21m   v1.26.0+rke2r2

hundreds-airport-66196

03/02/2023, 1:41 AM

Im not sure if this will help you. Years ago I was setting up a linux box (Debian) via a VPN. I can ssh but doing an "ls -l" will hang randomly. It turns out an MTU mismatch.

37 Views

Open in Slack

Previous Next