# rke2
w
If i restart a node in HA mode, the server comes back but i start seeing kube-system cilium-gcmwc 1/1 Running 1 (<invalid> ago) 15m. Why am i seeing <invalid>? i can’t see anything wrong in the logs but maybe i’m missing something
c
Check the system time. That usually means that the last pod restart would be in the future according to the current system clock
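For example, you can compare the node clock against the container start time that kubectl is rendering (pod name taken from your output; the jsonpath field is the standard pod status field):

```bash
# current time on the node, in UTC
date -u
# when the API server thinks the cilium container last started
kubectl get pod -n kube-system cilium-gcmwc \
  -o jsonpath='{.status.containerStatuses[*].state.running.startedAt}{"\n"}'
# if that timestamp is ahead of `date -u`, the age goes negative and kubectl prints <invalid>
```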
w
yeah, i did system time looks fine
what’s weird is I’ll do a cordon and kubectl get pods -A says the pods are all running on that server but they can’t be bc i stopped the rke2 service
c
Stopping the service doesn't stop the pods. Nor does it delete them.
Eventually the node will go not ready but the pods will remain in their last state until the node is deleted or the pods are force deleted.
Without the kubelet running to report that the pods are actually gone they won't change state. The cluster has no way of knowing what is actually going on. It could still be up and running but there is a network outage, for example.
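If you really need them gone while the node is down, it has to be an explicit action, e.g. something along these lines (names are placeholders):

```bash
# remove the node object entirely; its pod records get garbage collected
kubectl delete node <node-name>
# or force-delete a single stuck pod without waiting on the (dead) kubelet
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
```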
w
yeah it’s already in “NotReady” state
“NotReady,SchedulingDisabled”
c
It cannot reason about something it is not communicating with.
This is how Kubernetes works
w
got it, so red herring on the invalid part
but wouldn’t the pods disappear from the get pods list if i drain the node?
yeah, i started the node back up and the services came back it seemed. it went from <invalid> to status “Unknown”. looking at the pod with describe I see “Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name “cilium-gcmec_kube-system_….” name is reserved for “c8ef…”
c
you can’t drain the node while it’s down. you would need to do that first, before stopping it.
w
yeah, i did that before hand
i’ve tried both
c
what exactly are you trying to accomplish?
if you want it drained, drain it first. then stop the service. Note that not all types of pods will get drained, so there will likely be some that remain on the node even after draining.
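Roughly (node name is a placeholder):

```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# daemonset pods (cilium etc.) and static/mirror pods stay behind; that's expected
systemctl stop rke2-server
```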
w
so essentially what i’m doing is testing out kube-vip to make sure it’s performing HA failover as expected… I’ve done it two ways: once by forcefully shutting a server down and getting it back up and working (the only reliable way to get rid of all these errors is to delete the whole node and rejoin it), and then I also tried it by cordoning it, rebooting, and uncordoning it.
It’s 3 nodes all master nodes
kube-vip set up on them
like right now it’s in a state where the kube-controller-manager-master1 is in status “unknown”.
c
I suspect there’s something odd with your kube-vip config then. It is absolutely possible to shut down one or more nodes of an HA cluster and have everything come back up cleanly without needing to delete anything.
Either that or something else is wrong. Aside from whatever you have going on with kube-vip this is super basic stuff.
w
if i run kubectl logs -n kube-system kube-controller-manager-master1 it shows “failed to try resolving symlinks in path”
yeah, it is super basic, i’m literally copy and pasting configs from kube-vip website and rke2 website
and it’s flaky
only reliable way to get the node back into a normal state is to delete it and rejoin
c
that is not an error I have ever seen before
are you setting your nodes up weird?
w
nope not at all
i open the ports i need for cilium, set the rke2 repo up, run "dnf install rke2-server"
edit the /etc/rancher/rke2/config.yaml
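roughly this, nothing fancy (repo already set up per the rke2 docs):

```bash
dnf install -y rke2-server
# config.yaml written before first start, contents below
systemctl enable --now rke2-server.service
```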
c
why is it complaining about symlinks though. Are your filesystems set up weird? Are you missing stuff from /var/log after a reboot?
w
add cni: cilium, the token, server: https://myvip.example.com, and tls-san: with the hostname of that server and myvip.example.com, and that’s it.
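so the whole file is basically this (token and hostnames are placeholders for what i actually use):

```yaml
# /etc/rancher/rke2/config.yaml on a joining server
cni: cilium
token: <shared-cluster-token>
server: https://myvip.example.com:9345   # rke2 supervisor port per the docs
tls-san:
  - master3.example.com                  # this server's hostname
  - myvip.example.com                    # the VIP
```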
c
it should just be looking at things under /var/log/pods with symlinks under /var/log/containers. If you have something that is clearing that out on reboot then yes, it may be confused. The kubelet uses those log files to track container restart count, and the files are not expected to be removed when the node restarts.
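Easy enough to eyeball; the layout is `<namespace>_<pod>_<uid>/<container>/N.log`, so something like:

```bash
# symlinks here should point at real files under /var/log/pods
ls -l /var/log/containers/ | head
# the per-container log files the kubelet uses to track restarts
ls /var/log/pods/kube-system_kube-controller-manager-master1_*/kube-controller-manager/
```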
w
yup, it is… /var/log/pods/kube-system_cloud-controller-manager…/cloud-controller-manager/2.log
i check that folder and 2.log isn’t there
but i’m not deleting it
is there some persistence thing im missing?
i assumed the default config would just persist things on the nodes local disk
c
it does, yes
what distro are you running this on? how are your filesystems laid out?
w
rhel9
it’s literally the most basic config
100g drive, /boot, /boot/efi, /
like that log file for that pod is not there…but i’m not deleting it
ok, i’m resetting this node…did kubectl delete node <master3>
kubectl get pods -A shows no master3 pods running
all other pods on master1 and 2 are running and doesn’t seem to be complaining
c
we test every release on that distro… so I know it’ll work out of the box.
w
so now i’m going to add this node back
yeah, i mean it works…until it reboots
c
what are you using as the server address when joining nodes?
w
ranchervip.example.com is in the kube-vip config
and refers to the vip ip
c
and you’re sure that kube-vip isn’t advertising that vip on this node as soon as it comes up? so that it ends up talking to itself somehow?
w
i mean this node isn’t the system that has the vip attached to it
ok, master3 is joining back
c
what is the VIP attached to then, if not all three of your servers?
w
i see pods spinning up for things like cilium, cloud-controller-manager etc
it’s currently attached to one of the nodes
master2
so master2 is advertising that it’s receiving traffic for the ranchervip.example.com
ok, all 3 nodes are marked as “ready”. kubectl get pods shows master3 with all the common kube-system pods
c
that is how it should work
w
yup
now if i reboot master3 it goes to hell on master3
no errors no nothing that i can see
all “running” or completed status of all pods
i’m shutting master3 down
ok i’m doing a watch on kubectl get nodes
it is now marked as “NotReady”
server is off
pods still show they are running on master3 but you said that’s expected
i’m doing nothing but turning on master3
ok, it’s back and says ready…
back to <invalid>
c
did everything get cleaned out of /var/log/pods again?
although I suspect there’s more to it than that. I can reboot nodes all day long and the pods don’t go to <invalid>.
Can you verify that the time sync is correct when the node boots? It’s not like, coming up, and starting rke2, and then a bit later ntp kicks off and changes the system time?
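Something like this right after a boot will show whether the clock gets stepped (exact log text depends on whether you’re running chronyd or systemd-timesyncd):

```bash
timedatectl status                       # "System clock synchronized" / NTP service lines
journalctl -b -u chronyd                 # look for the clock being stepped or slewed after startup
journalctl -b -o short-precise | grep -iE 'chrony|timesync|clock'
```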
w
nope the log files are there
c
the journald logs should make that pretty obvious
w
i mean the date is right on the server and it’s only been up like 3 min or so
c
doesn’t matter if it is right now. does the time in the logs jump at some point after boot
you can dump the logs out of journald and share them somewhere if you’d like another set of eyes on them
w
ohhhh wtf weird…you are right…
it’s showing 11PM and the logs jump to different times
wtf
it’s a proxmox vm…
c
yep. the hardware clock is not synced. so it boots with one clock, and then ntp kicks in after it’s been up for a bit and time jumps forward or back.
that will break things. you need consistent system time
w
but shouldn’t it repair it self after the clock is fixed and so restart the rke2 services?
c
not really no. Like I said that’s why you are getting stuff like invalid pod restart times, because the pod was restarted “in the future” because time went backwards - which very much needs to not happen.
you will also get things like certs that aren’t valid yet because they were issued with a NotBefore date in the future.
fix your system time sync and your problems will go away.
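On rhel9 that usually just means making sure chrony is actually running and synced, something like:

```bash
systemctl enable --now chronyd
timedatectl set-ntp true
chronyc tracking    # want "Leap status : Normal" and a small system time offset
```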
w
got it, makes sense
c
or at the very least modify the rke2-server systemd unit to depend on
time-sync.target
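e.g. a drop-in along these lines (file name is arbitrary; on rhel you may also need to enable chrony-wait.service so time-sync.target actually waits for a synced clock):

```ini
# /etc/systemd/system/rke2-server.service.d/10-wait-for-time-sync.conf
[Unit]
Wants=time-sync.target
After=time-sync.target
```

then `systemctl daemon-reload` and restart rke2-server.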
w
i’ve never noticed this
which is wild
that looks like it fixed it all. it seems the default option in proxmox isn’t passing the time through correctly
have to explicitly set it to “yes”
@creamy-pencil-82913 thanks for the pointer