# rke2
w
If i restart a node in HA mode, the server comes back but i start seeing kube-system cilium-gcmwc 1/1 Running 1 (<invalid> ago) 15m. Why am i seeing <invalid>? i can’t see anything wrong in the logs but maybe i’m missing something
c
Check the system time. That usually means that the last pod restart would be in the future according to the current system clock
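For example, you can compare the node clock against the container start time that kubectl is rendering (pod name taken from your output; the jsonpath field is the standard pod status field):

```bash
# current time on the node, in UTC
date -u
# when the API server thinks the cilium container last started
kubectl get pod -n kube-system cilium-gcmwc \
  -o jsonpath='{.status.containerStatuses[*].state.running.startedAt}{"\n"}'
# if that timestamp is ahead of `date -u`, the age goes negative and kubectl prints <invalid>
```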
w
yeah, i did system time looks fine
what’s weird is I’ll do a cordon and kubectl get pods -A says the pods are all running on that server but they can’t be bc i stopped the rke2 service
c
Stopping the service doesn't stop the pods. Nor does it delete them.
Eventually the node will go not ready but the pods will remain in their last state until the node is deleted or the pods are force deleted.
Without the kubelet running to report that the pods are actually gone they won't change state. The cluster has no way of knowing what is actually going on. It could still be up and running but there is a network outage, for example.
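If you really need them gone while the node is down, it has to be an explicit action, e.g. something along these lines (names are placeholders):

```bash
# remove the node object entirely; its pod records get garbage collected
kubectl delete node <node-name>
# or force-delete a single stuck pod without waiting on the (dead) kubelet
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
```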
w
yeah it’s already in “NotReady” state
“NotReady,SchedulingDisabled”
c
It cannot reason about something it is not communicating with.
This is how Kubernetes works
w
got it, so red herring on the invalid part
but wouldn’t the pods disappear from the get pods list if i drain the node?
yeah, i started the node back up and the services came back it seemed. it went from <invalid> to status “Unknown”. looking at the pod with describe I see “Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name “cilium-gcmec_kube-system_….” name is reserved for “c8ef…”
c
you can’t drain the node while it’s down. you would need to do that first, before stopping it.
w
yeah, i did that before hand
i’ve tried both
c
what exactly are you trying to accomplish?
if you want it drained, drain it first. then stop the service. Note that not all types of pods will get drained, so there will likely be some that remain on the node even after draining.
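Roughly (node name is a placeholder):

```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# daemonset pods (cilium etc.) and static/mirror pods stay behind; that's expected
systemctl stop rke2-server
```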
w
so essentially what i’m doing is testing out kube-vip to make sure it’s performing HA failover as expected… I’ve done it two ways: once by forcefully shutting a server down and getting it back up and working (the only reliable way to get rid of all these errors is to delete the whole node and rejoin it), and then I also tried it by cordoning it, rebooting, and uncordoning it.
It’s 3 nodes all master nodes
kube-vip set up on them
like right now it’s in a state where the kube-controller-manager-master1 is in status “unknown”.
c
I suspect there’s something odd with your kube-vip config then. It is absolutely possible to shut down one or more nodes of an HA cluster and have everything come back up cleanly without needing to delete anything.
Either that or something else is wrong. Aside from whatever you have going on with kube-vip this is super basic stuff.
w
if i run kubectl logs -n kube-system kube-controller-manager-master1 it shows “failed to try resolving symlinks in path”
yeah, it is super basic, i’m literally copy and pasting configs from kube-vip website and rke2 website
and it’s flaky
only reliable way to get the node back into a normal state is to delete it and rejoin
c
that is not an error I have ever seen before
are you setting your nodes up weird?
w
nope not at all
i open the ports i need for cilium, set the rke2 repo up, run "dnf install rke2-server"
edit the /etc/rancher/rke2/config.yaml
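roughly this, nothing fancy (repo already set up per the rke2 docs):

```bash
dnf install -y rke2-server
# config.yaml written before first start, contents below
systemctl enable --now rke2-server.service
```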
c
why is it complaining about symlinks though. Are your filesystems set up weird? Are you missing stuff from /var/log after a reboot?
w
add cni: cilium, the token, server: https://myvip.example.com, and tls-san: with the hostname of that server and myvip.example.com, and that’s it.
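so the whole file is basically this (token and hostnames are placeholders for what i actually use):

```yaml
# /etc/rancher/rke2/config.yaml on a joining server
cni: cilium
token: <shared-cluster-token>
server: https://myvip.example.com:9345   # rke2 supervisor port per the docs
tls-san:
  - master3.example.com                  # this server's hostname
  - myvip.example.com                    # the VIP
```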
c
it should just be looking at things under /var/log/pods with symlinks under /var/log/containers. If you have something that is clearing that out on reboot then yes, it may be confused. The kubelet uses those log files to track container restart count, and the files are not expected to be removed when the node restarts.
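Easy enough to eyeball; the layout is `<namespace>_<pod>_<uid>/<container>/N.log`, so something like:

```bash
# symlinks here should point at real files under /var/log/pods
ls -l /var/log/containers/ | head
# the per-container log files the kubelet uses to track restarts
ls /var/log/pods/kube-system_kube-controller-manager-master1_*/kube-controller-manager/
```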
w
yup, it is… /var/log/pods/kube-system_cloud-controller-manager…/cloud-controller-manager/2.log
i check that folder and 2.log isn’t there
but i’m not deleting it
is there some persistence thing im missing?
i assumed the default config would just persist things on the nodes local disk
c
it does, yes
what distro are you running this on? how are your filesystems laid out?
w
rhel9
it’s literally the most basic config
100g drive, /boot, /boot/efi, /
like that log file for that pod is not there…but i’m not deleting it
ok, i’m resetting this node…did kubectl delete node <master3>
kubectl get pods -A shows no master3 pods running
all other pods on master1 and 2 are running and doesn’t seem to be complaining
c
we test every release on that distro… so I know it’ll work out of the box.
w
so now i’m going to add this node back
yeah, i mean it works…until it reboots
c
what are you using as the server address when joining nodes?
w
ranchervip.example.com is in the kube-vip config
and refers to the vip ip
c
and you’re sure that kube-vip isn’t advertising that vip on this node as soon as it comes up? so that it ends up talking to itself somehow?
w
i mean this node isn’t the system that has the vip attached to it
ok, master3 is joining back
c
what is the VIP attached to then, if not all three of your servers?
w
i see pods spinning up for things like cilium, cloud-controller-manager etc
it’s currently attached to one of the nodes
master2
so master2 is advertising that it’s receiving traffic for the ranchervip.example.com
ok, all 3 nodes are marked as “ready”. kubectl get pods shows master3 with all the common kube-system pods
c
that is how it should work
w
yup
now if i reboot master3 it goes to hell on master3
no errors no nothing that i can see
all “running” or completed status of all pods
i’m shutting master3 down
ok i’m doing a watch on kubectl get nodes
it is now marked as “NotReady”
server is off
pods still show they are running on master3 but you said that’s expected
i’m doing nothing but turning on master3
ok, it’s back and says ready…
back to <invalid>
c
did everything get cleaned out of /var/log/pods again?
although I suspect there’s more to it than that. I can reboot nodes all day long and the pods don’t go to <invalid>.
Can you verify that the time sync is correct when the node boots? It’s not like, coming up, and starting rke2, and then a bit later ntp kicks off and changes the system time?
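Something like this right after a boot will show whether the clock gets stepped (exact log text depends on whether you’re running chronyd or systemd-timesyncd):

```bash
timedatectl status                       # "System clock synchronized" / NTP service lines
journalctl -b -u chronyd                 # look for the clock being stepped or slewed after startup
journalctl -b -o short-precise | grep -iE 'chrony|timesync|clock'
```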
w
nope the log files are there
c
the journald logs should make that pretty obvious
w
i mean the date is right on the server and it’s only been up like 3 min or so
c
doesn’t matter if it is right now. does the time in the logs jump at some point after boot
you can dump the logs out of journald and share them somewhere if you’d like another set of eyes on them
w
ohhhh wtf weird…you are right…
it’s showing 11PM and the logs jump to different times
wtf
it’s a proxmox vm…
c
yep. the hardware clock is not synced. so it boots with one clock, and then ntp kicks in after it’s been up for a bit and time jumps forward or back.
that will break things. you need consistent system time
w
but shouldn’t it repair it self after the clock is fixed and so restart the rke2 services?
c
not really no. Like I said that’s why you are getting stuff like invalid pod restart times, because the pod was restarted “in the future” because time went backwards - which very much needs to not happen.
you will also get things like certs that aren’t valid yet because they were issued with a NotBefore date in the future.
fix your system time sync and your problems will go away.
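On rhel9 that usually just means making sure chrony is actually running and synced, something like:

```bash
systemctl enable --now chronyd
timedatectl set-ntp true
chronyc tracking    # want "Leap status : Normal" and a small system time offset
```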
w
got it, makes sense
c
or at the very least modify the rke2-server systemd unit to depend on
time-sync.target
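e.g. a drop-in along these lines (file name is arbitrary; on rhel you may also need to enable chrony-wait.service so time-sync.target actually waits for a synced clock):

```ini
# /etc/systemd/system/rke2-server.service.d/10-wait-for-time-sync.conf
[Unit]
Wants=time-sync.target
After=time-sync.target
```

then `systemctl daemon-reload` and restart rke2-server.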
w
i’ve never noticed this
which is wild
that looks like it fixed it all. it seems the default option in proxmox isn’t passing the time through correctly
have to explicitly set it to “yes”
@creamy-pencil-82913 thanks for the pointer