
crooked-scooter-58172

02/07/2023, 10:45 PM
We are using a 3-node cluster for our development. The setup is working fine; however, one specific node keeps getting into a "*Cordoned*" state with the message "*Kubelet stopped posting node status*". I've tried a lot but haven't been able to nail down the actual issue with that specific node. Can someone please give me some pointers for troubleshooting? Is there any tool I can use for the troubleshooting?

bright-fireman-42144

02/07/2023, 11:05 PM
no idea what I'm talking about but: kubectl describe node <node-name>? might give you more info
check the dashboard for sure for any resource constraints... mine were mainly around disk pressure but I assume CPU and mem could be the issue as well.
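(For reference, a minimal sketch of checking the conditions and resource pressure mentioned above; <node-name> is a placeholder, and the last command assumes metrics-server is available:)
# Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
kubectl describe node <node-name> | grep -A 12 "Conditions:"
# Events recorded against the node, which often include the cordon/NotReady reason
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>
# Current CPU/memory usage per node (requires metrics-server)
kubectl top node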

crooked-scooter-58172

02/07/2023, 11:35 PM
Thanks @bright-fireman-42144: I tried both options but didn't find any specific reason
image.png
All resources are looking good
It seems that the pod "apply-system-agent-upgrader" is failing with a "no route to host" message. Any idea what this pod does and how to stop this auto upgrade?

bright-fireman-42144

02/08/2023, 12:36 AM
again, no clue. What ns? I'll check mine.
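(A quick way to answer the namespace question and inspect the failing pod; <namespace> and <pod-name> are placeholders to fill in from the first command's output — on Rancher-managed clusters this upgrader usually runs in cattle-system, but confirm it:)
# Find the upgrader pod and the namespace it runs in
kubectl get pods -A | grep apply-system-agent-upgrader
# Then pull its logs and events for the "no route to host" error
kubectl -n <namespace> logs <pod-name>
kubectl -n <namespace> describe pod <pod-name>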

great-bear-19718

02/08/2023, 12:41 AM
what is the version of harvester?

crooked-scooter-58172

02/08/2023, 12:41 AM
1.1.0

great-bear-19718

02/08/2023, 12:41 AM
what is the spec of the nodes?

crooked-scooter-58172

02/08/2023, 12:45 AM
image.png
image.png

great-bear-19718

02/08/2023, 1:00 AM
on the node where you see the failed status for kubelet
are you able to check the logs for rke2-agent
journalctl -fu rke2-agent
and also
journalctl -fu rke2-server
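(On an RKE2 node the kubelet and containerd run under the rke2 service, so their logs are files rather than separate systemd units — a minimal sketch assuming the default RKE2 paths used by Harvester:)
# Kubelet log on an RKE2 node
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
# Containerd log, useful if the kubelet itself looks healthy
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log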

crooked-scooter-58172

02/08/2023, 1:03 AM
I am not able to ssh into the failed node. However, when I run the journalctl commands on another master node in the cluster, I am getting these logs:
journalctl -fu rke2-server
-- Logs begin at Wed 2022-12-21 08:06:17 UTC. --
Feb 08 00:24:30 iaas-node-001 rke2[3271]: time="2023-02-08T00:24:30Z" level=info msg="Event(v1.ObjectReference{Kind:\"HelmChart\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"31bcd068-9c8f-4440-904d-adb5b6bf5d88\", APIVersion:\"helm.cattle.io/v1\", ResourceVersion:\"317\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-rke2-coredns"
Feb 08 00:24:30 iaas-node-001 rke2[3271]: time="2023-02-08T00:24:30Z" level=info msg="Event(v1.ObjectReference{Kind:\"HelmChart\", Namespace:\"kube-system\", Name:\"rke2-multus\", UID:\"e6c9ddf0-8ee9-42b2-8145-fe28bc72e166\", APIVersion:\"helm.cattle.io/v1\", ResourceVersion:\"383\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-rke2-multus"

great-bear-19718

02/08/2023, 1:04 AM
i would need to check what is going on in the failed node
any specific reason you can't ssh into it?

crooked-scooter-58172

02/08/2023, 1:04 AM
Looks like it lost connectivity as I am not even able to ping it anymore

great-bear-19718

02/08/2023, 1:05 AM
that would explain the kubelet error in the cluster
that is likely the reason for the error

crooked-scooter-58172

02/08/2023, 1:06 AM
Actually the issue is that we have had this 3-node cluster for almost 3-4 months, and we are facing this issue only with this node. It works for a few weeks and then suddenly loses connectivity.
Our network team analyzed everything and didn't find any issue. They said it could be a Harvester-specific issue.

great-bear-19718

02/08/2023, 1:06 AM
that is hard to say without looking at the logs from the failed node
this should collect a lot of OS-specific info:
supportconfig -k -c
and once the node is up.. please also generate a Harvester support bundle
it is hard to pinpoint anything without the logs
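(A minimal sketch for collecting those artifacts once the node is reachable again; it assumes supportconfig's default output location under /var/log and a reachable workstation to copy to — the Harvester support bundle itself is typically generated from the Support page in the Harvester UI:)
# Locate the most recent supportconfig archive (name/location may vary by version)
ls -lt /var/log/scc_*.txz | head -n 1
# Copy it off the node for sharing (replace the user/host with your own)
scp /var/log/scc_*.txz user@workstation:/tmp/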

crooked-scooter-58172

02/08/2023, 1:08 AM
Yes....