# harvester
a
b
no idea what I'm talking about but: kubectl describe node <node-name>? might give you more info
check the dashboard for sure for any resource constraints... mine were mainly around disk pressure but I assume CPU and mem could be the issue as well.
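if it helps, a rough way to check those pressure conditions from the CLI (node name is just a placeholder):
```
# Look at the node's condition block; DiskPressure / MemoryPressure / PIDPressure
# should all be False on a healthy node.
kubectl describe node <node-name> | grep -A 10 "Conditions:"

# Rough per-node CPU/memory usage (only works if metrics-server is running).
kubectl top node
```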
c
Thanks @bright-fireman-42144: I tried both options but didn't find any specific reason
All resources are looking good
It seems that the pod "apply-system-agent-upgrader" is failing with a "no route to host" message. Any idea what this pod does and how to stop this auto upgrade?
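In case it helps, this is the generic way to locate that pod and read the full error (namespace/pod name below are placeholders):
```
# Find which namespace the apply-system-agent-upgrader pod runs in.
kubectl get pods -A | grep apply-system-agent-upgrader

# Then check its events and logs for the full "no route to host" error,
# replacing <namespace>/<pod-name> with what the command above returns.
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name>
```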
b
again, no clue. What ns? I'll check mine.
g
what is the version of harvester?
c
1.1.0
g
what is the spec of the nodes?
on the node where you see the failed status for kubelet
are you able to check the logs for rke2-agent
journalctl -fu rke2-agent
and also
journalctl -fu rke2-server
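if following live is awkward, something like this should dump a recent window to files you can share (time range is just an example):
```
# Dump the last hour of agent/server logs to files for sharing.
journalctl -u rke2-agent --since "1 hour ago" --no-pager > rke2-agent.log
journalctl -u rke2-server --since "1 hour ago" --no-pager > rke2-server.log
```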
c
I am not able to SSH into the failed node. However, when I run the journalctl commands on another master node in the cluster, I get these logs
journalctl -fu rke2-server
-- Logs begin at Wed 2022-12-21 08:06:17 UTC. --
Feb 08 00:24:30 iaas-node-001 rke2[3271]: time="2023-02-08T00:24:30Z" level=info msg="Event(v1.ObjectReference{Kind:\"HelmChart\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"31bcd068-9c8f-4440-904d-adb5b6bf5d88\", APIVersion:\"helm.cattle.io/v1\", ResourceVersion:\"317\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-rke2-coredns"
Feb 08 00:24:30 iaas-node-001 rke2[3271]: time="2023-02-08T00:24:30Z" level=info msg="Event(v1.ObjectReference{Kind:\"HelmChart\", Namespace:\"kube-system\", Name:\"rke2-multus\", UID:\"e6c9ddf0-8ee9-42b2-8145-fe28bc72e166\", APIVersion:\"helm.cattle.io/v1\", ResourceVersion:\"383\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyJob' Applying HelmChart using Job kube-system/helm-install-rke2-multus"
g
i would need to check what is going on in the failed node
any specific reason you can't ssh into it?
c
Looks like it lost connectivity as I am not even able to ping it anymore
g
that would explain the kubelet error in the cluster
that is likely to be the reason for the error
c
Actually, the issue is that we have had a 3-node cluster for almost 3-4 months and we keep facing this same issue with this node only. It works for a few weeks and then suddenly loses connectivity.
Our network team analyzed everything and didn't find any issue. They said it could be a Harvester-specific issue
g
that is hard to say without looking at the logs from the failed node
this should collect a lot of OS-specific info
supportconfig -k -c
and once the node is up, please also generate a Harvester support bundle
it is hard to pinpoint anything without the logs
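rough sketch of what I mean; the output path is from memory and may differ between releases:
```
# On the affected node, once it is reachable again:
supportconfig -k -c

# The archive usually lands under /var/log (e.g. scc_<hostname>_<date>.txz);
# exact name/path can differ between releases.
ls -lh /var/log/scc_*
```
the Harvester support bundle itself can be generated from the UI (there should be a Generate Support Bundle option on the Support page)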
c
Yes....