# harvester
g
just to confirm: when you removed the node, did you delete it from Harvester and re-add it back?
d
I removed it from Harvester, then replaced the drive in the machine and reinstalled, joining the existing cluster. After installing (and rebooting), it never joined the cluster, always showing 'NotReady' and never appearing in the host list. It was originally the first node that I installed, and it feels as though it became the 'controller' (I found its IP in the rke2 config on the other nodes).
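For context, the join endpoint an rke2 node registers against normally lives in its rke2 config; a minimal sketch of checking which server IP the other nodes point at, assuming the default rke2 config paths (the IP shown is a placeholder):
```
# Main rke2 config plus any drop-in fragments on a joined node.
cat /etc/rancher/rke2/config.yaml
ls /etc/rancher/rke2/config.yaml.d/

# A line like the following names the node used as the join endpoint:
# server: https://10.0.0.11:9345
```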
g
how many nodes were there in the cluster
d
3
g
any chance of getting a support-bundle from the cluster?
there might be some info in there that could help us identify what's causing this
d
Unfortunately, not at this time. I've ended up tearing them all down and rebuilding from scratch to try other configuration scenarios. I can take down the first node and go through the same steps again to see if the issue occurs in the new build as well.
g
👍
d
What are the steps to obtain the support-bundle? I wasn't sure if this was a known issue, which is why I asked, even after tearing down the initial test cluster, for 'future knowledge'. Since it appears not, I would be more than happy to see if I can replicate the issue, or write it off as a one-off thing.
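For reference, support bundles are generated from the Harvester UI (Support, bottom-left, then "Generate Support Bundle"); a rough sketch of watching the generation from kubectl, assuming the SupportBundle objects live in harvester-system:
```
# Watch the bundle object until it reports ready; the archive itself is
# downloaded through the browser once generation completes.
kubectl get supportbundles.harvesterhci.io -n harvester-system -w
```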
d
I'll repeat the steps either this evening or in the morning (Eastern Time here), and post an update in this thread.
It seems that the problem persists after removing/rebuilding 'Node 1' from a 3-node cluster. Nodes 2 and 3 are still online and 'Healthy', however Node 1 shows as 'NotReady' and, 5-10 minutes post-reboot, still has not registered with the cluster. I created a support-bundle before performing the reload. I'm not sure if I'll be able to access the UI on the rebuilt node, but I can grab another support-bundle from the now 2-node cluster, if that would be helpful in tracing the issue.
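As a quick cross-check from one of the healthy nodes, a sketch:
```
# Node list with internal IPs; the rebuilt node, if it registered at all,
# should appear here as NotReady.
kubectl get nodes -o wide
```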
g
you should be able to access the UI from the 2-node cluster too.. it should still be running
d
Yes, I did grab a second bundle after the node was removed to compare the difference. One item I did find: there were 2 `custom-*-machine-plan` secrets which contain an applied-plan field that references the old node's IP for the server field. I updated the secrets with an IP from the remaining nodes and rebuilt Node 1 again; unfortunately, it seems a new plan secret was created when the new node tried coming online, though (currently) it's empty.
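A minimal sketch of pulling the applied plan out of one of those secrets, assuming they sit in the fleet-local namespace next to the machine objects (the secret name below is a placeholder):
```
# List the machine plan secrets for the local cluster.
kubectl get secrets -n fleet-local | grep machine-plan

# Decode the applied plan of one of them (placeholder name) and check which
# server address the plan was generated with.
kubectl get secret custom-4287b915efeb-machine-plan -n fleet-local \
  -o jsonpath='{.data.applied-plan}' | base64 -d
```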
I also noticed that the internal Longhorn images were still set to `replica 3` after removing the node. I changed that to 2 to take the volumes out of the degraded state. Unfortunately, so far, nothing has been able to bring the rebuilt Node 1 back into the cluster. Grasping at straws, I'm wondering if there's a configuration item, either on the remaining servers or within the internal K8s constructs, that is referencing the original Node 1's IP address and passing that into the node bootstrap config instead of using an IP from one of the remaining nodes.
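A sketch of that replica change, assuming it was made on the Longhorn volume objects in longhorn-system (the volume name below is a placeholder):
```
# Current desired replica counts per Longhorn volume.
kubectl get volumes.longhorn.io -n longhorn-system

# Lower a volume's desired replica count to match the remaining two nodes
# (placeholder volume name).
kubectl patch volumes.longhorn.io pvc-1234abcd -n longhorn-system \
  --type merge -p '{"spec":{"numberOfReplicas":2}}'
```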
g
so you are not able to generate a support bundle?
d
I do have 2 of them. Should I upload them here, or is there a different preferred location? I created one prior to removing the node from the cluster, as a baseline, and another one after removing it.
g
any chance i could have the 2nd bundle?
d
Slack is telling me that my support-bundle files will exceed my workspace limit.
g
ok.. how about creating a GH issue and attaching it there?
d
Will do.
g
👍
what is the status of the `clusters.cluster` resource? (`kubectl get clusters.cluster -A`)
because `tempharv1` is not in the cluster, there is little info about it in the 2nd bundle
```
(⎈ |default:default)➜  nodes k get machine -n fleet-local
NAME                  CLUSTER   NODENAME    PROVIDERID         PHASE          AGE    VERSION
custom-4287b915efeb   local     tempharv3   rke2://tempharv3   Running        123m
custom-b49899265d4e   local     tempharv2   rke2://tempharv2   Running        103m
custom-e6d3236f2c36   local                                    Provisioning   79m
```
it is trying to provision the node.. but there is not much info, so we may need to check the rke2 logs on the missing node
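A sketch of where to look, assuming the machine name from the listing above and that rke2 runs as the rke2-server service on the node (it may be rke2-agent depending on the role):
```
# From a healthy node: the stuck machine's conditions sometimes hint at
# why provisioning is not progressing.
kubectl describe machine custom-e6d3236f2c36 -n fleet-local

# On the rebuilt node itself: follow the rke2 service while it tries to join.
journalctl -u rke2-server -f    # or: journalctl -u rke2-agent -f
```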
d
```
kubectl get clusters.cluster -A
NAMESPACE     NAME    PHASE          AGE   VERSION
fleet-local   local   Provisioning   12h
```
```
kubectl get machines -n fleet-local
NAME                  CLUSTER   NODENAME    PROVIDERID         PHASE          AGE   VERSION
custom-4287b915efeb   local     tempharv3   rke2://tempharv3   Running        11h
custom-8154486041ba   local                                    Provisioning   10h
custom-b49899265d4e   local     tempharv2   rke2://tempharv2   Running        11h
```
Obtained an SSH session to the 'new' node, looking for the rke2 logs and will share those once located.
```
/var/lib/rancher/rke2/agent/logs/kubelet.log
/var/lib/rancher/rke2/agent/containerd/containerd.log
```
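For instance, a sketch of scanning those logs for registration or connectivity errors:
```
# Recent errors in the kubelet and containerd logs on the rebuilt node.
grep -iE 'error|fail' /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 50
grep -iE 'error|fail' /var/lib/rancher/rke2/agent/containerd/containerd.log | tail -n 50
```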
```
ping registry-1.docker.io
PING registry-1.docker.io (44.207.51.64) 56(84) bytes of data
```
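ICMP reaching the registry doesn't guarantee image pulls work; a sketch of checking HTTPS reachability and a manual pull through rke2's containerd, assuming the default rke2 socket path:
```
# HTTPS reachability of the registry endpoint (a 401 here is still a good
# sign: it means the registry answered).
curl -sI https://registry-1.docker.io/v2/ | head -n 1

# Try pulling a small image through rke2's bundled crictl.
/var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
  pull docker.io/library/busybox:latest
```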