damp-vegetable-48645

07/07/2022, 4:43 PM
I have been unable to find documentation regarding removing/replacing physical hosts within a Harvester cluster. I built out a test cluster with undersized drives and didn't realize I could just add a drive to expand onto, so I took a node out and rebuilt it on the new drive. Even after numerous rebuilds it won't join the cluster (the node has been completely removed from the cluster), and putting the original drive back in isn't yielding any better results. Since I can see node failures/replacements being a normal part of the lifecycle, the ability to roll nodes in/out of a running cluster is imperative, and it's something I'd like to test to get a feel for the level of effort (LOE) involved.
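
For context, a minimal sketch of the usual drain-and-delete flow before rebuilding a host, assuming kubectl access from a surviving node and a hypothetical node name tempharv1:
# Move workloads off the host first (node name is an assumption)
kubectl drain tempharv1 --ignore-daemonsets --delete-emptydir-data
# Remove the node object; the host can then be reinstalled and re-joined
kubectl delete node tempharv1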

great-bear-19718

07/08/2022, 1:46 AM
just to confirm: when you removed the node, you deleted it from Harvester and then re-added it?

damp-vegetable-48645

07/08/2022, 1:47 AM
I removed it from Harvester and then replaced the drive in the machine and reinstalled, joining the existing cluster. After installing (and rebooting), it never joined the cluster, always showing 'NotReady' and never appearing in the host list. It was, originally, the first node that I installed, and it feels as though it became the 'controller' (as I found its IP in the rke2 config on the other nodes).
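
To see which server IP the surviving nodes are pointed at, a rough sketch (the directory is the standard rke2 config location; the exact file names under it may vary):
# On a remaining node, look for the server: entry used when joining the cluster
sudo grep -r "server:" /etc/rancher/rke2/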

great-bear-19718

07/08/2022, 1:48 AM
how many nodes were there in the cluster

damp-vegetable-48645

07/08/2022, 1:48 AM
3

great-bear-19718

07/08/2022, 1:48 AM
any chance to get a support-bundle from the cluster
there might be some info in there that might help us identify what could be causing this

damp-vegetable-48645

07/08/2022, 1:49 AM
Unfortunately, not at this time. I've ended up tearing them all down and rebuilding from scratch to try other configuration scenarios. I can take down the first node and go through the same steps again to see if the issue occurs in the new build as well.

great-bear-19718

07/08/2022, 1:50 AM
👍

damp-vegetable-48645

07/08/2022, 1:51 AM
What are the steps to obtain the support-bundle? I wasn't sure if this was a known issue, which is why I asked, even after tearing down the initial test cluster, for 'future knowledge'. Since it appears not, I would be more than happy to see if I can replicate the issue, or write it off as a one-off thing.

damp-vegetable-48645

07/08/2022, 1:53 AM
I'll repeat the steps either this evening or in the morning (Eastern Time here), and post an update in this thread.
It seems that the problem persists after removing/rebuilding 'Node 1' from a 3-node cluster. Nodes 2 and 3 are still online and 'Healthy', however Node 1 shows as 'NotReady' and, 5-10 minutes post-reboot, is still not registered with the cluster. I created a support-bundle before performing the reload. I'm not sure if I'll be able to access the UI on the rebuilt node, but I can grab another support-bundle from the now 2-node cluster, if that could be helpful in tracing the issue.
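
While the rebuilt host is stuck, a small sketch of what can be checked from a healthy node, assuming kubectl access (fleet-local is the namespace where the machine records show up later in this thread):
# Node readiness as seen by Kubernetes
kubectl get nodes -o wide
# Provisioning phase of each machine record
kubectl get machines -n fleet-local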

great-bear-19718

07/08/2022, 2:52 AM
you should be able to access the UI from the 2-node cluster too.. it should still be running

damp-vegetable-48645

07/08/2022, 2:56 AM
Yes, I did grab a second bundle after the node was removed to compare the difference. One item I did find was that there were 2 custom-*-machine-plan secrets which contain an applied-plan field referencing the old node's IP for the server field. I updated the secrets with an IP from the remaining nodes and rebuilt Node 1 again; unfortunately, it seems a new plan secret was created when the new node tried coming online, though (currently) it's empty.
I also noticed that the internal Longhorn images were still set to replica 3 after removing the node. I changed that to 2 to take the volumes out of the degraded state. Unfortunately, so far, nothing has been able to bring the rebuilt node #1 back into the cluster. Grasping at straws, I'm wondering if there's a configuration item, either on the remaining servers or within the internal K8s construct, that is referencing the original Node #1's IP address and passing that into the node bootstrap config, instead of using an IP from one of the remaining nodes.
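
A hedged sketch of how the plan secrets and Longhorn replica counts mentioned above could be inspected; the secret name below is invented to match the custom-*-machine-plan pattern, and field names may differ by version:
# List the machine plan secrets in the provisioning namespace
kubectl -n fleet-local get secrets | grep machine-plan
# Dump one and look through the (base64-encoded) plan data for the old node's IP
kubectl -n fleet-local get secret custom-4287b915efeb-machine-plan -o yaml
# Longhorn replica counts live on the volume objects in longhorn-system
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas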

great-bear-19718

07/08/2022, 3:00 AM
so you are not able to generate a support bundle?

damp-vegetable-48645

07/08/2022, 3:01 AM
I do have 2 of them. Should I upload them here, or is there a different preferred location? I created one prior to removing the node from the cluster, as a baseline. And created another one after removing the node from the cluster.

great-bear-19718

07/08/2022, 3:04 AM
any chance i could have the 2nd bundle?

damp-vegetable-48645

07/08/2022, 3:05 AM
Slack is telling me that my support-bundle files will exceed my workspace limit.

great-bear-19718

07/08/2022, 3:05 AM
ok.. how about creating a GH issue and attaching it there?

damp-vegetable-48645

07/08/2022, 3:05 AM
Will do.

great-bear-19718

07/08/2022, 3:11 AM
👍
what is the status of the kubectl get clusters.cluster -A resource? because tempharv1 is not in the cluster, there is little info about it in the 2nd bundle
(⎈ |default:default)➜  nodes k get machine -n fleet-local
NAME                  CLUSTER   NODENAME    PROVIDERID         PHASE          AGE    VERSION
custom-4287b915efeb   local     tempharv3   rke2://tempharv3   Running        123m
custom-b49899265d4e   local     tempharv2   rke2://tempharv2   Running        103m
custom-e6d3236f2c36   local                                    Provisioning   79m
it is trying to provision the node.. but there is not much info, so we may need to check the rke2 logs on the missing node
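
If the machine stays in Provisioning, a rough sketch of where to look next, assuming SSH access to the stuck node (the machine name is taken from the output above; the service names are the standard rke2/Rancher agent units):
# Cluster side: events and conditions on the stuck machine object
kubectl -n fleet-local describe machine custom-e6d3236f2c36
# Node side: logs of the services that register the host with the cluster
journalctl -u rancher-system-agent --no-pager | tail -n 100
journalctl -u rke2-server -u rke2-agent --no-pager | tail -n 200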

damp-vegetable-48645

07/08/2022, 1:16 PM
kubectl get clusters.cluster -A
NAMESPACE     NAME    PHASE          AGE   VERSION
fleet-local   local   Provisioning   12h
kubectl get machines -n fleet-local
NAME                  CLUSTER   NODENAME    PROVIDERID         PHASE          AGE   VERSION
custom-4287b915efeb   local     tempharv3   rke2://tempharv3   Running        11h
custom-8154486041ba   local                                    Provisioning   10h
custom-b49899265d4e   local     tempharv2   rke2://tempharv2   Running        11h
Obtained an SSH session to the 'new' node, looking for the rke2 logs and will share those once located.
/var/lib/rancher/rke2/agent/logs/kubelet.log
/var/lib/rancher/rke2/agent/containerd/containerd.log
ping registry-1.docker.io
PING registry-1.docker.io (44.207.51.64) 56(84) bytes of data
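
A short sketch of pulling the agent logs listed above and sanity-checking registry reachability from the node; the paths come from the thread, while the curl check is an assumed way to verify HTTPS connectivity:
# Tail the kubelet and containerd logs for join/registration errors
tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
# HTTPS check against the registry; a 401 response still proves connectivity
curl -sI https://registry-1.docker.io/v2/ | head -n 1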