# vsphere
s
how many control plane nodes do you have?
l
3 control plane nodes, but none of them is running now
s
you are getting this error when? scaling up? can you explain the scenario
l
I edited the cluster config to increase the node CPU and memory, then the new nodes were stuck in "Waiting for init node". I deleted those nodes, and after that it started throwing the above error.
s
oh okay, got it. so what happens is: you deploy control plane nodes in odd numbers because of the Raft consensus algorithm that etcd uses (with 3 members, etcd needs a quorum of 2, so it only tolerates losing 1). the etcd state was on those nodes, and when they got deleted (while stuck at the init step) that state got deleted as well. now, assuming you deleted all three nodes, the replacement nodes come up, try to find the existing etcd plane, can't, and so they ask for a backup to restore etcd from
so one thing you can try: go to the Rancher backup and restore section in the documentation and check whether a snapshot folder was created on your control plane nodes
if it is present then you can try restoring it; otherwise scale these nodes down to one, then delete that last node so a fresh control plane node gets created, then scale back to 3
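(A minimal sketch of that check, assuming a Rancher-provisioned RKE1 cluster, where recurring snapshots land in /opt/rke/etcd-snapshots by default; on RKE2 the default is /var/lib/rancher/rke2/server/db/snapshots. The SSH user and IP below are placeholders.)

```bash
# Look for locally stored etcd snapshots on each control plane node.
# Default snapshot dirs: RKE1 -> /opt/rke/etcd-snapshots
#                        RKE2 -> /var/lib/rancher/rke2/server/db/snapshots
ssh <user>@<control-plane-ip> 'ls -lh /opt/rke/etcd-snapshots'
```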
l
And since no backup is available, it's impossible to restore, right? What do you suggest to avoid this situation in the future? How many nodes should I pick, considering the application load was handled well by 3 nodes with 16 GB and 8 vCPUs each?
s
well, to be honest it's better to segregate the roles so that etcd runs separately; that way etcd will not get affected if you change any config on the master nodes. second thing: when provisioning a cluster you can take backups of etcd, so at least you'll sort of have a state to go back to in case of errors. also, which platform are you using to provision this cluster on?
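(A minimal sketch of enabling recurring etcd snapshots, assuming an RKE1-style cluster config; the same options appear under the cluster's YAML in the Rancher UI. The interval and retention values here are only examples.)

```yaml
# cluster.yml excerpt: recurring etcd snapshots every 12 hours, keep the last 6
services:
  etcd:
    backup_config:
      enabled: true
      interval_hours: 12
      retention: 6
```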
l
Running on vSphere, and running a separate etcd is currently not possible.
s
okay, and how are you assigning IPs to the VMs? I'm assuming you are using vApps?
do you have sufficient IPs and resources on your datastore? I'm just trying to solve this node-stuck-at-init problem for you first
l
Yes, it is automatically taking IPs from the network pool
s
my setup is also the same, with three master + etcd role nodes, and I have enabled the backup for etcd
that works for me
this problem only arises when you either delete all the nodes or scale down to an even number of control plane nodes
l
Where are you storing the backups?
s
well as of now locally only
l
So currently I can go with an odd number of nodes, right? And taking regular backups of etcd will save me from such problems in the future.
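(For the regular-backup part, a hedged sketch: besides recurring snapshots, you can take a one-off snapshot right before risky changes. The command assumes an RKE1 CLI-managed cluster; on a Rancher-UI-provisioned cluster the equivalent is the cluster's "Take Snapshot" action. The snapshot name is a placeholder.)

```bash
# One-off etcd snapshot before editing the cluster config (RKE1 CLI)
rke etcd snapshot-save --config cluster.yml --name pre-resize
```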
s
yes bro. also, when you change the config of the control plane nodes, never delete all the nodes at once, as that removes the etcd plane and you will face the same error
just keep an eye on the vSphere logs of any node that gets stuck in the init phase
the logs on vSphere give more clarity on why it happens
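(A hedged sketch of a pre-check before deleting any control plane node, assuming an RKE1 node where etcd runs in a Docker container named etcd, which is how Rancher's own docs inspect it; run this on one of the remaining control plane nodes.)

```bash
# Confirm the remaining etcd members are present and healthy
docker exec etcd etcdctl member list
docker exec etcd etcdctl endpoint health
```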
l
Ok, thanks a lot for the help. Any idea what the reason is for the node getting stuck in init?
s
well, it gets stuck in the init phase when it doesn't get proper resources. seems more like a vSphere-related error to me, because when Rancher shows "waiting for agent to apply initial plan", that's the point where vCenter provides network to the node using the cloud-init file
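(A small sketch of where to look on the stuck VM itself, via the vSphere console if SSH isn't up yet; cloud-init's status command and log paths are standard.)

```bash
# Did cloud-init finish, hang, or error out?
cloud-init status --long
# Output of the user-data / vApp provisioning steps
tail -n 50 /var/log/cloud-init-output.log
# cloud-init's own internal log
tail -n 50 /var/log/cloud-init.log
```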
feel free to reach out bro i will be around trying to make this community more active
👍 2
l
How can I go about segregating the etcd nodes on vSphere? Any suggestions?
I tried scaling the nodes down to 1 and then deleting the last node, but the new node is still stuck with the same "etcd error".
s
bro you have to create 3 separate pools for the 3 roles: master, worker, and etcd
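(A rough sketch of what segregated pools look like in Rancher's RKE2 provisioning YAML; in the RKE1 UI it's the same idea as three node pools with one role checkbox each. Pool names, quantities, and the vSphere machine config name are placeholders.)

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
spec:
  rkeConfig:
    machinePools:
      - name: etcd-pool          # dedicated etcd nodes
        quantity: 3              # keep this odd for quorum
        etcdRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-vsphere-template   # placeholder
      - name: cp-pool            # control plane only
        quantity: 2
        controlPlaneRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-vsphere-template
      - name: worker-pool        # workloads
        quantity: 3
        workerRole: true
        machineConfigRef:
          kind: VmwarevsphereConfig
          name: my-vsphere-template
```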
l
Then how do they all connect? Will that be done automatically?
s
yes bro, they connect automatically
Rancher configures them all by itself
l
Perfect, I will give it a try. Thanks 👍
s
yeah 😄