creamy-autumn-87994

05/10/2023, 1:12 AM
Has anyone tried changing the template of an RKE2 cluster agent pool on vSphere? It attempts the upgrade and creates a node, but just stops there; it doesn't register the node or do anything. I'm able to manually delete a node and it'll auto-generate a new one using the old template, but that doesn't affect the update process. The update process is broken.

agreeable-oil-87482

05/10/2023, 6:24 AM
I've done this before (accidentally!) and it worked OK. SSH to the node it created and grab the rancher-system-agent logs.

creamy-autumn-87994

05/10/2023, 5:02 PM
Hm. Yeah, it looks like an image issue. I have one from 3/27 that works fine, and the same one rebuilt on 5/04 is not working. I just noticed my /etc/passwd looks different. Old:
docker:x:1028:100::/home/docker:/bin/sh
New:
ubuntu:x:1028:1029:Ubuntu:/home/ubuntu:/bin/bash
How strange. I believe Rancher uses the docker user to SSH in, which would explain the problem, but I have no idea why I'm getting an ubuntu user now.

agreeable-oil-87482

05/10/2023, 5:03 PM
Probably the default user in the cloud image. If the docker user doesn't exist, it hasn't ingested the new cloud-init config.
Has the node's hostname changed?
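For reference, the user data Rancher generates for a vSphere machine typically includes a users section along these lines (a rough sketch; the docker name is the node driver's default SSH user and the key is machine-specific):
#cloud-config
users:
  - name: docker
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-rsa AAAA... # key Rancher generated for this machine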

creamy-autumn-87994

05/10/2023, 5:04 PM
nope, never gets to that point

agreeable-oil-87482

05/10/2023, 5:04 PM
Yeah, if the node name hasn't changed then it hasn't ingested the Rancher-generated cloud-init user data config.
It actually does that before invoking any Rancher-specific activity.
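A quick way to confirm this on the node (standard cloud-init commands and default paths, nothing Rancher-specific):
cloud-init status --long
cat /var/lib/cloud/instance/user-data.txt
If the user data was ingested, that file should contain the Rancher-generated config, including the new hostname.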

creamy-autumn-87994

05/10/2023, 5:06 PM
Does Rancher SSH into the node to start the process?
Or send cloud-init data on provisioning?

agreeable-oil-87482

05/10/2023, 5:07 PM
Not with RKE2. The cloud-init user data config writes and executes a script that fires off the agent install. The cloud-init config does include the new hostname value, though.
But the SSH key is used if you SSH into the node using the Rancher UI.
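Roughly, that user data has this shape (an illustrative sketch only; the actual script path and contents are generated by Rancher):
#cloud-config
hostname: <rancher-generated node name>
write_files:
  - path: /usr/local/bin/install-agent.sh # hypothetical path
    content: ... # bootstrap script that installs and starts rancher-system-agent
runcmd:
  - sh /usr/local/bin/install-agent.sh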

creamy-autumn-87994

05/10/2023, 5:08 PM
Hm. So I wonder if something is off with the new version of cloud-init

agreeable-oil-87482

05/10/2023, 5:08 PM
How did you make your new template?

creamy-autumn-87994

05/10/2023, 5:09 PM
A Jenkins job runs a Packer build. Packer creates it identically to the last image; the only differences are the build time and the OS updates.

agreeable-oil-87482

05/10/2023, 5:09 PM
Did you run cloud-init clean as part of the build?

creamy-autumn-87994

05/10/2023, 5:09 PM
Yup

agreeable-oil-87482

05/10/2023, 5:10 PM
Which OS is it?

creamy-autumn-87994

05/10/2023, 5:10 PM
Ubuntu 20.04.
I had to also remove the contents of a dir to fully reset it.
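For context, a typical reset step in an image build looks something like this (assuming the dir in question is /var/lib/cloud, cloud-init's state directory; the thread doesn't name it):
cloud-init clean --logs
rm -rf /var/lib/cloud/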

creamy-autumn-87994

05/10/2023, 5:12 PM
Oh, there is one difference. Our images were trying to connect to the metadata URL and taking several minutes to boot up, so we added this file:
root@ubuntu-server:/etc/cloud/cloud.cfg.d# cat 99_disable_metadata.cfg
datasource_list: [ None ]

agreeable-oil-87482

05/10/2023, 5:13 PM
Ah. That'll instruct cloud-init not to ingest from the ISO Rancher mounts.

creamy-autumn-87994

05/10/2023, 5:13 PM
dangit!
Thank you for the extra pair of 👀

agreeable-oil-87482

05/10/2023, 5:13 PM
You need to have the NoCloud data source added, IIRC.
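i.e. something like this in place of the file above, so cloud-init still reads the NoCloud ISO Rancher mounts but skips probing network metadata sources (a sketch, assuming the same file path):
root@ubuntu-server:/etc/cloud/cloud.cfg.d# cat 99_disable_metadata.cfg
datasource_list: [ NoCloud, None ]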
No worries!

creamy-autumn-87994

05/10/2023, 5:14 PM
I’ll give that a shot, thank you!!
One more question if you have a second! If we're running Longhorn with local PVs, does the auto-upgrade mess things up with volume replicas? It's unfortunate we can't choose the node order and put the Longhorn node in maintenance first.

agreeable-oil-87482

05/10/2023, 5:20 PM
What do you mean, Longhorn with local PVs?

creamy-autumn-87994

05/10/2023, 5:20 PM
With Longhorn storing the volume replicas on the worker nodes. Updating the template will then replace the worker nodes with new ones.

agreeable-oil-87482

05/10/2023, 5:21 PM
Ah. Provided you have enough replicas and they're spread across worker nodes, sure. You can influence how many worker nodes are upgraded in parallel in the cluster options.
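For example, in the cluster's provisioning YAML (field names from provisioning.cattle.io/v1; the values here are illustrative):
spec:
  rkeConfig:
    upgradeStrategy:
      workerConcurrency: "1"
      workerDrainOptions:
        enabled: true
Enabling drain cordons a node and evicts its pods before it is replaced.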

creamy-autumn-87994

05/10/2023, 5:22 PM
Typically you would go to the Longhorn UI, put the node in maintenance, and offload the volumes first. But with a cluster update, the process is automatic and replaces worker nodes quickly; you don't get a chance to go to the Longhorn UI.

agreeable-oil-87482

05/10/2023, 5:24 PM
Interesting. Can you post that in #longhorn-storage please?

creamy-autumn-87994

05/10/2023, 5:24 PM
Oh, didn’t know about the channel! Thank you!