e

early-engineer-43393

11/29/2022, 11:18 AM
Hello, we are trying to deploy an RKE2 vSphere cluster. When we click Create Cluster we are immediately hit with the following error:
waiting: waiting for viable init node
Has anyone seen this before? We are not even sure how to troubleshoot, as no VM has been spun up to investigate and there is no other output in the logs. Thanks
We are using Rancher release 2.6.8 and are trying to deploy a k8s 1.22 cluster.
We have got past this issue, but now we get “Waiting for agent to check in and apply initial plan”. When we go onto the node we can’t see any containers, which is strange.
c

creamy-pencil-82913

11/29/2022, 6:38 PM
what kind of containers are you looking for?
have you checked the rancher-system-agent and rke2-server logs in journald?
e

early-engineer-43393

11/29/2022, 8:11 PM
I would have expected to see a cluster agent image or something? When I say container I mean there are no Docker images on the node.
There was also no cluster agent service
I had to manually run the install.sh script for the service to get created
c

creamy-pencil-82913

11/29/2022, 8:54 PM
when you provision rke2 nodes from rancher, it installs rancher-system-agent as a systemd unit. That in turn installs RKE2 and manages the configuration on that node. Once RKE2 is up, cattle-cluster-agent is deployed to the cluster and checks in with the Rancher management cluster.
So, if things are working, you should see systemd units for both rancher-system-agent and rke2-server
If not, check the journald logs for those same units
RKE2 doesn’t use Docker. You shouldn’t have Docker installed on the node, and you will never see any Docker images on the node nor Docker containers running.
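A minimal sketch of that check, meant to be run as root on the provisioned node. The helper name check_units is made up for illustration; `systemctl is-active` and `journalctl -u` are standard systemd commands.

```shell
# check_units is a hypothetical helper name, not part of Rancher or RKE2.
# For each unit, print whether systemd reports it active; for units that
# are not active, also print the last 20 journal lines (if there are any).
check_units() {
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: not active"
      journalctl -u "$unit" --no-pager -n 20
    fi
  done
}
```

On a server node this would be called as `check_units rancher-system-agent rke2-server`. If journalctl reports no entries for a unit, the unit has most likely never been started, which points back at whatever was supposed to install it.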
e

early-engineer-43393

11/29/2022, 8:57 PM
Thanks @creamy-pencil-82913 we are using zypper to install containerd as part of our VMware template, is this something we should still do?
c

creamy-pencil-82913

11/29/2022, 8:58 PM
no. RKE2 bundles its own version of containerd that has been modified to support our registry mirror rewrites.
You shouldn’t have containerd, docker, cri-o, or anything else already installed and running.
just a bare node.
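One rough way to confirm a template is bare in this sense is to check whether any container runtime binaries are on the PATH. This is only a sketch: find_installed is a made-up helper name, and a binary installed outside the PATH would not be detected.

```shell
# find_installed is a hypothetical helper name used for illustration.
# Prints each of the given command names that resolves on the PATH.
find_installed() {
  for bin in "$@"; do
    if command -v "$bin" >/dev/null 2>&1; then
      echo "$bin"
    fi
  done
}
```

Running `find_installed docker containerd crio` on the template should print nothing if no runtime has been preinstalled.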
e

early-engineer-43393

11/29/2022, 9:43 PM
Thank you, no cloud init either?
c

creamy-pencil-82913

11/29/2022, 10:03 PM
Well, reasonably bare. Nothing beyond the normal packages that you would need to bootstrap a node on that infra provider. That includes cloud-init on most platforms.
s

square-policeman-85866

11/29/2022, 10:08 PM
In my cluster for rke2 rancher-system-agent is not installed.
c

creamy-pencil-82913

11/29/2022, 10:10 PM
in your cluster, or on your node?
Did you provision the cluster via Rancher?
s

square-policeman-85866

11/29/2022, 10:11 PM
I ssh onto the node and it doesn’t exist. Yeah provisioned via Rancher
Now stuck in “waiting for agent to check in and apply initial plan”
c

creamy-pencil-82913

11/29/2022, 10:12 PM
Did you do a custom cluster (which gives you a command to run on existing nodes), or try to provision via one of the infrastructure providers?
s

square-policeman-85866

11/29/2022, 10:14 PM
In the Rancher UI: Create Cluster, switched RKE1 to RKE2, selected VMware vSphere, filled in the machine pools info etc. and created.
c

creamy-pencil-82913

11/29/2022, 10:14 PM
Check the cloud-init logs and see if there’s anything in there about why the agent failed to install
s

square-policeman-85866

11/29/2022, 10:17 PM
Just seeing a bunch of DEBUG messages. Nothing relating to an error
c

creamy-pencil-82913

11/29/2022, 10:19 PM
do you see it attempting to install any rancher-related components, or download and run a script from your Rancher server?
s

square-policeman-85866

11/29/2022, 10:19 PM
I ran cat cloud-init.log | grep rancher and got nothing
c

creamy-pencil-82913

11/29/2022, 10:20 PM
what does the cloud-init userdata script provided to this node look like?
s

square-policeman-85866

11/29/2022, 10:23 PM
How would I check that? Not sure we set anything related to cloud-init user data. In the template we created we ran zypper install cloud-init. We also installed clone-master-clean-up and ran it. In the Rancher UI we didn’t set anything related to cloud-init.
c

creamy-pencil-82913

11/29/2022, 10:23 PM
injecting userdata is how Rancher tells the node to install things once it comes up
s

square-policeman-85866

11/29/2022, 10:24 PM
How would I do/set that?
c

creamy-pencil-82913

11/29/2022, 10:24 PM
it is done for you by Rancher when you ask it to provision nodes
s

square-policeman-85866

11/29/2022, 10:24 PM
Ah ok. How can I check what it is?
s

square-policeman-85866

11/29/2022, 10:29 PM
Is it in the /etc/cloud/cloud.cfg file? Sorry, bit of a newbie here
c

creamy-pencil-82913

11/29/2022, 10:29 PM
instance
The /var/lib/cloud/instance directory is a symbolic link that points to the most recently used instance-id directory. This folder contains the information cloud-init received from datasources, including vendor and user data. This can be helpful to review to ensure the correct data was passed.
it’s the paragraph I linked directly to
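A small sketch for dumping the user data the node actually received. The helper name show_userdata is an assumption for illustration; the file name user-data.txt under /var/lib/cloud/instance is where cloud-init normally stores it, and reading it usually requires root.

```shell
# show_userdata is a hypothetical helper name used for illustration.
# Prints the user data cloud-init received for the current instance.
# An alternative instance directory can be passed as the first argument.
show_userdata() {
  dir="${1:-/var/lib/cloud/instance}"
  if [ -r "$dir/user-data.txt" ]; then
    cat "$dir/user-data.txt"
  else
    echo "no readable user-data.txt in $dir" >&2
    return 1
  fi
}
```

For a Rancher-provisioned node the output should start with #cloud-config and include the install script plus a runcmd entry that runs it.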
s

square-policeman-85866

11/29/2022, 10:30 PM
Ah ok cool yes in there now
Starts with #cloud-config. Got a bunch of stuff in it…
c

creamy-pencil-82913

11/29/2022, 10:39 PM
find anything interesting?
oh I see the DM.
ok so what’s that install script have in it?
just follow the crumbs here, figure out where something’s not working
s

square-policeman-85866

11/29/2022, 10:42 PM
The script is huge! :)
Maybe seems like that script isn’t getting run?
When I run that script manually it now created the rancher-system-agent….
Which gives a bunch of errors loading certs
c

creamy-pencil-82913

11/29/2022, 10:59 PM
the userdata showed that it was supposed to be run
you’d have to look at the cloud-init logs to figure out why it didn’t
but yeah, that’s what you should see until the cluster comes up
you can now check the rke2-server logs, since that should be in the process of starting
s

square-policeman-85866

11/29/2022, 11:00 PM
Ok cool will look into that. Thanks for the help
Got a path to the rke2-server logs?
c

creamy-pencil-82913

11/29/2022, 11:00 PM
in journald, same as the agent…
s

square-policeman-85866

11/29/2022, 11:01 PM
journalctl -u rke2-server?
c

creamy-pencil-82913

11/29/2022, 11:01 PM
yeah should be
s

square-policeman-85866

11/29/2022, 11:02 PM
Just says no entries
c

creamy-pencil-82913

11/29/2022, 11:02 PM
look at the rancher-system-agent logs then, see if it logged an error running the rke2 install
s

square-policeman-85866

11/30/2022, 10:13 AM
Would there be any reason the install.sh script isn’t executable? We are seeing that it never gets run. When we ssh onto the node we need to run chmod +x install.sh and then we can run the script.
c

creamy-pencil-82913

11/30/2022, 5:14 PM
that sounds like an issue with cloud-init or something on your nodes. What are you using for the base image?
We test all provisioning stuff as part of every release, on all the supported operating systems, so I know it works.
s

square-policeman-85866

11/30/2022, 5:39 PM
We are using sles15sp4
e

early-engineer-43393

11/30/2022, 6:14 PM
We were told to use SLE-15-SP4-Full-x86_64-GM-Media1.iso, however we have also been made aware of a more container-friendly ISO? The version of cloud-init we are using is 21.4, installed with: zypper install -y cloud-init-21.4-150100.8.58.1
s

square-policeman-85866

11/30/2022, 6:17 PM
The cloud init logs say changing ownership of /usr/local/custom_script/install.sh to 0:0
c

creamy-pencil-82913

11/30/2022, 6:41 PM
ownership is fine, I’m just curious why it’s not being executed at all
does it show it attempting to run it?
e

early-engineer-43393

11/30/2022, 6:42 PM
Is there a log we could check to find that? The journalctl log has no entries, which I’m assuming is because the install script hasn’t been run to create the service etc.
c

creamy-pencil-82913

11/30/2022, 6:44 PM
it would all be in the cloud-init log, since that’s what should be responsible for executing it
your screenshot showed that the script was present in the cloud-init userdata, right?
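A sketch of how to look for that in the log. cloud-init's runcmd module only writes the commands out as a script; a separate module, scripts-user, executes it during the final stage, so both names are worth searching for. The helper name find_script_activity is made up, and the default path /var/log/cloud-init.log is the usual location but may differ on a customized image.

```shell
# find_script_activity is a hypothetical helper name used for illustration.
# Prints log lines that mention the runcmd or scripts-user modules.
# The module name is matched with either a hyphen or an underscore,
# since the spelling varies between cloud-init versions.
find_script_activity() {
  log="${1:-/var/log/cloud-init.log}"
  grep -E 'runcmd|scripts[-_]user' "$log"
}
```

If the log shows the script being written but nothing from scripts-user, the final stage probably never ran. Anything the script itself printed would normally land in /var/log/cloud-init-output.log.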
s

square-policeman-85866

11/30/2022, 6:53 PM
It doesn’t show any attempt to run it. It just says it writes the script and changes the permissions. Yeah, in the user data file there is a setting path: /usr/local/custom_script/install.sh and a runcmd entry - sh /usr/local/custom_script/install.sh
e

early-engineer-43393

11/30/2022, 6:59 PM
In user-data.txt we can see it has “groups - staff”. When checking this group on the instance, it shows “staff1000:docker”. Does that look accurate? It looks like this is the user being created to run the script, but we are unsure why it doesn’t run.
c

creamy-pencil-82913

11/30/2022, 7:01 PM
I would expect it to get run as root, not as staff but I’m not sure
e

early-engineer-43393

11/30/2022, 7:01 PM
Interesting, that’s what was in the user data and we haven’t overwritten anything 🙁
Would you be available for a 30 min zoom session tomorrow?
c

creamy-pencil-82913

11/30/2022, 7:21 PM
no, sorry. That’s beyond what I’m allowed to do for community support. This is best-effort slack chat only 😉
also I’m a K3s/RKE2 engineer, if you were on with support they’d probably hook you up with someone from the Rancher team since that’s where all the provisioning stuff is handled.
e

early-engineer-43393

11/30/2022, 7:24 PM
No worries, we already have a support ticket open (we pay for Rancher/SUSE support)
c

creamy-pencil-82913

11/30/2022, 7:28 PM
oh. then you should be working with them, they’re going to be way more hands-on than I am 😉
e

early-engineer-43393

11/30/2022, 7:45 PM
Your support has been pretty good 😂😂 thank you 😁
s

square-policeman-85866

12/02/2022, 4:16 PM
Looks like we figured out why the install script wasn’t running. We were only enabling the cloud-init service. If we also enable cloud-init-local, cloud-config, and cloud-final, then it kicks off the install script.
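For anyone building a similar template: cloud-init runs in stages, each backed by its own systemd unit, and runcmd scripts are executed in the last one (cloud-final). A minimal sketch for spotting stages that are not enabled follows. The helper name disabled_cloud_init_units is made up, and the unit names are the ones mentioned in this thread for cloud-init 21.4; other versions or distributions may name them differently.

```shell
# Unit names as used in this thread; treat them as an assumption on other
# cloud-init versions or distributions.
CLOUD_INIT_UNITS="cloud-init-local cloud-init cloud-config cloud-final"

# disabled_cloud_init_units is a hypothetical helper name.
# Prints every cloud-init stage unit that systemd does not report as enabled.
disabled_cloud_init_units() {
  for unit in $CLOUD_INIT_UNITS; do
    if ! systemctl is-enabled --quiet "$unit"; then
      echo "$unit"
    fi
  done
}

# In the template, all four stages would then be enabled with:
#   systemctl enable cloud-init-local cloud-init cloud-config cloud-final
```

An empty result from `disabled_cloud_init_units` means all four stages are enabled.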
c

creamy-pencil-82913

12/02/2022, 5:20 PM
that’d do it!