
bumpy-receptionist-35510

05/08/2023, 2:56 PM
Hi there, we're having trouble getting a vSphere-based k8s cluster (v1.25.9+rke2r1) provisioned by a Rancher v2.7.3 instance. Rancher itself is a fresh deployment on AKS, and it's trying to bring up a cluster on our on-premises vSphere infrastructure. The Rancher installation went without any issues and it imported its "local" cluster just fine. For vSphere we are using the vSphere cloud provider and configured CPI/CSI under the "Add-On Config" tab. At first everything seemed to work fine: it creates VMs in the defined pools and the "Provisioning Log" tab shows messages as expected:
[INFO ] waiting for viable init node
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) rancher-test-pool1-5fd745d759-p6wvj and join url to be available on bootstrap node
But here it gets stuck forever. The "Explore" button has turned blue and can be clicked, and I can see the node details and a few other entities in the cluster. But it won't complete that last step, and the node has the label
plan.upgrade.cattle.io/system-agent-upgrader=7dbfe3bc7aa1f9e217597840afda8c191bdef89403968ed04de99eec
and the other nodes don't do anything either. Right now we have selected the Cilium network plugin, but we ran into the same issue using Calico. We also don't see any obvious error messages anywhere that would point us in the right direction. Any ideas or pointers as to what might be causing this? Thanks
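In case it helps with diagnosis, this is roughly how I've been inspecting the provisioning state from the Rancher "local" (management) cluster; fleet-default is just the default namespace and the machine name is a placeholder:
# run against the Rancher management ("local") cluster, not the downstream one
kubectl -n fleet-default get clusters.provisioning.cattle.io
kubectl -n fleet-default get machines.cluster.x-k8s.io
# the machine's conditions usually say what the controller is still waiting on
kubectl -n fleet-default describe machines.cluster.x-k8s.io <machine-name>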

agreeable-oil-87482

05/08/2023, 5:40 PM
Have all the pods spun up on the node? Any restarting or in a crash loop?

bumpy-receptionist-35510

05/08/2023, 6:02 PM
Doesn't look like it:
# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A
NAMESPACE             NAME                                                        READY   STATUS      RESTARTS     AGE
cattle-fleet-system   fleet-agent-74784f4cf6-27cfs                                1/1     Running     0            8h
cattle-system         cattle-cluster-agent-6bf7d49647-gv4f7                       1/1     Running     0            8h
cattle-system         rancher-webhook-74f44ffdb9-drmvg                            1/1     Running     0            8h
cattle-system         system-upgrade-controller-64f5b6857-hstq7                   1/1     Running     0            8h
kube-system           cilium-lv5fc                                                1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-fjjmv                            1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
kube-system           etcd-rancher-test-pool1-17288e44-g2r9w                      1/1     Running     0            9h
kube-system           helm-install-rancher-vsphere-cpi-qcjlc                      0/1     Completed   0            9h
kube-system           helm-install-rancher-vsphere-csi-sfhqq                      0/1     Completed   0            9h
kube-system           helm-install-rke2-cilium-rlwjv                              0/1     Completed   0            9h
kube-system           helm-install-rke2-coredns-jhnnd                             0/1     Completed   0            9h
kube-system           helm-install-rke2-ingress-nginx-wwzz4                       0/1     Completed   0            9h
kube-system           helm-install-rke2-metrics-server-tv7nt                      0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-crd-67hxt             0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-xznk6                 0/1     Completed   1            9h
kube-system           helm-install-rke2-snapshot-validation-webhook-tvb49         0/1     Completed   0            9h
kube-system           kube-apiserver-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           kube-controller-manager-rancher-test-pool1-17288e44-g2r9w   1/1     Running     0            9h
kube-system           kube-proxy-rancher-test-pool1-17288e44-g2r9w                1/1     Running     0            9h
kube-system           kube-scheduler-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           rancher-vsphere-cpi-cloud-controller-manager-g7l9q          1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-6b9548f79f-pcntr                  1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-autoscaler-57647bc7cf-92hq2       1/1     Running     0            9h
kube-system           rke2-ingress-nginx-controller-ttfpx                         1/1     Running     0            8h
kube-system           rke2-metrics-server-7d58bbc9c6-cpvcn                        1/1     Running     0            8h
kube-system           rke2-snapshot-controller-7b5b4f946c-fm5rg                   1/1     Running     0            8h
kube-system           rke2-snapshot-validation-webhook-7748dbf6ff-7lbjz           1/1     Running     0            8h
kube-system           vsphere-csi-controller-674d956d9c-54w75                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-brg2w                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-m88qg                     6/6     Running     0            9h
kube-system           vsphere-csi-node-xfr6l                                      3/3     Running     2 (8h ago)   9h
root@rancher-test-pool1-17288e44-g2r9w:/var/lib/rancher/rke2/agent#
I thought there might be a connection or websocket issue between the node and Rancher, but I followed this KB article and it worked just fine: https://www.suse.com/support/kb/doc/?id=000020189
I am getting
{"name":"ping","data":{}}
as well as other event structures back
BTW: we're using the Ubuntu 22.04 cloud image for node creation. Cloud-init only adds a user with an SSH key, and that works fine.

agreeable-oil-87482

05/09/2023, 8:33 AM
Curious, can you describe this pod to see why it's pending?
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
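For reference, something along these lines should show the scheduling reason in the Events section at the bottom (same kubectl/kubeconfig paths you used above):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system describe pod cilium-operator-588b7fbb5f-p7csw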

bumpy-receptionist-35510

05/09/2023, 8:35 AM
Maybe because affinity rules prevented it from deploying two instances to the same node? I scaled it down from 2 to 1 yesterday and everything seemed fine after that.

agreeable-oil-87482

05/09/2023, 8:35 AM
Is this a single node, all roles cluster?

bumpy-receptionist-35510

05/09/2023, 8:35 AM
Yep
That was for testing. We had 3 pools with 1 node each, all roles.
Rancher did create all 3 nodes in vSphere, but it seems only the one that comes up first is the one that is supposed to bootstrap the cluster.

agreeable-oil-87482

05/09/2023, 8:37 AM
Yeah it'll fire up the first control plane node before adding the workers

bumpy-receptionist-35510

05/09/2023, 8:38 AM
Can you maybe tell me what exactly it is waiting for when this message shows up in the Rancher UI:
non-ready bootstrap machine(s) rancher-test-pool1-5fd745d759-p6wvj and join url to be available on bootstrap node
Maybe I can then look at why it isn't happening.
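For example, would the system-agent and rke2-server journals on the bootstrap node be the right place to look? Something like:
# on the stuck bootstrap node
journalctl -u rancher-system-agent --since "1 hour ago"
journalctl -u rke2-server --since "1 hour ago"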

agreeable-oil-87482

05/09/2023, 8:38 AM
Can you grab the logs from the cluster agent please?
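Something like this should pull them (pod name taken from your listing above):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs cattle-cluster-agent-6bf7d49647-gv4f7 > cluster-agent.log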

bumpy-receptionist-35510

05/09/2023, 8:39 AM
Sure, let me grab the logs for you
Unbenannt.txt

agreeable-oil-87482

05/09/2023, 9:00 AM
Line 165 suggests a websocket issue. As a test, could you fire up a Rancher instance inside your vSphere environment (standalone Docker will do) and try to create a cluster from there?
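Something like this should do for the throwaway instance (the image tag just mirrors the Rancher version you're already on):
docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher:v2.7.3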

bumpy-receptionist-35510

05/09/2023, 9:01 AM
Ok, can do. Not sure if I will be able to do so today, but I'll post results in this thread once I'm done.
@agreeable-oil-87482 so when I create a Rancher Docker instance on the same vSphere server and in the same subnet, it works fine and creates a vSphere cluster without issues πŸ˜•
So does that mean there is an issue with the Rancher cluster installation, maybe nginx-ingress?

agreeable-oil-87482

05/09/2023, 1:03 PM
I would say at this stage assume it's a connectivity problem between your on-prem and AKS environments. Something blocking websocket connections, perhaps?

bumpy-receptionist-35510

05/09/2023, 1:04 PM
The odd thing is I already went through this procedure to make sure it is NOT websockets: https://www.suse.com/support/kb/doc/?id=000020189
And I was able to get the output mentioned in the KB article from the vSphere node

agreeable-oil-87482

05/09/2023, 1:30 PM
Try running that from within a K8s Pod inside your on prem cluster.

bumpy-receptionist-35510

05/09/2023, 1:30 PM
ok
It works fine 😞 I created a netshoot pod like this
root@rancher-test-pool1-17288e44-g2r9w:~# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml run tmp-shell --rm -i --tty --image nicolaka/netshoot
on the cluster node that is stuck in provisioning. Then I performed the KB article procedure from within the pod:
curl -s -i -N \
  --http1.1 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Host: $FQDN" \
  -k https://$FQDN/v3/subscribe
HTTP/1.1 101 Switching Protocols
Date: Tue, 09 May 2023 13:34:22 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: qGEgH3En71di5rrssAZTmtRTyFk=
Strict-Transport-Security: max-age=15724800; includeSubDomains

β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}
So websockets seem to work from both the node network and the pod network. Maybe it isn't that? If you look at the log file from the cluster-agent pod, the "Error during subscribe websocket: close sent" message comes ~20 minutes after the last "Watching metadata for ...." message. So maybe it's not related after all?

agreeable-oil-87482

05/09/2023, 6:44 PM
Yeah, it might not be websocket related then. Any proxy/firewalls between your on-prem and AKS environments? How is connectivity established between them, e.g. VPC peering?

bumpy-receptionist-35510

05/10/2023, 7:36 AM
Oh boy, it was actually my fault. When I provisioned the Rancher instance using the helm chart and a values file I accidentally used
tls.source
instead of
ingress.tls.source
πŸ€¦β€β™‚οΈ Apologies for wasting your time with this. But thank you so much for trying to help. It was so wired that the UI and everything worked just fine so I didn't expect something to be wrong with the Rancher install itself.

agreeable-oil-87482

05/10/2023, 7:38 AM
Thanks for the update. Interesting find. Would have expected the cluster agent to complain about that but evidently not. Glad you got it sorted

bumpy-receptionist-35510

05/10/2023, 7:39 AM
Yeah, when everything seemed to be fine with the cluster creation I revisited every config line and noticed that the tls element was two spaces too far to the left and thus wasn't a child of the ingress element... 😞
But I do have an actual question about the vSphere provisioning: how can I set the HW compatibility version of the created nodes? I tried setting the config parameter "virtualHW.version" to "19", but the nodes all end up with HW version 10.

agreeable-oil-87482

05/10/2023, 7:42 AM
It's inherited from the version set by the referenced template

bumpy-receptionist-35510

05/10/2023, 7:43 AM
So that info would be part of the Ubuntu cloud image OVA file?

agreeable-oil-87482

05/10/2023, 8:03 AM
Yeah, the OVA has the HW version set.
Currently we just clone and add NICs to nodes created from templates.
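If you do want a newer HW version, one option is to bump it on the template itself with VMware's govc CLI before provisioning; a rough sketch (the template name is a placeholder, and GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD need to point at your vCenter):
# templates can't be reconfigured directly: convert to a VM, upgrade, then mark as template again
govc vm.markasvm ubuntu-2204-cloudimg-template     # may need -pool/-host depending on your setup
govc vm.upgrade -version=19 ubuntu-2204-cloudimg-template
govc vm.markastemplate ubuntu-2204-cloudimg-template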

bumpy-receptionist-35510

05/10/2023, 8:05 AM
Ok, I'll see if I can figure out how to modify that then. Thanks