# vsphere
a
Have all the pods spun up on the node? Any restarting or in a crash loop?
b
Doesn't look like it:
# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A
NAMESPACE             NAME                                                        READY   STATUS      RESTARTS     AGE
cattle-fleet-system   fleet-agent-74784f4cf6-27cfs                                1/1     Running     0            8h
cattle-system         cattle-cluster-agent-6bf7d49647-gv4f7                       1/1     Running     0            8h
cattle-system         rancher-webhook-74f44ffdb9-drmvg                            1/1     Running     0            8h
cattle-system         system-upgrade-controller-64f5b6857-hstq7                   1/1     Running     0            8h
kube-system           cilium-lv5fc                                                1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-fjjmv                            1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
kube-system           etcd-rancher-test-pool1-17288e44-g2r9w                      1/1     Running     0            9h
kube-system           helm-install-rancher-vsphere-cpi-qcjlc                      0/1     Completed   0            9h
kube-system           helm-install-rancher-vsphere-csi-sfhqq                      0/1     Completed   0            9h
kube-system           helm-install-rke2-cilium-rlwjv                              0/1     Completed   0            9h
kube-system           helm-install-rke2-coredns-jhnnd                             0/1     Completed   0            9h
kube-system           helm-install-rke2-ingress-nginx-wwzz4                       0/1     Completed   0            9h
kube-system           helm-install-rke2-metrics-server-tv7nt                      0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-crd-67hxt             0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-xznk6                 0/1     Completed   1            9h
kube-system           helm-install-rke2-snapshot-validation-webhook-tvb49         0/1     Completed   0            9h
kube-system           kube-apiserver-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           kube-controller-manager-rancher-test-pool1-17288e44-g2r9w   1/1     Running     0            9h
kube-system           kube-proxy-rancher-test-pool1-17288e44-g2r9w                1/1     Running     0            9h
kube-system           kube-scheduler-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           rancher-vsphere-cpi-cloud-controller-manager-g7l9q          1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-6b9548f79f-pcntr                  1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-autoscaler-57647bc7cf-92hq2       1/1     Running     0            9h
kube-system           rke2-ingress-nginx-controller-ttfpx                         1/1     Running     0            8h
kube-system           rke2-metrics-server-7d58bbc9c6-cpvcn                        1/1     Running     0            8h
kube-system           rke2-snapshot-controller-7b5b4f946c-fm5rg                   1/1     Running     0            8h
kube-system           rke2-snapshot-validation-webhook-7748dbf6ff-7lbjz           1/1     Running     0            8h
kube-system           vsphere-csi-controller-674d956d9c-54w75                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-brg2w                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-m88qg                     6/6     Running     0            9h
kube-system           vsphere-csi-node-xfr6l                                      3/3     Running     2 (8h ago)   9h
root@rancher-test-pool1-17288e44-g2r9w:/var/lib/rancher/rke2/agent#
I thought there might be a connection or websocket issue between the node and Rancher, but I followed this KB article and it worked just fine: https://www.suse.com/support/kb/doc/?id=000020189
I am getting
{"name":"ping","data":{}}
as well as other event structures back
Btw: using the Ubuntu 22.04 cloud image for node creation. Cloud-init only adds a user with an SSH key, and that works OK.
a
Curious, can you describe this pod to see why it's pending?
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
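Something along these lines (same kubectl binary and kubeconfig as in the output above) should show the scheduler's reasoning in the Events section:
```
# Describe the Pending cilium-operator pod; the Events section at the bottom
# usually states why the scheduler cannot place it (e.g. affinity/anti-affinity).
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  describe pod -n kube-system cilium-operator-588b7fbb5f-p7csw
```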
b
Maybe because anti-affinity rules prevented two instances from being scheduled on the same node? I scaled it down from 2 to 1 replicas yesterday and everything seemed fine after that.
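If that is the cause, it should be visible on the Deployment itself; a quick check (a sketch, assuming the operator Deployment is named cilium-operator in kube-system):
```
# Show the desired replica count and any podAntiAffinity rule on the operator.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system get deploy cilium-operator -o jsonpath='{.spec.replicas}{"\n"}'
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system get deploy cilium-operator -o yaml | grep -A10 podAntiAffinity
```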
a
Is this a single node, all roles cluster?
b
Yep
Was for testing. We had 3 pools with 1 node in each with all roles
Rancher did create all 3 nodes in vSphere, but it seems only the one that comes up first is supposed to bootstrap the cluster.
a
Yeah it'll fire up the first control plane node before adding the workers
b
Can you maybe tell me what exactly it is waiting for when this message shows up in the Rancher UI:
non-ready bootstrap machine(s) rancher-test-pool1-5fd745d759-p6wvj and join url to be available on bootstrap node
Maybe I can then look at why that isn't happening.
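For reference, the provisioning objects on the Rancher management ("local") cluster usually spell out what that message is waiting on; a sketch, assuming kubectl access to the local cluster (the machine name is taken from the message above):
```
# Machine and cluster conditions for downstream clusters live in fleet-default.
kubectl -n fleet-default get clusters.provisioning.cattle.io
kubectl -n fleet-default get machines.cluster.x-k8s.io
# The conditions on the stuck machine often name the exact blocker.
kubectl -n fleet-default describe machines.cluster.x-k8s.io rancher-test-pool1-5fd745d759-p6wvj
```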
a
Can you grab the logs from the cluster agent please?
b
Sure, let me grab the logs for you
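On the downstream node that would be something like:
```
# Tail the cattle-cluster-agent logs (with timestamps) from the stuck node.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n cattle-system logs deploy/cattle-cluster-agent --timestamps --tail=200
```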
a
Line 165 suggests a websocket issue. As a test, could you fire up a Rancher instance inside your vSphere environment (standalone Docker will do) and try to create a cluster from there?
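A throwaway single-node Docker install is enough for that test; something along these lines (the version tag is just an example):
```
# Standalone Rancher for the connectivity test; --privileged is required for the
# embedded k3s in recent Rancher releases.
docker run -d --name rancher-test --restart=unless-stopped \
  -p 80:80 -p 443:443 --privileged rancher/rancher:v2.7.3
```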
b
Ok, can do. Not sure if I will be able to do so today but I'll post results into this thread once I am done.
@agreeable-oil-87482 so when I create a Rancher Docker instance on the same vSphere server and in the same subnet, it works fine and creates a vSphere cluster without issues πŸ˜•
So does that mean there is an issue with the Rancher cluster installation, maybe nginx-ingress?
a
I would say at this stage, assume it's a connectivity problem between your on-prem and AWS environments. Something blocking websocket connections, perhaps?
b
The odd thing is I already went through this procedure to make sure it is NOT websockets: https://www.suse.com/support/kb/doc/?id=000020189
And I was able to get the output mentioned in the kb article from the vsphere node
a
Try running that from within a K8s Pod inside your on prem cluster.
b
ok
It works fine 😞 I created a netshoot pod like this:
root@rancher-test-pool1-17288e44-g2r9w:~# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml run tmp-shell --rm -i --tty --image nicolaka/netshoot
on the cluster node that is stuck in provisioning, then performed the KB article procedure from within the pod:
curl -s -i -N \
  --http1.1 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Host: $FQDN" \
  -k https://$FQDN/v3/subscribe
HTTP/1.1 101 Switching Protocols
Date: Tue, 09 May 2023 13:34:22 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: qGEgH3En71di5rrssAZTmtRTyFk=
Strict-Transport-Security: max-age=15724800; includeSubDomains

β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}
So websockets seem to work both from the node network and from the pod network. Maybe it isn't that? If you look at the log file from the cluster-agent pod, the "Error during subscribe websocket: close sent" message comes ~20 minutes after the last "Watching metadata for ...." message. So maybe it's not related after all?
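One way to line up those timestamps (same log command as before, just filtered):
```
# Pull the agent log with timestamps and keep only the two messages in question.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n cattle-system logs deploy/cattle-cluster-agent --timestamps \
  | grep -E 'Watching metadata|Error during subscribe'
```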
a
Yeah, it may well not be websocket related. Any proxies/firewalls between your on-prem and AWS environments? How is connectivity established between them, i.e. VPC peering?
b
Oh boy, it was actually my fault. When I provisioned the Rancher instance using the Helm chart and a values file, I accidentally used
tls.source
instead of
ingress.tls.source
πŸ€¦β€β™‚οΈ Apologies for wasting your time with this, but thank you so much for trying to help. It was so weird that the UI and everything worked just fine, so I didn't expect something to be wrong with the Rancher install itself.
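For anyone hitting the same thing, the nesting in the Rancher chart values matters; a minimal sketch of the correct layout (the hostname and the "secret" source are placeholder values):
```
# Correct: tls is a child of ingress, so the chart reads ingress.tls.source.
cat > values.yaml <<'EOF'
hostname: rancher.example.com
ingress:
  tls:
    source: secret
EOF

# What happened here: tls sat at the top level, so ingress.tls.source kept its
# default and the intended certificate setup never took effect.
#   tls:
#     source: secret
```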
a
Thanks for the update. Interesting find. Would have expected the cluster agent to complain about that but evidently not. Glad you got it sorted
b
Yeah, once the cluster creation from the Docker instance turned out fine, I revisited every config line and noticed that the tls element was two spaces too far to the left and thus wasn't a child of the ingress element... 😞
But I do have an actual question about the vSphere provisioning: how can I set the HW compatibility version of the created nodes? I tried to set the config parameter "virtualHW.version" to "19", but the nodes all come up with HW version 10.
a
It's inherited from the version set by the referenced template
b
So that info would be part of the Ubuntu cloud image OVA file?
a
Yeah, the OVA has the HW version set.
Currently, we just clone and add NICs to nodes created from templates.
b
Ok, I'll see if I can figure out how to modify that then. Thanks
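One possible approach, assuming govc is available (the inventory path below is a placeholder): convert the template to a VM, bump the hardware version, then convert it back.
```
# Bump the virtual hardware version on the source image before Rancher clones it.
# If it is stored as a template, convert it to a VM first (and back afterwards);
# vm.upgrade needs a powered-off VM.
govc vm.upgrade -version=19 -vm /Datacenter/vm/ubuntu-2204-cloudimg
```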