
bumpy-receptionist-35510

05/08/2023, 2:56 PM
Hi there, we're having trouble getting a vSphere-based k8s cluster (v1.25.9+rke2r1) provisioned by a Rancher v2.7.3 instance. Rancher itself is a fresh deployment on AKS, and it's trying to bring up a cluster on our on-premises vSphere infrastructure. The Rancher installation went without any issues and it imported its "local" cluster just fine. For vSphere we are using the vSphere cloud provider and configured CPI/CSI under the "Add-On Config" tab. At first everything seemed to work fine: it creates VMs in the defined pools and the "Provisioning Log" tab shows messages as expected:
[INFO ] waiting for viable init node
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for agent to check in and apply initial plan
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: etcd, kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for probes: kube-controller-manager, kube-scheduler
[INFO ] configuring bootstrap node(s) rancher-test-pool1-5fd745d759-p6wvj: waiting for cluster agent to connect
[INFO ] non-ready bootstrap machine(s) rancher-test-pool1-5fd745d759-p6wvj and join url to be available on bootstrap node
But here it gets stuck forever. The "Explore" button has turned blue and can be clicked, and I can see the node details and a few other entities in the cluster. But it won't complete that last step, and the node has the label
plan.upgrade.cattle.io/system-agent-upgrader=7dbfe3bc7aa1f9e217597840afda8c191bdef89403968ed04de99eec
and the other nodes don't do anything either. Right now we have selected the Cilium network plugin, but we ran into the same issue using Calico. We also don't see any obvious error messages anywhere that would point us in the right direction. Any ideas or pointers as to what might be causing this? Thanks
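In case it helps with diagnosis, this is roughly how I've been inspecting the provisioning state from the Rancher "local" (management) cluster; fleet-default is just the default namespace and the machine name is a placeholder:
# run against the Rancher management ("local") cluster, not the downstream one
kubectl -n fleet-default get clusters.provisioning.cattle.io
kubectl -n fleet-default get machines.cluster.x-k8s.io
# the machine's conditions usually say what the controller is still waiting on
kubectl -n fleet-default describe machines.cluster.x-k8s.io <machine-name>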

agreeable-oil-87482

05/08/2023, 5:40 PM
Have all the pods spun up on the node? Any restarting or in a crash loop?

bumpy-receptionist-35510

05/08/2023, 6:02 PM
Doesn't look like it:
# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A
NAMESPACE             NAME                                                        READY   STATUS      RESTARTS     AGE
cattle-fleet-system   fleet-agent-74784f4cf6-27cfs                                1/1     Running     0            8h
cattle-system         cattle-cluster-agent-6bf7d49647-gv4f7                       1/1     Running     0            8h
cattle-system         rancher-webhook-74f44ffdb9-drmvg                            1/1     Running     0            8h
cattle-system         system-upgrade-controller-64f5b6857-hstq7                   1/1     Running     0            8h
kube-system           cilium-lv5fc                                                1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-fjjmv                            1/1     Running     0            9h
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
kube-system           etcd-rancher-test-pool1-17288e44-g2r9w                      1/1     Running     0            9h
kube-system           helm-install-rancher-vsphere-cpi-qcjlc                      0/1     Completed   0            9h
kube-system           helm-install-rancher-vsphere-csi-sfhqq                      0/1     Completed   0            9h
kube-system           helm-install-rke2-cilium-rlwjv                              0/1     Completed   0            9h
kube-system           helm-install-rke2-coredns-jhnnd                             0/1     Completed   0            9h
kube-system           helm-install-rke2-ingress-nginx-wwzz4                       0/1     Completed   0            9h
kube-system           helm-install-rke2-metrics-server-tv7nt                      0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-crd-67hxt             0/1     Completed   0            9h
kube-system           helm-install-rke2-snapshot-controller-xznk6                 0/1     Completed   1            9h
kube-system           helm-install-rke2-snapshot-validation-webhook-tvb49         0/1     Completed   0            9h
kube-system           kube-apiserver-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           kube-controller-manager-rancher-test-pool1-17288e44-g2r9w   1/1     Running     0            9h
kube-system           kube-proxy-rancher-test-pool1-17288e44-g2r9w                1/1     Running     0            9h
kube-system           kube-scheduler-rancher-test-pool1-17288e44-g2r9w            1/1     Running     0            9h
kube-system           rancher-vsphere-cpi-cloud-controller-manager-g7l9q          1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-6b9548f79f-pcntr                  1/1     Running     0            9h
kube-system           rke2-coredns-rke2-coredns-autoscaler-57647bc7cf-92hq2       1/1     Running     0            9h
kube-system           rke2-ingress-nginx-controller-ttfpx                         1/1     Running     0            8h
kube-system           rke2-metrics-server-7d58bbc9c6-cpvcn                        1/1     Running     0            8h
kube-system           rke2-snapshot-controller-7b5b4f946c-fm5rg                   1/1     Running     0            8h
kube-system           rke2-snapshot-validation-webhook-7748dbf6ff-7lbjz           1/1     Running     0            8h
kube-system           vsphere-csi-controller-674d956d9c-54w75                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-brg2w                     6/6     Running     0            9h
kube-system           vsphere-csi-controller-674d956d9c-m88qg                     6/6     Running     0            9h
kube-system           vsphere-csi-node-xfr6l                                      3/3     Running     2 (8h ago)   9h
root@rancher-test-pool1-17288e44-g2r9w:/var/lib/rancher/rke2/agent#
I thought there might be a connection or websocket issue between the node and Rancher, but I followed this KB article and it worked just fine: https://www.suse.com/support/kb/doc/?id=000020189
I am getting
{"name":"ping","data":{}}
as well as other event structures back
BTW: we're using the Ubuntu 22.04 cloud image for node creation. Cloud-init only adds a user with an SSH key, and that works fine.

agreeable-oil-87482

05/09/2023, 8:33 AM
Curious, can you describe this pod to see why it's pending?
kube-system           cilium-operator-588b7fbb5f-p7csw                            0/1     Pending     0            9h
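For reference, something along these lines should show the scheduling reason in the Events section at the bottom (same kubectl/kubeconfig paths you used above):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system describe pod cilium-operator-588b7fbb5f-p7csw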

bumpy-receptionist-35510

05/09/2023, 8:35 AM
Maybe because affinity rules prevented it from deploying two instances to the same node? I scaled it down from 2 to 1 yesterday and everything seemed fine after that.

agreeable-oil-87482

05/09/2023, 8:35 AM
Is this a single node, all roles cluster?

bumpy-receptionist-35510

05/09/2023, 8:35 AM
Yep
That was for testing. We had 3 pools with 1 node each, all roles.
Rancher did create all 3 nodes in vSphere, but it seems only the one that comes up first is the one that is supposed to bootstrap the cluster.

agreeable-oil-87482

05/09/2023, 8:37 AM
Yeah it'll fire up the first control plane node before adding the workers

bumpy-receptionist-35510

05/09/2023, 8:38 AM
Can you maybe tell me what exactly it is waiting for when this message shows up in the Rancher UI:
non-ready bootstrap machine(s) rancher-test-pool1-5fd745d759-p6wvj and join url to be available on bootstrap node
Maybe I can then look at why it isn't happening.
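For example, would the system-agent and rke2-server journals on the bootstrap node be the right place to look? Something like:
# on the stuck bootstrap node
journalctl -u rancher-system-agent --since "1 hour ago"
journalctl -u rke2-server --since "1 hour ago"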

agreeable-oil-87482

05/09/2023, 8:38 AM
Can you grab the logs from the cluster agent please?
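Something like this should pull them (pod name taken from your listing above):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs cattle-cluster-agent-6bf7d49647-gv4f7 > cluster-agent.log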

bumpy-receptionist-35510

05/09/2023, 8:39 AM
Sure, let me grab the logs for you
Unbenannt.txt

agreeable-oil-87482

05/09/2023, 9:00 AM
Line 165 suggests a websocket issue. As a test, could you fire up a Rancher instance inside your vSphere environment (standalone Docker will do) and try to create a cluster from there?
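Something like this should do for the throwaway instance (the image tag just mirrors the Rancher version you're already on):
docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher:v2.7.3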

bumpy-receptionist-35510

05/09/2023, 9:01 AM
Ok, can do. Not sure if I will be able to do so today, but I'll post results in this thread once I'm done.
@agreeable-oil-87482 so when I create a Rancher Docker instance on the same vSphere server and in the same subnet, it works fine and creates a vSphere cluster without issues πŸ˜•
So does that mean there is an issue with the Rancher cluster installation, maybe nginx-ingress?

agreeable-oil-87482

05/09/2023, 1:03 PM
I would say at this stage assume it's a connectivity problem between your on-prem and AKS environments. Something blocking websocket connections, perhaps?

bumpy-receptionist-35510

05/09/2023, 1:04 PM
The odd thing is I already went through this procedure to make sure it is NOT websockets: https://www.suse.com/support/kb/doc/?id=000020189
And I was able to get the output mentioned in the KB article from the vSphere node

agreeable-oil-87482

05/09/2023, 1:30 PM
Try running that from within a K8s Pod inside your on prem cluster.

bumpy-receptionist-35510

05/09/2023, 1:30 PM
ok
It works fine 😞 I created a netshoot pod like this
root@rancher-test-pool1-17288e44-g2r9w:~# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml run tmp-shell --rm -i --tty --image nicolaka/netshoot
on the cluster node that is stuck in provisioning. Then I performed the KB article procedure from within the pod:
curl -s -i -N \
  --http1.1 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Host: $FQDN" \
  -k https://$FQDN/v3/subscribe
HTTP/1.1 101 Switching Protocols
Date: Tue, 09 May 2023 13:34:22 GMT
Connection: upgrade
Upgrade: websocket
Sec-WebSocket-Accept: qGEgH3En71di5rrssAZTmtRTyFk=
Strict-Transport-Security: max-age=15724800; includeSubDomains

β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}β–’{"name":"ping","data":{}}
So websockets seem to work from both the node network and the pod network. Maybe it isn't that? If you look at the log file from the cluster-agent pod, the "Error during subscribe websocket: close sent" message comes ~20 minutes after the last "Watching metadata for ...." message. So maybe it's not related after all?

agreeable-oil-87482

05/09/2023, 6:44 PM
Yeah, it might not be websocket related then. Any proxy/firewalls between your on-prem and AKS environments? How is connectivity established between them, e.g. VPC peering?

bumpy-receptionist-35510

05/10/2023, 7:36 AM
Oh boy, it was actually my fault. When I provisioned the Rancher instance using the helm chart and a values file I accidentally used
tls.source
instead of
ingress.tls.source
πŸ€¦β€β™‚οΈ Apologies for wasting your time with this. But thank you so much for trying to help. It was so wired that the UI and everything worked just fine so I didn't expect something to be wrong with the Rancher install itself.

agreeable-oil-87482

05/10/2023, 7:38 AM
Thanks for the update. Interesting find. Would have expected the cluster agent to complain about that but evidently not. Glad you got it sorted

bumpy-receptionist-35510

05/10/2023, 7:39 AM
Yeah, when everything seemed to be fine with the cluster creation I revisited every config line and noticed that the tls element was two spaces too far to the left and thus wasn't a child of the ingress element... 😞
But I do have an actual question about the vSphere provisioning: how can I set the HW compatibility version of the created nodes? I tried setting the config parameter "virtualHW.version" to "19", but the nodes all end up with HW version 10.

agreeable-oil-87482

05/10/2023, 7:42 AM
It's inherited from the version set by the referenced template

bumpy-receptionist-35510

05/10/2023, 7:43 AM
So that info would be part of the Ubuntu cloud image OVA file?

agreeable-oil-87482

05/10/2023, 8:03 AM
Yeah, the OVA has the HW version set.
Currently we just clone and add NICs to nodes created from templates.
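If you do want a newer HW version, one option is to bump it on the template itself with VMware's govc CLI before provisioning; a rough sketch (the template name is a placeholder, and GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD need to point at your vCenter):
# templates can't be reconfigured directly: convert to a VM, upgrade, then mark as template again
govc vm.markasvm ubuntu-2204-cloudimg-template     # may need -pool/-host depending on your setup
govc vm.upgrade -version=19 ubuntu-2204-cloudimg-template
govc vm.markastemplate ubuntu-2204-cloudimg-template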

bumpy-receptionist-35510

05/10/2023, 8:05 AM
Ok, I'll see if I can figure out how to modify that then. Thanks