# vsphere
h
We ended up deploying the vSphere CSI and CPI manually, straight from the upstream sources, after heaps of problems with the Rancher helm charts... With this method we have zero issues, and it is easily repeatable across all our cluster deployments.
a
Thanks @hallowed-breakfast-56871 - I have been beating on this issue for the past few days, and I built another cluster to try out the manual vSphere install. I hit issues with the node selectors on the generic vSphere install - did you run into anything like that installing on a cluster built from Rancher, by any chance?
h
To cut another long story short, we also no longer use Rancher to provision clusters on vSphere, basically due to bugs, crashes, and issues connecting to our private cloud. We instead Terraform our RKE2 clusters, then enroll them in Rancher while bootstrapping. I've done this manual CSI and CPI method on 6 clusters now; all work fine, and it actually gave me more granular control over my storage options with vSphere. Please also note that we are not using vSAN for storage.
Also note that you must set up your clusters without a cloud provider.
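For reference, "without a cloud provider" on RKE2 roughly means telling the kubelet that an external provider will initialise the nodes. A minimal sketch from my notes (assuming RKE2's standard config.yaml and its kubelet-arg passthrough - adjust to your setup):
Copy code
# /etc/rancher/rke2/config.yaml, on every node
# Nodes come up with the node.cloudprovider.kubernetes.io/uninitialized
# taint until the CPI initialises them.
kubelet-arg:
  - cloud-provider=external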
a
You have great timing. I just built a new cluster and was trying to install with the helm chart, and it was not having it. I will tear it down and rebuild it without a cloud provider. For clarity's sake, are you doing these steps on your end:
1. Installing the helm chart for the CPI (link #1 above)
2. Installing the CSI per link #2 above
The reason I ask about the order is that, now that I am doing this manually in testing, I came across this: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphe[…]etting-started/GUID-0AB6E692-AA47-4B6A-8CEA-38B754E16567.html That link shows how to install the CPI yet a THIRD way, so I think I will try the order above in a new cluster without a cloud provider, but wanted to verify with you. Thanks Josh!
h
So, just checking my notes / wiki... I do the CPI first, which is a helm chart now looked after by Kubernetes:
https://kubernetes.github.io/cloud-provider-vsphere
Then I do the CSI, which is a deployment pulled from the kubernetes-sigs git repo:
https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.6.0/manifests/vanilla/vsphere-csi-driver.yaml
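In rough outline it looks like this (a sketch only - the angle-bracket values are placeholders for your own vCenter details, and the CSI driver also expects its config secret to exist first, per the driver docs):
Copy code
# 1. CPI via the upstream helm chart
helm repo add vsphere-cpi https://kubernetes.github.io/cloud-provider-vsphere
helm repo update
helm upgrade --install vsphere-cpi vsphere-cpi/vsphere-cpi \
  --namespace kube-system \
  --set config.enabled=true \
  --set config.vcenter=<vcenter-fqdn> \
  --set config.username=<user> \
  --set config.password=<password> \
  --set config.datacenter=<datacenter>

# 2. CSI via the upstream manifest (pull it down and edit first if needed)
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.6.0/manifests/vanilla/vsphere-csi-driver.yaml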
a
thanks... I have the new cluster build running now without any cloud provider, so I will do that first part via the helm link you provided above and let you know how I make out. Appreciate the notes on this - they have been helpful!
h
No worries. I would normally give out my full wiki page, but I'd need to redact a whole bunch. I still can if you get stuck.
a
np....I think you have set me on the right path. I'll hit you up if I get stuck - thank you very much for all the help and insight!
Well, I thought I would have more luck, but after creating a new Rancher cluster (Ubuntu 20, 1 etcd, 1 CP, 3 workers) with no external cloud provider, deploying the helm chart (or doing it manually) creates the daemonset in kube-system, but it's empty and no pods get created - is that expected from what you have tested? I assumed I'd see controller pods running, and for the life of me I am not able to diagnose why this is the outcome. I feel I am missing something, as I know the Rancher CPI would instantiate some controller pods, so I was curious whether this rang any bells from your testing? thx!
h
Is the vsphere-cloud-controller-manager running?
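Quickest check is something like:
Copy code
kubectl get pods -n kube-system | grep -i vsphere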
a
Nope...nothing like that is running - which is exactly what I was expecting to see
It's like the helm chart deployed OK, the DS was created empty, and it thinks things are OK...haha. But clearly there is no controller pod running, which is odd
h
Perhaps try removing and re-installing. I've personally not seen that before. The CPI is the easier part.
a
yeah - I have found it through Rancher to be the quickest and easiest part. I've tried a few times to delete/uninstall through helm and redo it, but it ends up in the same state each time - it's really strange, and of course nothing shows an actual error
h
Do you have a tool like OpenLens which you can dig around with?
a
unfortunately, I don't have that
h
Ha, that's a shame, it's very good at showing errors. Can you describe the DaemonSet? Or is it straight up not being made by helm?
Ah. I think I see. Your node labels don't match a normal RKE2 install.
This is from one of my working clusters. I think the issue you have is regarding your controlplane taint. It should be control-plane to match best practice, and what this chart requires.
a
Copy code
[root@vsphere-cpi]# helm upgrade --install vsphere-cpi vsphere-cpi/vsphere-cpi --namespace kube-system --set config.enabled=true --set config.vcenter=hci-vcenter.domain.local --set config.username=administrator@vsphere.local --set config.password=XXX --set config.datacenter=DataCenter
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
Release "vsphere-cpi" does not exist. Installing it now.
NAME: vsphere-cpi
LAST DEPLOYED: Thu Dec 22 15:12:01 2022
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing vsphere-cpi.

Your release is named vsphere-cpi.

[root@vsphere-cpi]# helm list -n kube-system
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
vsphere-cpi     kube-system     1               2022-12-22 15:12:01.300807621 -0500 EST deployed        vsphere-cpi-1.25.0      1.25.0




[root@vsphere-cpi]# kubectl get ds -n kube-system
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
canal         5         5         5       5            5           kubernetes.io/os=linux   59m
vsphere-cpi   0         0         0       0            0           <none>                   48s
[root@cetech-lnx01 vsphere-cpi]# kubectl describe ds vsphere-cpi -n kube-system
Name:           vsphere-cpi
Selector:       app=vsphere-cpi
Node-Selector:  <none>
Labels:         app=vsphere-cpi
                app.kubernetes.io/managed-by=Helm
                component=cloud-controller-manager
                tier=control-plane
                vsphere-cpi-infra=daemonset
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: vsphere-cpi
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=vsphere-cpi
                    component=cloud-controller-manager
                    release=vsphere-cpi
                    tier=control-plane
                    vsphere-cpi-infra=daemonset
  Service Account:  cloud-controller-manager
  Containers:
   vsphere-cpi:
    Image:      gcr.io/cloud-provider-vsphere/cpi/release/manager:v1.25.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --cloud-provider=vsphere
      --v=2
      --cloud-config=/etc/cloud/vsphere.conf
    Environment:  <none>
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
  Volumes:
   vsphere-config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               vsphere-cloud-config
    Optional:           false
  Priority Class Name:  system-node-critical
Events:                 <none>
h
Yeah, try adding the correct taint for your control plane and see what that does.
Copy code
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
      - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
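That affinity only matches nodes carrying one of those role labels, so it's worth checking what your nodes actually have - a quick way (sketch):
Copy code
kubectl get nodes -L node-role.kubernetes.io/control-plane -L node-role.kubernetes.io/master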
a
OK...that could be - I believe this image is using an RKE1 setup
h
Either that, or add master.
Yeah, RKE1 is still using the old taint for controlplane. I think the world has since moved on to control-plane
a
ok....that is good to know. Admittedly, I have a good amount still to learn with Rancher, so this is helpful on that front. So in your snippet above, are you saying to manually add the taint to the control-plane node like this:
Copy code
[root@cetech-lnx01 vsphere-cpi]# kubectl describe nodes | egrep "Taints:|Name:"
Name:               rubrikk8s-csi-cp1
Taints:             node-role.kubernetes.io/controlplane=true:NoSchedule
Name:               rubrikk8s-csi-etcd1
Taints:             node-role.kubernetes.io/etcd=true:NoExecute
Name:               rubrikk8s-csi-wkr1
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               rubrikk8s-csi-wkr2
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               rubrikk8s-csi-wkr3
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
[root@cetech-lnx01 vsphere-cpi]#
[root@cetech-lnx01 vsphere-cpi]# kubectl taint node rubrikk8s-csi-cp1 node-role.kubernetes.io/control-plane=true:NoSchedule
h
Yup, that should do it.
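One more thing from my notes: adding the new taint won't remove the old RKE1-style one, so you may want to drop that too. The trailing dash removes a taint (node name is yours from above):
Copy code
kubectl taint node rubrikk8s-csi-cp1 node-role.kubernetes.io/controlplane=true:NoSchedule-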
a
So I just got a new cluster to finish, changed the taint on the controlplane to control-plane, and the CPI installed through helm - THANK YOU!
Now, on the CSI driver install I am hitting an issue I saw before, and this feels like a silly question from me, but the CSI pods have this as a selector in their pod definition. I have this cluster with 1 CP, 1 etcd, and 3 workers:
Copy code
QoS Class:                   BestEffort
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
However, when applying the CSI driver manifest, the vsphere-csi-controller pods never seem to schedule:
Copy code
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  105s  default-scheduler  0/5 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/etcd: true}, that the pod didn't tolerate, 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  15s   default-scheduler  0/5 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/etcd: true}, that the pod didn't tolerate, 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.
What's throwing me off is that the node selector is just blank: node-role.kubernetes.io/control-plane= I must be missing something with my taints, but I can't seem to toggle anything to match, and it's stumping me - did you have any thoughts on that one 🙂
h
Yeah, I've seen this too
One sec, I have it in my notes...
Copy code
Next we need to pull down the latest deployment from the Kubernetes GitHub. We do this as we need to remove the node toleration, and edit the replica count if we are deploying to a single node. 
wget https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.6.0/manifests/vanilla/vsphere-csi-driver.yaml

Edit any requirements (e.g. replicas, or removing the node toleration) in the yaml.
You might want to edit lines like the below to read "true". Then remove all tolerations after.

nodeSelector:
  node-role.kubernetes.io/control-plane: ""
So in that yaml you pulled down, check for tolerations, and remove them if required.
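Alternatively (just a sketch, reusing your node name from earlier) you could leave the manifest alone and give the control-plane node the label the selector is asking for. The empty value after = is deliberate - the pod's nodeSelector wants the key with an empty value:
Copy code
# Sets the node-role.kubernetes.io/control-plane label to "" on the node,
# which is exactly what "Node-Selectors: node-role.kubernetes.io/control-plane=" matches.
kubectl label node rubrikk8s-csi-cp1 node-role.kubernetes.io/control-plane=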
a
ahh....ok...this is helpful.....I was running this one right from the raw URL as in the docs, but let me pull it down and hack through it. I will let you know how I make out, either tonight or tomorrow. Can't thank you enough for all your advice and help here, Josh - thank you!
the empty "=" was killing me. I read that it infers true and was like, WTF should I set these to...haha
h
No worries. Happy to help!