adventurous-address-26812

12/16/2022, 3:21 PM
Hello, I am having a bit of trouble getting the latest 2.5.1 version of the vsphere-csi driver, installed through the Rancher Marketplace, to support snapshots, and I'm curious whether anyone has gotten this to work or integrated it with backup providers that require this support. The CPI/CSI installs OK and I can create PVCs from the storageclass that is created, but snapshotting doesn't seem to be supported with this install, even though the VMware docs seem to indicate it is supported in 2.5.0 and above. Looking at this documentation: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphe[…]etting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html I'm trying to figure out whether this should be configured as part of the chart install through the marketplace or whether I need to perform all of the steps in the document manually. I haven't found much in the Rancher docs, so I am throwing it out here in case anyone has configured this before. Any help is greatly appreciated. Thanks!

hallowed-breakfast-56871

12/20/2022, 11:30 PM
We ended up deploying all of the vSphere CSI and CPI manually, direct from vSphere, after heaps of problems with the Rancher Helm charts. With this method we have zero issues and it is easily repeatable across all our cluster deployments.

adventurous-address-26812

12/22/2022, 5:43 PM
Thanks @hallowed-breakfast-56871 - I have been battling this issue for the past few days and I built another cluster to try out the manual vSphere install. I had issues with the node selectors on the generic vSphere install - did you hit any issues with that when installing on a cluster you built from Rancher, by any chance?

hallowed-breakfast-56871

12/22/2022, 6:47 PM
To cut another long story short, we also do not use Rancher to provision clusters on vSphere anymore, basically due to bugs, crashes, and issues with connecting with our private cloud. We instead terraform our RKE2 clusters, then enroll them to Rancher while bootstrapping. I've done this manual CSI and CPI method above on 6 clusters now, all work fine and actually granted me more granular control over my storage options with vSphere. Please also note that we are not using vSAN for storage.
Also note that you must set up your clusters without a cloud provider.

adventurous-address-26812

12/22/2022, 7:04 PM
You have great timing. I just built a new cluster and was trying to install with the Helm chart and it was not having it. I will tear it down and rebuild it without a cloud provider. For clarity's sake, are you doing these steps on your end:
1. Installing the Helm chart for the CPI (link #1 above)
2. Installing the CSI per link #2 above
The reason I ask about the order is that, now that I am doing this manually in testing, I came across this: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphe[…]etting-started/GUID-0AB6E692-AA47-4B6A-8CEA-38B754E16567.html That link shows how to install the CPI yet a THIRD way, so I think I will try the order above in a new cluster without a cloud provider, but I wanted to verify with you. Thanks Josh!

hallowed-breakfast-56871

12/22/2022, 7:07 PM
So just checking my notes / wiki... I do the CPI first, which should be a Helm chart that is now looked after by Kubernetes:
https://kubernetes.github.io/cloud-provider-vsphere
Then I do the CSI, which should be a deployment pulled from the Kubernetes git repo:
https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.6.0/manifests/vanilla/vsphere-csi-driver.yaml

adventurous-address-26812

12/22/2022, 7:09 PM
thanks...I have the new cluster build running now without any cloud provider so I will do that first part via the helm link you provided above and let you know how I make out. Appreciate the notes on this - they have been helpful!

hallowed-breakfast-56871

12/22/2022, 7:09 PM
No worries. I would normally give out my full wiki page, but I'd need to redact a whole bunch. I still can if you get stuck.

adventurous-address-26812

12/22/2022, 7:30 PM
np... I think you have set me on the right path. I'll hit you up if I get stuck - thank you very much for all the help and insight!

hallowed-breakfast-56871

12/22/2022, 8:04 PM
Is the `vsphere-cloud-controller-manager` running?

adventurous-address-26812

12/22/2022, 8:05 PM
Nope...nothing like that is running - which is what I was expecting to be running
It's like the Helm chart deployed OK, the DS was created empty, and it thinks things are OK...haha. But clearly there is no controller pod running, which is odd.

hallowed-breakfast-56871

12/22/2022, 8:06 PM
Perhaps try removing and re-installing it. I've personally not seen that before. The CPI is the easier part.

adventurous-address-26812

12/22/2022, 8:07 PM
Yeah - I have found it through Rancher to be the quickest and easiest part. I've tried a few times to delete/uninstall through Helm and redo it, but it ends up in the same state each time - it's really strange, and of course nothing shows an actual error.

hallowed-breakfast-56871

12/22/2022, 8:08 PM
Do you have a tool like OpenLens which you can dig around with?

adventurous-address-26812

12/22/2022, 8:08 PM
unfortunately, I don't have that

hallowed-breakfast-56871

12/22/2022, 8:10 PM
Ha, that's a shame, it's very good at showing errors. Can you describe the DaemonSet? Or is it straight up not being made by Helm?
Ah. I think I see. Your node labels don't match a normal RKE2 install.

adventurous-address-26812

12/22/2022, 8:16 PM
[root@vsphere-cpi]# helm upgrade --install vsphere-cpi vsphere-cpi/vsphere-cpi --namespace kube-system --set config.enabled=true --set config.vcenter=hci-vcenter.domain.local --set config.username=administrator@vsphere.local --set config.password=XXX --set config.datacenter=DataCenter
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
Release "vsphere-cpi" does not exist. Installing it now.
NAME: vsphere-cpi
LAST DEPLOYED: Thu Dec 22 15:12:01 2022
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing vsphere-cpi.

Your release is named vsphere-cpi.

[root@vsphere-cpi]# helm list -n kube-system
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/klaughman/rubrikk8s-vspherecsi.cfg
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
vsphere-cpi     kube-system     1               2022-12-22 15:12:01.300807621 -0500 EST deployed        vsphere-cpi-1.25.0      1.25.0




[root@vsphere-cpi]# kubectl get ds -n kube-system
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
canal         5         5         5       5            5           kubernetes.io/os=linux   59m
vsphere-cpi   0         0         0       0            0           <none>                   48s
[root@cetech-lnx01 vsphere-cpi]# kubectl describe ds vsphere-cpi -n kube-system
Name:           vsphere-cpi
Selector:       app=vsphere-cpi
Node-Selector:  <none>
Labels:         app=vsphere-cpi
                app.kubernetes.io/managed-by=Helm
                component=cloud-controller-manager
                tier=control-plane
                vsphere-cpi-infra=daemonset
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: vsphere-cpi
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=vsphere-cpi
                    component=cloud-controller-manager
                    release=vsphere-cpi
                    tier=control-plane
                    vsphere-cpi-infra=daemonset
  Service Account:  cloud-controller-manager
  Containers:
   vsphere-cpi:
    Image:      gcr.io/cloud-provider-vsphere/cpi/release/manager:v1.25.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --cloud-provider=vsphere
      --v=2
      --cloud-config=/etc/cloud/vsphere.conf
    Environment:  <none>
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
  Volumes:
   vsphere-config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               vsphere-cloud-config
    Optional:           false
  Priority Class Name:  system-node-critical
Events:                 <none>

hallowed-breakfast-56871

12/22/2022, 8:18 PM
Yeah, try adding the correct taint for your control plane and see what that does.
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
      - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
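A minimal sketch of how a required node affinity like the one above is evaluated, which may help explain a DaemonSet that reports 0 desired pods. Affinity terms are matched against node labels (taints are handled separately, through tolerations); the terms are ORed and the expressions inside a term are ANDed. The two label sets are assumptions for illustration: an RKE1-style control plane node labelled `node-role.kubernetes.io/controlplane`, and an RKE2-style one labelled `control-plane` and `master`. `kubectl get nodes --show-labels` shows the real values.

```python
# Sketch of required node affinity evaluation against node labels.
# Only the "Exists" operator is modelled, since it is the only one used above.

def term_matches(term, labels):
    """A term matches when every one of its expressions is satisfied (AND)."""
    for expr in term["matchExpressions"]:
        if expr["operator"] != "Exists":
            raise ValueError("only the Exists operator is modelled in this sketch")
        if expr["key"] not in labels:
            return False
    return True


def node_is_eligible(terms, labels):
    """nodeSelectorTerms are ORed: one matching term is enough."""
    return any(term_matches(term, labels) for term in terms)


terms = [
    {"matchExpressions": [
        {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists"}]},
    {"matchExpressions": [
        {"key": "node-role.kubernetes.io/master", "operator": "Exists"}]},
]

# Assumed label sets, for illustration only.
rke1_cp = {"node-role.kubernetes.io/controlplane": "true"}
rke2_cp = {
    "node-role.kubernetes.io/control-plane": "true",
    "node-role.kubernetes.io/master": "true",
}

print(node_is_eligible(terms, rke1_cp))  # False: neither key is present
print(node_is_eligible(terms, rke2_cp))  # True
```

With no eligible node, a DaemonSet has nothing to schedule, which is consistent with the `DESIRED 0` shown in the `kubectl get ds` output.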

adventurous-address-26812

12/22/2022, 8:18 PM
OK...that could be, I believe this image is using an RKE1 setup

hallowed-breakfast-56871

12/22/2022, 8:18 PM
Either that or add master.
Yeah, RKE1 is still using the old taint for controlplane. I think the world has since moved on to `control-plane`.

adventurous-address-26812

12/22/2022, 8:21 PM
ok....that is good to know. Admittedly, I have a good amount to still learn with Rancher so this is helpful on that front. So in your snippet above, are you saying to manually add the taint to the control-plane node like this:
[root@cetech-lnx01 vsphere-cpi]# kubectl describe nodes | egrep "Taints:|Name:"
Name:               rubrikk8s-csi-cp1
Taints:             node-role.kubernetes.io/controlplane=true:NoSchedule
Name:               rubrikk8s-csi-etcd1
Taints:             node-role.kubernetes.io/etcd=true:NoExecute
Name:               rubrikk8s-csi-wkr1
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               rubrikk8s-csi-wkr2
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               rubrikk8s-csi-wkr3
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
[root@cetech-lnx01 vsphere-cpi]#
[root@cetech-lnx01 vsphere-cpi]# kubectl taint node rubrikk8s-csi-cp1 node-role.kubernetes.io/control-plane=true:NoSchedule

hallowed-breakfast-56871

12/22/2022, 9:28 PM
Yup, that should do it.

adventurous-address-26812

12/22/2022, 9:30 PM
So I just got a new cluster to finish, and I changed the taint on the controlplane to control-plane, and the CPI installed through Helm - THANK YOU!
Now, on the CSI driver install, I am hitting an issue I saw before. This feels like a silly question, but the CSI pods have this as a selector in their pod definition. I have this cluster with 1 CP, 1 etcd, and 3 workers:
QoS Class:                   BestEffort
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
However, when applying the CSI driver manifest, the vsphere-csi-controller pods never seem to schedule:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  105s  default-scheduler  0/5 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/etcd: true}, that the pod didn't tolerate, 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  15s   default-scheduler  0/5 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/etcd: true}, that the pod didn't tolerate, 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.
What's throwing me off is that the node selector value is just blank: `node-role.kubernetes.io/control-plane=`. I must be missing something with my taints, but I can't seem to toggle anything to match, and it's stumping me. Let me know if you have any thoughts on that one 🙂
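A minimal sketch of the `nodeSelector` rule that the blank value runs into: each selector entry is an exact key/value match on node labels, so `node-role.kubernetes.io/control-plane=` only matches a node that carries that label with an empty string as its value. The three label sets below are hypothetical and are not taken from this cluster.

```python
# Sketch of nodeSelector matching: every selector entry must equal the
# node's label value exactly. An absent label never matches.

def selector_matches(node_selector, node_labels):
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


# Selector as printed by `kubectl describe pod` for the CSI controller.
csi_selector = {"node-role.kubernetes.io/control-plane": ""}

# Hypothetical label sets, for illustration only.
candidates = {
    "label absent (RKE1-style key)": {"node-role.kubernetes.io/controlplane": "true"},
    "label set to 'true'": {"node-role.kubernetes.io/control-plane": "true"},
    "label set to empty string": {"node-role.kubernetes.io/control-plane": ""},
}

for description, labels in candidates.items():
    print(f"{description}: {selector_matches(csi_selector, labels)}")
```

Only the last case matches. That leaves two ways to line things up: change the selector in the manifest to the value the node actually carries, or label the node with an empty value.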

hallowed-breakfast-56871

12/22/2022, 9:35 PM
Yeah, I've seen this too
One sec, I have it in my notes...
Next we need to pull down the latest deployment from the Kubernetes GitHub. We do this as we need to remove the node toleration, and edit the replica count if we are deploying to a single node.
`wget https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v2.6.0/manifests/vanilla/vsphere-csi-driver.yaml`

Edit any requirements (e.g. replicas or removing the node toleration) in the YAML.
You might want to edit lines like the one below so that the value reads true. Then remove all tolerations after it.
```yaml
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
```
So in that YAML you pulled down, check for tolerations, and remove them if required.
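One way to script that edit, as a rough sketch rather than the procedure from the notes: a small text filter that rewrites the selector value and optionally drops every `tolerations:` block. `patch_manifest` and the sample fragment are hypothetical; the fragment is a hand-written stand-in, not the contents of `vsphere-csi-driver.yaml`, and whether tolerations should be dropped at all depends on the taints on your nodes.

```python
# Line-based patch of a YAML manifest held as text (no YAML library needed).
# It rewrites "<selector_key>: ..." lines and can drop tolerations blocks.

def patch_manifest(text, selector_key, new_value, drop_tolerations=False):
    out = []
    skip_indent = None  # indent of the tolerations: key currently being dropped
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if skip_indent is not None:
            # Children are indented deeper, or are list items at the same indent.
            is_child = indent > skip_indent or (
                indent == skip_indent and stripped.startswith("- "))
            if stripped == "" or is_child:
                continue  # still inside the dropped block (blank lines go too)
            skip_indent = None
        if drop_tolerations and stripped.rstrip() == "tolerations:":
            skip_indent = indent
            continue
        if stripped.startswith(selector_key + ":"):
            line = " " * indent + f'{selector_key}: "{new_value}"'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fragment standing in for the downloaded manifest.
sample = '''\
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      dnsPolicy: "Default"
'''

print(patch_manifest(sample, "node-role.kubernetes.io/control-plane", "true",
                     drop_tolerations=True))
```

Diffing the patched text against the original before running `kubectl apply -f` on it is a cheap way to confirm that only the intended lines changed.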

adventurous-address-26812

12/22/2022, 9:38 PM
ahh....ok...this is helpful.....I was running this one right from the raw URL as in the docs, but let me pull this down and hack through it. I will let you know how I make out either tonight or tomorrow. Can't thank you enough for all your advice and help here Josh. thank you!
The empty "=" was killing me. I read that it infers true and was like, WTF should I set these to...haha

hallowed-breakfast-56871

12/22/2022, 9:39 PM
No worries. Happy to help!