# rke2
m
helm-install-rke2-snapshot-controller-lvzz6 is in CrashLoopBackOff:
```
helm_v3 install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 rke2-snapshot-controller /tmp/rke2-snapshot-controller.tgz
Error: INSTALLATION FAILED: execution error at (rke2-snapshot-controller/templates/validate-install-crd.yaml:13:7): Required CRDs are missing. Please install the corresponding CRD chart before installing this chart.
```
helm-install-rke2-snapshot-controller-crd-sd7xf is stuck in Running:
```
E0404 18:58:24.099514       1 reflector.go:140] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: Failed to watch *v1.VolumeSnapshotClass: failed to list *v1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)

[main] 2024/04/04 18:37:45 Storage driver is Secret
[main] 2024/04/04 18:37:45 Max history per release is 0
$HELM_HOME has been configured at /home/klipper-helm/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ jq -r '.Releases | length'
++ timeout -s KILL 30 helm_v2 ls --all '^rke2-snapshot-controller-crd$' --output json
```
This causes the snapshot validation webhook to be stuck in CrashLoopBackOff as well.
Any ideas what would cause it?
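A quick way to narrow this down, sketched under the assumption that you have kubectl access and that rke2 names its chart jobs `helm-install-<chart>`, is to check whether the VolumeSnapshot CRDs the controller chart validates against actually exist yet:
```
# Check for the CRDs that validate-install-crd.yaml looks for; if these are
# missing, the rke2-snapshot-controller chart will refuse to install.
kubectl get crd \
  volumesnapshotclasses.snapshot.storage.k8s.io \
  volumesnapshotcontents.snapshot.storage.k8s.io \
  volumesnapshots.snapshot.storage.k8s.io

# If they are missing, inspect the CRD chart's install job (name assumed from
# rke2's helm-install-<chart> convention).
kubectl -n kube-system get job helm-install-rke2-snapshot-controller-crd
kubectl -n kube-system describe job helm-install-rke2-snapshot-controller-crd
```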
c
how is the snapshot validation webhook installed if the snapshot controller chart hasn’t installed yet?
ah right, the CRDs are in the controller chart, but the webhook doesn’t bundle them
sorry, it’s been a bit since I touched that chart
I would probably try to figure out why that `helm_v2 ls` is hanging, that is unusual. Is there some problem with the node that the job pod is running on?
That should time out and get killed after 30 seconds and then move on to just checking the normal v3 chart, but whatever’s causing v2 to fail is probably an issue for v3 as well. You’ll want to get the full log.
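A minimal sketch of how that full log could be captured, assuming the pod and job names shown earlier in the thread:
```
# Full log of the stuck CRD install pod (random suffix taken from the thread above).
kubectl -n kube-system logs helm-install-rke2-snapshot-controller-crd-sd7xf --tail=-1

# Or address the pod through its job so the suffix doesn't matter.
kubectl -n kube-system logs job/helm-install-rke2-snapshot-controller-crd --all-containers
```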
m
The node was healthy and in a Ready state. I was trying to see if it was a deeper failure, so I tried deleting the helm-install jobs that were stuck. They then re-ran and completed successfully, and the webhook recovered afterwards. But our tests time out after 5 minutes, so it was stuck until then; that was the full log.
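For reference, that recovery step roughly amounts to the following, assuming the jobs use rke2's `helm-install-<chart>` naming (the Helm controller recreates any job that is deleted):
```
# Delete the stuck install jobs; the Helm controller recreates and re-runs them.
kubectl -n kube-system delete job \
  helm-install-rke2-snapshot-controller-crd \
  helm-install-rke2-snapshot-controller

# Watch the replacement jobs run to completion.
kubectl -n kube-system get jobs -w
```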
After the node reports ‘Ready’ we run this, and it times out on the 2nd condition every so often:
```
wait_for_condition \
  "Nodes to report 'Ready' status" \
  "kubectl wait --for=condition=Ready nodes --all --timeout=120s"
sleep 5
wait_for_condition \
  "kube-system Jobs to complete" \
  "kubectl -n kube-system wait --for=condition=complete --timeout=600s jobs --all"
wait_for_condition \
  "kube-system Deployments to report available" \
  "kubectl -n kube-system wait --for=condition=available --timeout=600s deployments --all"
```