03/27/2023, 9:00 PM
I have an RKE2 cluster created in Rancher Manager 2.7.1. After the nodes were running, I adjusted the cluster config. This triggered a rolling replacement, which was cool automation to watch. When the node replacements finally finished, numerous pods were still stuck in a transitional state, as viewed in Rancher Manager. I can see them with kubectl too. The cluster seems to be functioning despite all the ruckus in the pod list. Is it normal to have to clean up a bunch of stuck pods after a cluster configuration change?
This appears to have succeeded, but it's hard to say whether there will be consequences down the road.
❯ k logs rke2-coredns-rke2-coredns-58fd75f64b-9zrfb -n kube-system
Error from server: Get "<>": proxy error from while dialing, code 503: 503 Service Unavailable

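My guess is that the 503 from `kubectl logs` means the API server is proxying the log request to a kubelet on a node that was replaced. Something like this (a hypothetical sketch, using the pod from the error above) should confirm the pod record is still bound to a node that no longer exists:

```shell
# Sketch: find the node the stuck pod is still bound to, then check whether
# that node object still exists. A NotFound on the second command would
# explain the 503 -- the log request is proxied to a kubelet that is gone.
ns=kube-system
pod=rke2-coredns-rke2-coredns-58fd75f64b-9zrfb
node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
kubectl get node "$node"
```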
❯ k delete pod rke2-coredns-rke2-coredns-58fd75f64b-9zrfb -n kube-system --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rke2-coredns-rke2-coredns-58fd75f64b-9zrfb" force deleted
I confirmed the pod no longer exists. Is there a playbook I should follow to clean them up one by one, or is this what you routinely do with stuck pods?
❯ k get pods -A --sort-by=.metadata.creationTimestamp | grep -vE 'Running|Completed'
NAMESPACE             NAME                                                    READY   STATUS              RESTARTS      AGE
kube-system           rke2-coredns-rke2-coredns-58fd75f64b-9zrfb              1/1     Terminating         0             18h
kube-system           harvester-csi-driver-controllers-779c557d47-8sjxs       3/3     Terminating         0             18h
kube-system           harvester-cloud-provider-568dd85c97-zc9vj               1/1     Terminating         0             18h
kube-system           harvester-csi-driver-controllers-779c557d47-9hxjh       3/3     Terminating         0             18h
kube-system           harvester-csi-driver-controllers-779c557d47-mgxlg       3/3     Terminating         0             18h
kube-system           rke2-coredns-rke2-coredns-autoscaler-768bfc5985-gm84r   1/1     Terminating         0             18h
calico-system         calico-node-kv2j4                                       1/1     Terminating         0             18h
calico-system         calico-kube-controllers-858bd946f6-khksp                1/1     Terminating         0             18h
kube-system           rke2-metrics-server-67697454f8-jnbfv                    1/1     Terminating         0             18h
kube-system           rke2-ingress-nginx-controller-sptfv                     0/1     Terminating         0             18h
cattle-fleet-system   fleet-agent-8ccf6dbc8-hknlm                             1/1     Terminating         0             18h
cattle-system         cattle-cluster-agent-84476b4b9f-bvh89                   1/1     Terminating         0             18h
cattle-system         cattle-cluster-agent-84476b4b9f-6kbl5                   1/1     Terminating         0             18h
kube-system           harvester-cloud-provider-77d4df878b-r5h2x               0/1     ContainerCreating   0             18h
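Rather than repeating the force delete by hand, a loop like this could clear everything stuck in Terminating in one pass. This is only a sketch: it assumes the standard `kubectl get pods -A --no-headers` column layout (STATUS is the fourth column), and force deletion only removes the API object, so it's worth reviewing the list first. Since the old nodes were replaced, the containers themselves should already be gone.

```shell
# Force-delete every pod stuck in Terminating, namespace by namespace.
# Review the filtered list before running the delete loop.
kubectl get pods -A --no-headers \
  | awk '$4 == "Terminating" { print $1, $2 }' \
  | while read -r ns pod; do
      kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
    done
```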
The cluster is a three-node v1.24.8+rke2r1 deployment on Harvester 1.1.1 using the Harvester Cloud Provider.
I did not encounter the same problem with a new cluster created by Rancher 2.6.11, instead of 2.7.1, on the same Harvester 1.1.1 node. This time a newer version, v1.24.10+rke2r1, was available for the Harvester Cloud Provider, despite the older Rancher Manager version, which I had installed with Helm in Rancher Desktop's K3s.