
sticky-summer-13450

11/29/2022, 3:08 PM
I'm occasionally finding pods stuck in the Terminating state and I want to know whether it's k3s, k8s, or me. Example: I have a cluster with 1 server node and several worker nodes, and I have workloads spread across the workers. Let's say a worker node dies - maybe it's never going to return.
$ kubectl get pods --context kube001 --all-namespaces -o=wide |grep Terminating 
kube-system        traefik-9c6dc6686-jdt9f                            1/1     Terminating   0              24d    10.42.1.4     kube002   <none>           <none>
active-mq          active-mq-6665f5d8b9-ztwnq                         1/1     Terminating   0              15d    10.42.1.82    kube002   <none>           <none>
Some of the pods get stuck in the Terminating state and don't get replaced on other worker nodes. This means the cluster is no longer respecting the declared state. Is this a problem specific to me, a problem specific to k3s, a problem with k8s, or something else?
$ kubectl describe pod --context kube001 --namespace active-mq active-mq-6665f5d8b9-ztwnq
Name:                      active-mq-6665f5d8b9-ztwnq
Namespace:                 active-mq
Priority:                  0
Service Account:           default
Node:                      kube002/10.64.8.117
Start Time:                Sun, 13 Nov 2022 15:07:50 +0000
Labels:                    app=active-mq
                           pod-template-hash=6665f5d8b9
Annotations:               kubernetic.com/restartedAt: 2021-04-30T16:34:24+01:00
Status:                    Terminating (lasts 2d5h)
Termination Grace Period:  30s
...
In this example the pod has been in that state for more than 2 days.
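(For context, a quick way to see how the control plane views the dead worker - assuming the node name kube002 from the output above - is to check the node objects directly; this is an illustrative sketch, not from the original thread:

$ kubectl get nodes --context kube001
$ kubectl describe node kube002 --context kube001

A node whose kubelet has stopped posting heartbeats shows as NotReady and its conditions flip to Unknown, and pods scheduled there can then linger in Terminating exactly as shown above.)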

creamy-pencil-82913

11/29/2022, 6:40 PM
if it’s never coming back you should delete it from the cluster
it can’t show as “terminated” because Kubernetes doesn’t know anything about that node. It’s not reporting in, so showing that it’s terminated would be incorrect. It could still be running but just unable to reach the server.
Kubernetes can’t reason about anything happening on a node that’s not checking in.
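(One way to clear an individual stuck pod without touching the node - an illustrative alternative using the pod name from the output above, not something suggested in the thread itself:

$ kubectl delete pod active-mq-6665f5d8b9-ztwnq --namespace active-mq --context kube001 --grace-period=0 --force

As noted above, this only removes the API object; if the node is in fact still alive, the container could keep running there until the kubelet reconnects.)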

sticky-summer-13450

11/29/2022, 6:44 PM
I hope the node will come back, but currently Harvester/Longhorn has eaten the root disk. So maybe I should remove the node and it'll join again.
> it can’t show as “terminated” because Kubernetes doesn’t know anything about that node.
So - why have some pods terminated correctly while others are stuck showing Terminating? It seems inconsistent.

creamy-pencil-82913

11/29/2022, 7:35 PM
This is covered in the Kubernetes documentation. https://kubernetes.io/docs/concepts/architecture/nodes/#node-status
> The node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the API server and frees up their names.
Kubernetes is what is known as “eventually consistent”. If some part of the cluster is unavailable and can’t be reasoned about, that information can’t be updated until it either comes back, or is manually deleted.
👍 1
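(Deleting the node object by hand, as the docs quote describes, would look something like this - assuming the node name kube002 from the earlier output:

$ kubectl delete node kube002 --context kube001

Once the Node object is gone, the stale Pod objects bound to it are deleted from the API server and the owning Deployments/ReplicaSets can schedule replacements on the remaining workers.)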

sticky-summer-13450

11/30/2022, 9:00 AM
I deleted the node and the VMs got themselves out of the Terminating state. Thank you. I really need to get the "cattle not pets" mantra locked even harder into my brain. The node is a VM that's playing up in Harvester, so I didn't think that deleting the node from Kubernetes was the right thing to do. Thanks for correcting me.
👍 1
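(If the underlying VM is repaired later, a k3s worker will normally re-register itself when its agent starts up again - a sketch assuming the standard install where the agent runs as the k3s-agent systemd unit:

$ sudo systemctl restart k3s-agent   # run on the worker once its disk is healthy

The node then reappears in kubectl get nodes as a fresh Node object.)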