# k3s
l
Another potentially interesting error:
Failed to get node when trying to set owner ref to the node lease
Anyone? Thank you
Now I also tried adding --node-name to have a base node name … still getting the same error.
Is the intention of --with-node-id that it should be used in combination with --node-name, with matching values for the two parameters? Hmm, doesn’t seem so … testing things here. Reading “How Agent Node Registration Works” from the K3s docs … is using --with-node-id only relevant and valid if there’s already a node-password.k3s Secret in kube-system? If not, will things fail the way I’m experiencing them?
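A minimal sketch of the setup under discussion, assuming the standard K3s install script; the server URL and token are placeholders for this environment:
```bash
# Join an agent with a fixed base node name and a generated id suffix appended to it.
curl -sfL https://get.k3s.io | \
  K3S_URL="https://<server>:6443" K3S_TOKEN="<token>" \
  sh -s - --node-name test-test-worker-129 --with-node-id

# On the server, check whether a node-password Secret already exists for that base name:
kubectl -n kube-system get secrets | grep node-password
```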
c
It'll just make up a random id number and store it in the filesystem next to the node password file. As long as the file is retained it will use the same id, which is appended to the node name - however it is set.
idk you'll have to be a lot more specific than just "doesn't join the cluster"
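As a hedged illustration of the above: the node password is kept under /etc/rancher/node/ on the agent by default, and per the explanation the generated id is stored next to it (the exact filename of the id file may differ):
```bash
# On the agent: the files that make the node identity sticky across restarts.
ls -l /etc/rancher/node/

# On the server: the registered name is the base node name plus the id suffix,
# e.g. test-test-worker-129-<id>.
kubectl get nodes -o wide
```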
l
I’m trying to be specific. I provided log messages above. I see this in the K3s agent journal logs on the worker (journalctl):
```
Nov 28 19:56:51 test-test-worker-129 k3s[4296]: E1128 19:56:51.145395    4296 kubelet_node_status.go:453] "Error getting the current node from lister" err="node \"test-test-worker-129-94ede7cb\" not found"
```
And
```
Failed to get node when trying to set owner ref to the node lease
```
Executing the hostname command on the worker gives the plain hostname (without the node id appended): test-test-worker-129. Thank you
It’s k3s v1.31.2
c
The kubelet should create the node object. You'd need to find where in the logs it's failing to do that.
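One hedged way to dig for that, assuming the agent runs as the k3s-agent systemd unit (as it does with the default install):
```bash
# Follow the agent journal and filter for registration / node errors:
journalctl -u k3s-agent -f | grep -iE 'node|register|error'

# On the server, check whether the node object ever shows up (name taken from the logs above):
kubectl get node test-test-worker-129-94ede7cb -o yaml
```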
l
I’ll give it a go …
I did some digging and these logs are interesting on the control-plane leader:
```
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: E1202 10:37:24.315821    4357 reflector.go:158] "Unhandled Error" err="k8s.io/client-go@v1.31.2-k3s1/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: an error on the server (\"unknown\") has prevented the request from succeeding"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.481875    4357 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"test-test-worker-129-70d99708\" does not exist"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522599    4357 range_allocator.go:422] "Set node PodCIDR" node="test-test-worker-129-70d99708" podCIDRs=["10.240.32.0/24"]
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522630    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522649    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522686    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.531757    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:25 test-test-ctlplane-0 k3s[4357]: I1202 10:37:25.289480    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.487328    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.495990    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.509702    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=error msg="error syncing 'test-test-worker-129-70d99708': handler node: Operation cannot be fulfilled on nodes \"test-test-worker-129-70d99708\": StorageError: invalid object, Code: 4, Key: /registry/minions/test-test-worker-129-70d99708, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: c6eb6724-6f45-4452-bdc7-928ea7e05fab, UID in object meta: , requeuing"
Dec 02 10:37:31 test-test-ctlplane-0 k3s[4357]: E1202 10:37:31.585861    4357 reflector.go:158] "Unhandled Error" err="k8s.io/client-go@v1.31.2-k3s1/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: an error on the server (\"unknown\") has prevented the request from succeeding"
```
Found this issue: https://github.com/kubernetes/kubernetes/issues/124347 That’s interesting. But I’ve no idea, so far, how to fix it.
c
```
"Set node PodCIDR" node="test-test-worker-129-70d99708" podCIDRs=["10.240.32.0/24"]
level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
```
The node is created and then deleted. What are you running that is deleting nodes from your cluster? Do you have some 3rd party cloud provider installed that is deleting nodes it does not recognize?
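Two hedged ways of looking for whatever is deleting the nodes; the deployment name below is a placeholder, not a known component of this cluster:
```bash
# Node-related events often record deletions and their reason:
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

# If a cloud controller manager is deployed, its logs usually say why it removed a node:
kubectl -n kube-system logs deploy/<cloud-controller-manager> | grep -i delet
```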
l
There was no node with that name already in the cluster. I’m trying to see whether using --with-node-id can be an acceptable workaround for the case where a worker was improperly removed and a dangling node-password Secret is therefore left behind. I’m trying to avoid granting the service account used for auto-scaling RBAC permissions to list/read Secrets in kube-system - I don’t like that from a security perspective.
Also … each time I try, the id is different, so would there ever be a node-password Secret named after the new worker’s base name + id? Isn’t that the core idea of the --with-node-id parameter? Thank you very much
But I’ll verify just in case … But I never saw the node join the cluster … the auto-scaling we have does not act on VMs directly on the underlying HCI.
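If the concern is only the dangling Secret left behind by an improperly removed worker, a one-off manual cleanup is an alternative to giving the autoscaler Secret permissions; a sketch (the Secret name is an example):
```bash
# Compare leftover node-password Secrets with the nodes that actually exist:
kubectl -n kube-system get secrets -o name | grep node-password
kubectl get nodes -o name

# Remove the stale entry for a node that is gone:
kubectl -n kube-system delete secret test-test-worker-129.node-password.k3s
```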
c
one of those messages is from node creation. the other is from node deletion. The reason they’re failing to join is that something is deleting the node as soon as it is created by the agent.
😳 1
there is nothing built in to K3s that will do that, so it is something you have deployed.
🤠 1
l
Aaah so this:
```
level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
```
is a red herring in my context … it occurs because the node was deleted, and in this case there was simply no node-password Secret left to remove.
I’ll keep digging! And thank you for your patience.
Okay if I restart the k3s agent service I can see that the node tries to join … weird things are happening
hhmm
More info after another day of troubleshooting … I can get a new worker using --with-node-id to join the K3s v1.31.2 cluster if I restart the k3s-agent service 3-4 times on the worker. On other clusters on Google Compute Engine, as well as on VMware, I do not have to restart the k3s-agent service. The other cluster on GCP Compute Engine, running K3s v1.31.1, has the same resources on both the control-plane and worker side as this one, where I have to restart the k3s-agent multiple times before the worker using --with-node-id joins. Any ideas?
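For reference, the restart-and-watch loop being described is roughly:
```bash
# On the worker: restart the agent and follow its journal.
sudo systemctl restart k3s-agent
journalctl -u k3s-agent -f

# On the server: watch the node appear (and possibly disappear again).
kubectl get nodes -w
```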
c
did you figure out what’s deleting the node as it joins?
l
When following the log it seems that the CNI fails to start up.
I can’t see anything deleting it.
c
why is the cni failing to start?
cni waits on the cloud provider uninitialized taint to be removed
what cloud provider are you using? Have you looked at its logs?
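The taint in question can be checked directly while the node object exists; a quick sketch using the node name from the earlier logs:
```bash
# The CNI (and most workloads) won't schedule until the cloud controller manager
# removes node.cloudprovider.kubernetes.io/uninitialized from the node.
kubectl get node test-test-worker-129-70d99708 -o jsonpath='{.spec.taints}{"\n"}'
```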
l
The clusters are bootstrapped the same way … the one on GCP Google Compute Engine that works: node not deleted, CNI starts up … on the one not working there are only small differences, like the K3s version being one patch version apart …
Google Cloud in this case … we also have clusters on VMware where it works …
c
no but what cloud provider (cloud controller manager) are you using
l
aah sorry
openshift origin-gcp-cloud-controller-manager v4.14
Google does not make theirs available publicly … we’re running on Compute Engine and not GKE.
c
I suggest you look at the logs and config for that. I suspect it’s deleting your nodes as they join because it can’t find a matching GCP instance.
l
hmmm very interesting …
c
that is a common occurrence with cloud providers. I’ve seen the VMware one do the same thing if the providerID isn’t set properly
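Checking the providerID on a node that does manage to join is a quick way to test that theory; a sketch (node name is an example from the earlier logs):
```bash
# On GCE a node's providerID normally looks like gce://<project>/<zone>/<instance-name>;
# if it is empty or does not match a real instance, the CCM may treat the node as unknown.
kubectl get node test-test-worker-129-70d99708 -o jsonpath='{.spec.providerID}{"\n"}'
```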
l
oh man