# k3s
l
Another potentially interesting error:
Failed to get node when trying to set owner ref to the node lease
Anyone? Thank you
Now I also tried adding --node-name to have a base node name … still getting the same error.
Is the intention of --with-node-id that it should be used in combination with --node-name, with matching values for the two parameters? Hmm, doesn’t seem so … testing things here. Reading “How Agent Node Registration Works” from the K3s docs … is using --with-node-id only relevant and valid if there’s already a node-password.k3s Secret in kube-system? If not, will things fail the way I’m experiencing them?
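A minimal sketch of the setup under discussion, assuming the standard K3s install script; the server URL and token are placeholders for this environment:
```bash
# Join an agent with a fixed base node name and a generated id suffix appended to it.
curl -sfL https://get.k3s.io | \
  K3S_URL="https://<server>:6443" K3S_TOKEN="<token>" \
  sh -s - --node-name test-test-worker-129 --with-node-id

# On the server, check whether a node-password Secret already exists for that base name:
kubectl -n kube-system get secrets | grep node-password
```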
c
It'll just make up a random id number and store it in the filesystem next to the node password file. As long as the file is retained it will use the same id, which is appended to the node name - however it is set.
idk you'll have to be a lot more specific than just "doesn't join the cluster"
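As a hedged illustration of the above: the node password is kept under /etc/rancher/node/ on the agent by default, and per the explanation the generated id is stored next to it (the exact filename of the id file may differ):
```bash
# On the agent: the files that make the node identity sticky across restarts.
ls -l /etc/rancher/node/

# On the server: the registered name is the base node name plus the id suffix,
# e.g. test-test-worker-129-<id>.
kubectl get nodes -o wide
```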
l
I’m trying to be specific. I provided log messages above. I see this in the K3s agent journal logs on the worker (journalctl):
```
Nov 28 19:56:51 test-test-worker-129 k3s[4296]: E1128 19:56:51.145395    4296 kubelet_node_status.go:453] "Error getting the current node from lister" err="node \"test-test-worker-129-94ede7cb\" not found"
```
And
```
Failed to get node when trying to set owner ref to the node lease
```
Executing the hostname command on the worker gives the plain hostname (without the node id appended): test-test-worker-129. Thank you
It’s k3s v1.31.2
c
The kubelet should create the node object. You'd need to find where in the logs it's failing to do that.
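One hedged way to dig for that, assuming the agent runs as the k3s-agent systemd unit (as it does with the default install):
```bash
# Follow the agent journal and filter for registration / node errors:
journalctl -u k3s-agent -f | grep -iE 'node|register|error'

# On the server, check whether the node object ever shows up (name taken from the logs above):
kubectl get node test-test-worker-129-94ede7cb -o yaml
```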
l
I’ll give it a go …
I did some digging and these logs are interesting on the control-plane leader:
```
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: E1202 10:37:24.315821    4357 reflector.go:158] "Unhandled Error" err="k8s.io/client-go@v1.31.2-k3s1/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: an error on the server (\"unknown\") has prevented the request from succeeding"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.481875    4357 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"test-test-worker-129-70d99708\" does not exist"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522599    4357 range_allocator.go:422] "Set node PodCIDR" node="test-test-worker-129-70d99708" podCIDRs=["10.240.32.0/24"]
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522630    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522649    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.522686    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:24 test-test-ctlplane-0 k3s[4357]: I1202 10:37:24.531757    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:25 test-test-ctlplane-0 k3s[4357]: I1202 10:37:25.289480    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.487328    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.495990    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: I1202 10:37:27.509702    4357 range_allocator.go:241] "Successfully synced" key="test-test-worker-129-70d99708"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
Dec 02 10:37:27 test-test-ctlplane-0 k3s[4357]: time="2024-12-02T10:37:27Z" level=error msg="error syncing 'test-test-worker-129-70d99708': handler node: Operation cannot be fulfilled on nodes \"test-test-worker-129-70d99708\": StorageError: invalid object, Code: 4, Key: /registry/minions/test-test-worker-129-70d99708, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: c6eb6724-6f45-4452-bdc7-928ea7e05fab, UID in object meta: , requeuing"
Dec 02 10:37:31 test-test-ctlplane-0 k3s[4357]: E1202 10:37:31.585861    4357 reflector.go:158] "Unhandled Error" err="k8s.io/client-go@v1.31.2-k3s1/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: an error on the server (\"unknown\") has prevented the request from succeeding"
```
Found this issue: https://github.com/kubernetes/kubernetes/issues/124347 That’s interesting. But I’ve no idea, so far, how to fix it.
c
```
"Set node PodCIDR" node="test-test-worker-129-70d99708" podCIDRs=["10.240.32.0/24"]
level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
```
The node is created and then deleted. What are you running that is deleting nodes from your cluster? Do you have some 3rd party cloud provider installed that is deleting nodes it does not recognize?
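Two hedged ways of looking for whatever is deleting the nodes; the deployment name below is a placeholder, not a known component of this cluster:
```bash
# Node-related events often record deletions and their reason:
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

# If a cloud controller manager is deployed, its logs usually say why it removed a node:
kubectl -n kube-system logs deploy/<cloud-controller-manager> | grep -i delet
```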
l
There was no node with that name already in the cluster. I’m trying to see whether using --with-node-id can be an acceptable workaround for the case where a worker was improperly removed and a dangling node-password Secret is therefore left behind. I’m trying to avoid granting the service account used for auto-scaling RBAC permissions to list/read Secrets in kube-system - I don’t like that from a security perspective.
Also … each time I try, the id is different, so would there ever be a node-password Secret named after the new worker’s base name + id? Isn’t that the core idea of the --with-node-id parameter? Thank you very much
But I’ll verify just in case … But I never saw the node join the cluster … the auto-scaling we have does not act on VMs directly on the underlying HCI.
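If the concern is only the dangling Secret left behind by an improperly removed worker, a one-off manual cleanup is an alternative to giving the autoscaler Secret permissions; a sketch (the Secret name is an example):
```bash
# Compare leftover node-password Secrets with the nodes that actually exist:
kubectl -n kube-system get secrets -o name | grep node-password
kubectl get nodes -o name

# Remove the stale entry for a node that is gone:
kubectl -n kube-system delete secret test-test-worker-129.node-password.k3s
```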
c
one of those messages is from node creation. the other is from node deletion. The reason they’re failing to join is that something is deleting the node as soon as it is created by the agent.
😳 1
there is nothing built in to K3s that will do that, so it is something you have deployed.
🤠 1
l
Aaah so this:
```
level=warning msg="Unable to remove node password: secrets \"test-test-worker-129-70d99708.node-password.k3s\" not found"
```
is a red herring in my context … it occurs because the node was deleted, and in this case there was simply no node-password Secret left to remove.
I’ll keep digging! And thank you for your patience.
Okay if I restart the k3s agent service I can see that the node tries to join … weird things are happening
hhmm
More info after another day of troubleshooting … I can get a new worker using --with-node-id to join the K3s v1.31.2 cluster if I restart the k3s-agent service 3-4 times on the worker. On other clusters on Google Compute Engine, as well as on VMware, I do not have to restart the k3s-agent service. The other cluster on GCP Compute Engine, running K3s v1.31.1, has the same resources on both the control-plane and worker side as this one, where I have to restart the k3s-agent multiple times before the worker using --with-node-id joins. Any ideas?
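For reference, the restart-and-watch loop being described is roughly:
```bash
# On the worker: restart the agent and follow its journal.
sudo systemctl restart k3s-agent
journalctl -u k3s-agent -f

# On the server: watch the node appear (and possibly disappear again).
kubectl get nodes -w
```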
c
did you figure out what’s deleting the node as it joins?
l
When following the log it seems that the CNI fails to start up.
I can’t see anything deleting it.
c
why is the cni failing to start?
cni waits on the cloud provider uninitialized taint to be removed
what cloud provider are you using? Have you looked at its logs?
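The taint in question can be checked directly while the node object exists; a quick sketch using the node name from the earlier logs:
```bash
# The CNI (and most workloads) won't schedule until the cloud controller manager
# removes node.cloudprovider.kubernetes.io/uninitialized from the node.
kubectl get node test-test-worker-129-70d99708 -o jsonpath='{.spec.taints}{"\n"}'
```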
l
The clusters are bootstrapped the same way … the one on GCP Google Compute Engine that works: node not deleted, CNI starts up … on the one not working there are only small differences, like the K3s version being one patch version apart …
Google Cloud in this case … we also have clusters on VMware where it works …
c
no but what cloud provider (cloud controller manager) are you using
l
aah sorry
openshift origin-gcp-cloud-controller-manager v4.14
Google does not make theirs available publicly … we’re running on Compute Engine and not GKE.
c
I suggest you look at the logs and config for that. I suspect it’s deleting your nodes as they join because it can’t find a matching GCP instance.
l
hmmm very interesting …
c
that is a common occurrence with cloud providers. I’ve seen the VMware one do the same thing if the providerID isn’t set properly
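Checking the providerID on a node that does manage to join is a quick way to test that theory; a sketch (node name is an example from the earlier logs):
```bash
# On GCE a node's providerID normally looks like gce://<project>/<zone>/<instance-name>;
# if it is empty or does not match a real instance, the CCM may treat the node as unknown.
kubectl get node test-test-worker-129-70d99708 -o jsonpath='{.spec.providerID}{"\n"}'
```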
l
oh man