This message was deleted Rancher Users #rke2



# rke2

adamant-kite-43734

08/14/2023, 7:35 PM

This message was deleted.


creamy-pencil-82913

08/14/2023, 7:38 PM

did you try to change the token value?

creamy-pencil-82913

08/14/2023, 7:38 PM

something causing the token to be changed is really the only possible cause of that message.

bright-lifeguard-9803

08/14/2023, 7:43 PM

I did not try to change the token. Before this we saw fatal errors like this:


Aug 13 14:25:23 notice k3s: time="2023-08-13T14:25:23Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 13 14:43:34 notice k3s: time="2023-08-13T14:43:34Z" level=fatal msg="leaderelection lost for rke2"
Aug 13 14:50:23 notice k3s: time="2023-08-13T14:50:23Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 13 19:01:33 notice k3s: time="2023-08-13T19:01:33Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 13 20:09:24 notice k3s: time="2023-08-13T20:09:24Z" level=fatal msg="leaderelection lost for rke2"
Aug 13 22:48:29 notice k3s: time="2023-08-13T22:48:29Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 14 00:51:37 notice k3s: time="2023-08-14T00:51:37Z" level=fatal msg="leaderelection lost for rke2"
Aug 14 00:56:27 notice k3s: time="2023-08-14T00:56:27Z" level=fatal msg="clusterrole: EnsureRBACPolicy failed: unable to initialize roles: timed out waiting for the condition"
Aug 14 00:59:40 notice k3s: time="2023-08-14T00:59:40Z" level=fatal msg="clusterrole: EnsureRBACPolicy failed: unable to initialize roles: timed out waiting for the condition"
Aug 14 01:12:03 notice k3s: time="2023-08-14T01:12:03Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 14 02:35:26 notice k3s: time="2023-08-14T02:35:26Z" level=fatal msg="leaderelection lost for rke2-etcd"
Aug 14 04:44:22 notice k3s: time="2023-08-14T04:44:22Z" level=fatal msg="leaderelection lost for rke2"
Aug 14 17:00:44 notice k3s: time="2023-08-14T17:00:44Z" level=fatal msg="leaderelection lost for rke2"

creamy-pencil-82913

08/14/2023, 7:44 PM

why does it say k3s but you’re running rke2?

bright-lifeguard-9803

08/14/2023, 7:45 PM

It's an artifact of history... we started on k3s and moved to rke2

creamy-pencil-82913

08/14/2023, 7:46 PM

Something else is going on with this node. The token can be specified in the config; if it is not then it is written to disk in the token file. The token must match the contents of the datastore.

creamy-pencil-82913

08/14/2023, 7:46 PM

Did you somehow use the same datastore when switching from k3s to rke2?

bright-lifeguard-9803

08/14/2023, 7:47 PM

nope, we used SQLite with k3s (which, IMHO, worked much better for our single-node use case)

bright-lifeguard-9803

08/14/2023, 7:50 PM

A lot of fingers are pointing to etcd, which seems very displeased with the disk we have it running on.

creamy-pencil-82913

08/14/2023, 7:51 PM

yeah, the leader election messages would point to the datastore timing out on operations, but that would not account for the errors about the bootstrap data having a different token.
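To keep the two symptoms apart in a journal excerpt like the one earlier in the thread, a minimal sketch can tally the msg="..." value of every level=fatal line. It assumes the logfmt-style format shown in those lines; the helper name tally_fatal is hypothetical. Run over the thirteen quoted lines, it would count 6 × "leaderelection lost for rke2-etcd", 5 × "leaderelection lost for rke2" and 2 × the EnsureRBACPolicy timeout, and none of them mention the token.

```python
import re
from collections import Counter

# Matches the logfmt fields in journal lines such as:
#   ... level=fatal msg="leaderelection lost for rke2-etcd"
FATAL_RE = re.compile(r'level=fatal msg="([^"]*)"')


def tally_fatal(lines):
    """Count each distinct fatal message (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'Aug 13 14:25:23 notice k3s: time="2023-08-13T14:25:23Z" level=fatal msg="leaderelection lost for rke2-etcd"',
        'Aug 13 14:43:34 notice k3s: time="2023-08-13T14:43:34Z" level=fatal msg="leaderelection lost for rke2"',
        'Aug 14 00:56:27 notice k3s: time="2023-08-14T00:56:27Z" level=fatal msg="clusterrole: EnsureRBACPolicy failed: unable to initialize roles: timed out waiting for the condition"',
    ]
    for msg, n in tally_fatal(sample).most_common():
        print(n, msg)
```

In practice the input would be the output of something like journalctl for the rke2-server unit, piped in line by line.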

creamy-pencil-82913

08/14/2023, 7:51 PM

Did someone perhaps try to do some cleanup or move things around and accidentally delete the token file?

bright-lifeguard-9803

08/14/2023, 7:52 PM

What should the path to the token file be?

creamy-pencil-82913

08/14/2023, 7:54 PM

if you didn’t specify it in the config, one is generated for you and stored at /var/lib/rancher/rke2/server/token (that is, server/token under the default data-dir)
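The lookup order described above can be sketched roughly as follows: a token value in the config file wins; otherwise the generated file under the data dir is read. /etc/rancher/rke2/config.yaml and /var/lib/rancher/rke2 are the standard RKE2 defaults, and the data_dir parameter covers setups like the one in this thread, which uses /data/rancher/rke2. To stay stdlib-only, this sketch scans for a top-level "token:" line instead of parsing YAML and ignores drop-in config files; both are simplifying assumptions, and the helper names are hypothetical.

```python
from pathlib import Path

DEFAULT_CONFIG = Path("/etc/rancher/rke2/config.yaml")
DEFAULT_DATA_DIR = Path("/var/lib/rancher/rke2")


def token_from_config(config_path):
    """Return the value of a top-level `token:` line, or None.

    Naive scan, not a YAML parser: assumes a flat `token: value` line
    with no trailing comment.
    """
    try:
        text = Path(config_path).read_text()
    except FileNotFoundError:
        return None
    for line in text.splitlines():
        if line.startswith("token:"):
            value = line.split(":", 1)[1].strip().strip('"').strip("'")
            return value or None
    return None


def effective_token_source(config_path=DEFAULT_CONFIG, data_dir=DEFAULT_DATA_DIR):
    """Return (source, token): the config value first, then <data_dir>/server/token."""
    token = token_from_config(config_path)
    if token:
        return ("config", token)
    token_file = Path(data_dir) / "server" / "token"
    if token_file.exists():
        return (str(token_file), token_file.read_text().strip())
    return ("missing", None)
```

For the node in this thread, the call would be something like effective_token_source(data_dir="/data/rancher/rke2").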

bright-lifeguard-9803

08/14/2023, 7:56 PM


$ ls -l /data/rancher/rke2/server/token
-rw------- 1 root root 109 Aug 14 04:44 /data/rancher/rke2/server/token

bright-lifeguard-9803

08/14/2023, 7:57 PM


Access: 2023-06-21 17:36:33.832657697 +0000
Modify: 2023-08-14 04:44:43.806223505 +0000
Change: 2023-08-14 04:44:43.806223505 +0000
 Birth: -

creamy-pencil-82913

08/14/2023, 7:57 PM

what happened at 4:44 AM?

creamy-pencil-82913

08/14/2023, 7:58 PM

sure looks like the file got deleted and a new token generated, that doesn’t match what’s in the datastore.
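One way to catch a silent regeneration like this before the next restart is to record a fingerprint of the token file while the cluster is healthy and compare against it on later runs, for example from the same Ansible job. This is a minimal sketch; the helper names and the location of the saved fingerprint are hypothetical. Storing only a SHA-256 digest avoids writing a second copy of the secret.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def token_unchanged(token_path, saved_fingerprint_path):
    """True if the token file still matches the fingerprint recorded earlier."""
    saved = Path(saved_fingerprint_path).read_text().strip()
    return fingerprint(token_path) == saved
```

A deployment job could fail loudly when token_unchanged(...) returns False, instead of letting the server restart with a token that no longer matches the datastore.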

bright-lifeguard-9803

08/14/2023, 7:59 PM

Checking, but it was probably the Ansible run that executes our k8s deployment tooling.

bright-lifeguard-9803

08/14/2023, 8:00 PM

I'll run the tooling manually and see if it generates a new token (although I don't think it'll run, because rke2 is in a restart loop...).

bright-lifeguard-9803

08/14/2023, 8:02 PM

@creamy-pencil-82913 is there a way to "go back"?

creamy-pencil-82913

08/14/2023, 8:28 PM

having an old copy of the datastore won’t help. The token is used to encrypt the data in the datastore, and that data hasn’t changed. You’ve changed the token on disk, so you can no longer decrypt the bootstrap data.
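Why a regenerated token is fatal can be illustrated with a passphrase-derived key: the stored data was protected with a key derived from the original token, so a new token yields a different key and the integrity check fails. This is an illustrative sketch using stdlib PBKDF2 and an HMAC tag as a stand-in for authenticated encryption; it is not RKE2's actual bootstrap-encryption code, and the salt, iteration count and helper names are assumptions.

```python
import hashlib
import hmac

SALT = b"illustrative-salt"  # assumption: fixed salt, for the demo only
ITERATIONS = 4096            # assumption: arbitrary iteration count


def derive_key(token: str) -> bytes:
    """Derive a 32-byte key from the token with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", token.encode(), SALT, ITERATIONS, dklen=32)


def seal(token: str, data: bytes) -> bytes:
    """Stand-in for encryption: an HMAC tag keyed by the token-derived key."""
    return hmac.new(derive_key(token), data, hashlib.sha256).digest()


def can_open(token: str, data: bytes, tag: bytes) -> bool:
    """Verify the tag with a key derived from whatever token is on disk now."""
    return hmac.compare_digest(seal(token, data), tag)


bootstrap = b"bootstrap data"
tag = seal("original-token", bootstrap)
print(can_open("original-token", bootstrap, tag))     # True
print(can_open("regenerated-token", bootstrap, tag))  # False
```

A real authenticated cipher behaves the same way in this respect: decrypting with a key derived from the wrong passphrase is rejected rather than yielding usable data, which is why restoring an older datastore does not help.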

bright-lifeguard-9803

08/14/2023, 8:33 PM

OK. I ran

sudo mv /data/rancher/rke2/server/db /data/rancher/rke2/server/db.old

and that got RKE2 running...

creamy-pencil-82913

08/14/2023, 8:35 PM

you’ll have lost all your old cluster data though

bright-lifeguard-9803

08/14/2023, 8:39 PM

Understood; I just need to get the machine back into service. We have automation that sets up rke2 and applies all the k8s manifests, so blowing away a single node is not that big of a deal.

bright-lifeguard-9803

08/14/2023, 8:40 PM

Without the prior token, there's nothing we could have done anyway, correct?

creamy-pencil-82913

08/14/2023, 8:52 PM

yeah, you need the token. If you’re not specifying it in the config, you should back it up somewhere alongside the snapshots in case you ever want to do a restore.
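Following that advice, a minimal sketch that copies the generated token file into a backup directory under a UTC timestamp and creates the copy readable by its owner only. The data_dir argument would be /var/lib/rancher/rke2 by default (or /data/rancher/rke2 on the node in this thread); the destination directory and the helper name backup_token are hypothetical, standing in for wherever snapshots are already shipped.

```python
import os
import time
from pathlib import Path


def backup_token(data_dir, dest_dir):
    """Copy <data_dir>/server/token to dest_dir/token-<UTC timestamp>.

    The copy is created with mode 0600 from the start, so the secret is
    never briefly world-readable. Fails if a copy with the same timestamp
    already exists (O_EXCL).
    """
    src = Path(data_dir) / "server" / "token"
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest = dest_dir / f"token-{stamp}"
    data = src.read_bytes()
    fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return dest
```

Running it right after each snapshot keeps the token that can decrypt a given snapshot alongside it; the backup location then needs the same access controls as the token itself.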
