# rke2
r
```
level=error msg="spc: failed to wait for node deletion: timed out waiting for the condition"
```
I stopped the service and noticed this line. It fails to restart due to:
```
level=fatal msg="preparing server: failed to bootstrap cluster data: bootstrap data already found and encrypted with different token"
```
Is the 'wait for node deletion' related?
c
no
the fatal error is your problem
you are apparently not setting the correct cluster join token in your config or cli
or more likely you’re not setting one at all.
r
the latter, we don't hard-code a token
c
well you need to if you want to join the cluster
if you don’t set one, one is generated for you and written to the token file on the server when the first server in the cluster is starting up… but if you’re joining a new server to an existing cluster this file obviously won’t exist yet so you’ll need to put it in the config.
this is noted in the docs and at the top of the release notes for every release
> If your server (control-plane) nodes were not started with the `--token` CLI flag or config file key, a randomized token was generated during initial cluster startup. This key is used both for joining new nodes to the cluster, and for encrypting cluster bootstrap data within the datastore. Ensure that you retain a copy of this token, as it is required when restoring from backup.
>
> You may retrieve the token value from any server already joined to the cluster:
```
cat /var/lib/rancher/rke2/server/token
```
r
Right, not joining a new server though. It's a single node instance. I stopped the service, noticed the bad exit. Restarting it resulted in the mismatch error.
c
are you using an external db?
r
kine/sqlite
c
what version of rke2?
Did something happen that wiped out the contents of that token file?
r
1.33.1
For reasons I can't remember, we modify our service to remove `/data/rancher/rke2/server/cred/passwd` on startup, but never the token.
(we use /data instead of /var/lib)
For some further context, this was a 'stress test' situation where we filled the disk to nearly full, then restarted our cluster. We also have some custom kubelet args around evictions:
```
- eviction-hard=imagefs.available<1%,nodefs.available<1%
- eviction-minimum-reclaim=imagefs.available=500Mi,nodefs.available=500Mi
- image-gc-high-threshold=100
```
c
Something happened to make the contents of the token file no longer match the token previously used. Did the file perhaps get truncated on startup because the disk was full?
r
stat-ing the token file, it hasn't changed since before we filled the disk, so it does not appear so
c
On startup the server tries to read the token from the token file if you haven't set one in the config, but that means you're reliant on that token file not getting mangled: https://github.com/k3s-io/k3s/blob/master/pkg/cluster/bootstrap.go#L267-L278
The other possibility is that the sqlite db got corrupted
you can try using the sqlite CLI tools to open the db file and delete any rows whose keys start with `/bootstrap/`
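A rough sketch of what that could look like with the `sqlite3` CLI; the db path and the `kine` table/`name` column are assumptions based on kine's default sqlite layout, so verify them against your install, and stop rke2 and back up the file before touching anything:
```
# Assumed kine sqlite layout: table `kine`, key column `name`.
# Path adjusted for this install's /data prefix (default is /var/lib).
cp /data/rancher/rke2/server/db/state.db /tmp/state.db.bak

# Inspect the bootstrap rows first
sqlite3 /data/rancher/rke2/server/db/state.db \
  "SELECT name FROM kine WHERE name LIKE '/bootstrap/%';"

# If a stale row turns out to be the culprit, it could then be removed:
# sqlite3 /data/rancher/rke2/server/db/state.db \
#   "DELETE FROM kine WHERE name LIKE '/bootstrap/%';"
```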
r
I'm wondering if a compaction + nearly full disk messed things up?
c
but regardless, I would recommend setting a fixed token in your config, even on a single node cluster.
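Something along these lines, assuming the default `/etc/rancher/rke2/config.yaml` config path (the token value below is just a placeholder):
```
# Hypothetical example: pin the cluster token in the rke2 config file
# so restarts don't depend on the generated token file surviving intact.
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
token: my-fixed-cluster-token
EOF
```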
there are all kinds of fun things that can happen when you intentionally fill your disks. I can only guess at what actually occurred.
r
I'll take a look in the db w/r/t `/bootstrap`, and yea we can look at hardcoding a token for sure.
> there are all kinds of fun things that can happen when you intentionally fill your disks. I can only guess at what actually occurred.
indeed!
Thanks for the help!
c
gl!