# rke2
c
did you do what it says, and look in the log on that node?
b
@creamy-pencil-82913 yes, I can post screenshot soon.
☝️ Here is the log I’ve found
c
can you get the actual logs instead of a screenshot? The lines are all truncated.
The bits of the logs that are visible suggest that this is the first time the agent has been run on this node, is that correct? Did this node complete the install before you attempted to delete it?
b
The screenshot I am showing is the full log.
rancher-system-agent.service - Rancher System Agent
     Loaded: loaded (/etc/systemd/system/rancher-system-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-01-26 17:27:31 UTC; 9s ago
       Docs: https://www.rancher.com
   Main PID: 2684 (rancher-system-)
      Tasks: 14 (limit: 19154)
     Memory: 13.2M
     CGroup: /system.slice/rancher-system-agent.service
             └─2684 /usr/local/bin/rancher-system-agent sentinel

Jan 26 17:27:31 test-cp-6ecb68b2-v77bv systemd[1]: Started Rancher System Agent.
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is s>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Using directory /var/lib/rancher/agent/work for wo>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Starting remote watch of plans"
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: E0126 17:27:31.386641    2684 memcache.go:206] couldn't get resource list for management.cattl>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Detected first start, force-applying one-time inst>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan>
Jan 26 17:27:36 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:36Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan>
c
it’s still truncated to terminal width. Don’t use systemctl status to view logs.
use journalctl -u rancher-system-agent --no-pager instead of systemctl status rancher-system-agent
b
Jan 26 17:33:26 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:33:26Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan with checksum value of ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a, (failures: 5, threshold: 1)"
[... the same "Maximum failure threshold exceeded" message repeats every 5 seconds ...]
Jan 26 17:36:41 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:36:41Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan with checksum value of ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a, (failures: 5, threshold: 1)"
https://rancher-users.slack.com/archives/C01PHNP149L/p1706290372505929?thread_ts=1706287552.898239&cid=C01PHNP149L After restarting the Rancher cluster nodes, only this control plane was not working. So at first I wanted to restore a snapshot from 39 days ago, but it didn’t work. After that, I decided to scale down this node to recover my cluster, and then this happened.
c
that log message isn’t very helpful, it just says that the plan has failed 5 times, and it won’t try anything new until the plan changes. Go back further in the logs to see one of the 5 failures.
b
Here is logs for that @creamy-pencil-82913. https://rancher-users.slack.com/files/U0694FACFGD/F06FDNWAYTZ/logs (Sorry, I was not able to create text snippet in this thread)
c
Jan 26 13:27:03 test-cp-6ecb68b2-v77bv rancher-system-agent[8157]: time="2024-01-26T13:27:03Z" level=info msg="[ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a_4:stderr]: + exec rke2 server --cluster-reset --etcd-arg=advertise-client-urls=https://127.0.0.1:2379 --etcd-disable-snapshots=false --cluster-reset-restore-path=db/snapshots/etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203 --etcd-s3=false"

Jan 26 13:27:13 test-cp-6ecb68b2-v77bv rancher-system-agent[8157]: time="2024-01-26T13:27:13Z" level=info msg="[ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a_4:stderr]: time=\"2024-01-26T13:27:13Z\" level=fatal msg=\"starting kubernetes: preparing server: start managed database: etcd: snapshot path does not exist: db/snapshots/etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203\""
You asked to restore an etcd snapshot that does not exist
or perhaps no longer exists
Did you try to restore a snapshot, that got stuck, and then you tried to delete the node that is stuck trying to restore the snapshot?
you’re not going to be able to delete the node until it finishes the restore first.
Figure out what snapshots you actually have on that node, and then edit the cluster YAML in Rancher to have it restore that snapshot instead.
b
Yes, I was trying to restore the snapshot that was backed up 39 days ago. I waited for a while and got an error, so after that I decided to delete the node. When I tried to delete the node, I got the error I’m seeing now. https://rancher-users.slack.com/archives/C01PHNP149L/p1706294875343839?thread_ts=1706287552.898239&cid=C01PHNP149L
Here are the list of snapshots.
c
what files do you see under /var/lib/rancher/rke2/server/db/snapshots/ on that node?
in the UI? In cluster management, go into the cluster, click the config button, there should be an Edit as YAML option I believe?
b
Yup, I can see
c
I believe there should be a section somewhere in there that contains the name of the snapshot you’re trying to restore. Edit it to target another one that still exists on disk on that node.
b
It says permission denied
c
be root
or use sudo?
basic linux stuff
b
Sorry, I am not much experienced with Linux.
etcd-snapshot-test-cp-6ecb68b2-v77bv-1706176804  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
etcd-snapshot-test-cp-6ecb68b2-v77bv-1706194804  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706227204
c
ok, so for some reason those are the only snapshots you have available. If you do ls -la it will show you the timestamps
Probably you don’t have any that are 38 days old or however far back you were trying to restore
b
Here is the content of the YAML file. I am not sure what to update here.
c
Did you look for the snapshot name in there, and replace it with the name of another snapshot that does exist?
hint, look for the etcdSnapshotRestore section, and the name field…
you’ll want to increment the generation by 1 as well, to indicate a new operation.
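For orientation, the relevant part of the provisioning cluster YAML looks roughly like the sketch below. This is a hedged example, not copied from the cluster in this thread: the snapshot name is one from the on-disk listing above, and the generation value is illustrative (whatever the current value is, plus one). The exact layout can vary by Rancher version.

```yaml
spec:
  rkeConfig:
    etcdSnapshotRestore:
      # Replace with the name of a snapshot that actually exists on disk.
      name: etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
      # Increment by 1 from the current value to trigger a new restore operation.
      generation: 2
```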
b
Okay, I made changes
error retrieving etcdsnapshot fleet-default/etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800: etcdsnapshots.rke.cattle.io "etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800" not found
I saw this error message now.
c
hmm for some reason rancher appears to be way out of sync with the actual snapshots on those nodes. Look at the snapshots on disk on that node, and look at the snapshots listed in Rancher, and find one that exists in both.
b
Yes, both exist, but in the UI it is being displayed as 0 B
c
ah yeah, it’s confused.
that is a bug in rancher, should be fixed on newer releases I think
was the one you originally picked to restore also 0 bytes?
b
No, it was not.
c
interesting
if you don’t have any snapshots that are on disk but don’t show as 0 bytes in the UI, you could try picking one of the ones that is not 0 bytes, and restoring that, and just copy one of the files on disk to have the correct name.
b
There is nothing.
On disk, the snapshots are all from within the last 1 - 2 days.
But on the UI side, only the snapshots from 29 - 30 days ago have a non-zero size. All the recent snapshots in the UI, from within the last 2 days, show 0 bytes.
c
yeah it sounds like something was broken that prevented rancher from properly syncing the state of this cluster. The longer you leave it like that, the more stale the info gets.
b
So how can I just remove this node from my rancher cluster to recover my cluster?
c
you can’t delete a node while it’s trying to restore the cluster. You need to finish the first operation that you requested.
b
But right now, I think there is no way to continue my snapshot request.
Is there any way to forcefully terminate my snapshot request?
c
you can try removing that whole restore section from the cluster yaml but I’m not sure what that’ll do.
were you not able to just copy one of the snapshot files on disk to allow the restore to work?
b
Where should I copy that file to?
I am not sure where these files are existing.
c
I would: 1. Copy one of the files in the snapshots dir, into a file with a name that matches the snapshot it was originally trying to restore. 2. edit the cluster yaml and change the snapshot name back to the original name, and increment the revision again. At that point the restore should succeed, as it has a file on disk that matches the snapshot it is trying to restore.
where should you copy it to? just in the same dir with all the other snapshot files.
cp <file that exists> <snapshot name that it was trying to restore>
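Sketched end-to-end, the copy step looks like this. This is a simulation in a temporary directory so it can be run anywhere; on the real node the directory would be /var/lib/rancher/rke2/server/db/snapshots/ and the cp would need to be run as root. The snapshot names are taken from this thread.

```shell
# Stand-in for /var/lib/rancher/rke2/server/db/snapshots/ on the node.
SNAP_DIR=$(mktemp -d)

# A snapshot that exists on disk, and the name the pending restore expects.
EXISTING=etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
MISSING=etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203

# Stand-in content for the existing snapshot file (on the node it already exists).
echo "etcd snapshot data" > "$SNAP_DIR/$EXISTING"

# Copy it to the name the restore plan is looking for, so the plan can find it.
cp "$SNAP_DIR/$EXISTING" "$SNAP_DIR/$MISSING"

ls -la "$SNAP_DIR"
```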
b
I have also tried that, but it doesn’t work.
I think it doesn’t read the file from disk even though I made the file name match the one in the UI.
Also, for the etcdSnapshotRestore name section, I needed to name it like
test-<disk-snapshotname>-local
c
You should have the file on disk before you change the snapshot name and increment the revision, otherwise you’ll see the same error about the file not existing.
b
Yes, I have confirmed. The file exists and I incremented the generation.
In this case, what would happen if I just remove the VM instance of the problematic CP node?
c
I am not sure. There is probably a way to force delete the node from the cluster despite the pending operation but I’m not a rancher dev, I don’t know for sure.
I work on RKE2 and k3s.