# rke2
c
did you do what it says, and look in the log on that node?
b
@creamy-pencil-82913 yes, I can post screenshot soon.
☝️ Here is the log I’ve found
c
can you get the actual logs instead of a screenshot? The lines are all truncated.
The bits of the logs that are visible suggest that this is the first time the agent has been run on this node, is that correct? Did this node complete the install before you attempted to delete it?
b
The screenshot I am showing is the full log.
rancher-system-agent.service - Rancher System Agent
     Loaded: loaded (/etc/systemd/system/rancher-system-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-01-26 17:27:31 UTC; 9s ago
       Docs: https://www.rancher.com
   Main PID: 2684 (rancher-system-)
      Tasks: 14 (limit: 19154)
     Memory: 13.2M
     CGroup: /system.slice/rancher-system-agent.service
             └─2684 /usr/local/bin/rancher-system-agent sentinel

Jan 26 17:27:31 test-cp-6ecb68b2-v77bv systemd[1]: Started Rancher System Agent.
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is s>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Using directory /var/lib/rancher/agent/work for wo>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Starting remote watch of plans"
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: E0126 17:27:31.386641    2684 memcache.go:206] couldn't get resource list for management.cattl>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Starting /v1, Kind=Secret controller"
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=info msg="Detected first start, force-applying one-time inst>
Jan 26 17:27:31 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:31Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan>
Jan 26 17:27:36 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:27:36Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan>
c
it’s still truncated to terminal width. Don’t use systemctl status to view logs.
use journalctl -u rancher-system-agent --no-pager instead of systemctl status rancher-system-agent
b
Jan 26 17:33:26 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:33:26Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan with checksum value of ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a, (failures: 5, threshold: 1)"
[... the same "Maximum failure threshold exceeded" message repeats every 5 seconds ...]
Jan 26 17:36:41 test-cp-6ecb68b2-v77bv rancher-system-agent[2684]: time="2024-01-26T17:36:41Z" level=error msg="[K8s] Maximum failure threshold exceeded for plan with checksum value of ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a, (failures: 5, threshold: 1)"
https://rancher-users.slack.com/archives/C01PHNP149L/p1706290372505929?thread_ts=1706287552.898239&cid=C01PHNP149L After restarting the Rancher cluster nodes, only this control plane was not working. So at first I wanted to restore a snapshot from 39 days ago, but it didn’t work. After that, I decided to scale down this node to recover my cluster, and then this happened.
c
that log message isn’t very helpful, it just says that the plan has failed 5 times, and it won’t try anything new until the plan changes. Go back further in the logs to see one of the 5 failures.
b
Here is logs for that @creamy-pencil-82913. https://rancher-users.slack.com/files/U0694FACFGD/F06FDNWAYTZ/logs (Sorry, I was not able to create text snippet in this thread)
c
Jan 26 13:27:03 test-cp-6ecb68b2-v77bv rancher-system-agent[8157]: time="2024-01-26T13:27:03Z" level=info msg="[ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a_4:stderr]: + exec rke2 server --cluster-reset --etcd-arg=advertise-client-urls=https://127.0.0.1:2379 --etcd-disable-snapshots=false --cluster-reset-restore-path=db/snapshots/etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203 --etcd-s3=false"

Jan 26 13:27:13 test-cp-6ecb68b2-v77bv rancher-system-agent[8157]: time="2024-01-26T13:27:13Z" level=info msg="[ecdb389ebc74e35a952e966a63509235689fbb90894b0ae49954d463ee5ed52a_4:stderr]: time=\"2024-01-26T13:27:13Z\" level=fatal msg=\"starting kubernetes: preparing server: start managed database: etcd: snapshot path does not exist: db/snapshots/etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203\""
You asked to restore an etcd snapshot that does not exist
or perhaps no longer exists
Did you try to restore a snapshot, that got stuck, and then you tried to delete the node that is stuck trying to restore the snapshot?
you’re not going to be able to delete the node until it finishes the restore first.
Figure out what snapshots you actually have on that node, and then edit the cluster YAML in Rancher to have it restore that snapshot instead.
b
Yes, I was trying to restore the snapshot that was backed up 39 days ago. I waited for a while and got an error, so after that I decided to delete the node. When I tried to delete the node, I got the error I’m seeing now. https://rancher-users.slack.com/archives/C01PHNP149L/p1706294875343839?thread_ts=1706287552.898239&cid=C01PHNP149L
Here are the list of snapshots.
c
what files do you see under /var/lib/rancher/rke2/server/db/snapshots/ on that node?
in the UI? In cluster management, go into the cluster, click the config button, there should be an Edit as YAML option I believe?
b
Yup, I can see
c
I believe there should be a section somewhere in there that contains the name of the snapshot you’re trying to restore. Edit it to target another one that still exists on disk on that node.
b
It says permission denied
c
be root
or use sudo?
basic linux stuff
b
Sorry, I am not much experienced with Linux.
etcd-snapshot-test-cp-6ecb68b2-v77bv-1706176804  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
etcd-snapshot-test-cp-6ecb68b2-v77bv-1706194804  etcd-snapshot-test-cp-6ecb68b2-v77bv-1706227204
c
ok, so for some reason those are the only snapshots you have available. If you do ls -la it will show you the timestamps
Probably you don’t have any that are 38 days old or however far back you were trying to restore
b
Here is the content of the YAML file. I am not sure what to update here.
c
Did you look for the snapshot name in there, and replace it with the name of another snapshot that does exist?
hint, look for the etcdSnapshotRestore section, and the name field…
you’ll want to increment the generation by 1 as well, to indicate a new operation.
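For orientation, the relevant part of the provisioning cluster YAML looks roughly like the sketch below. This is a hedged example, not copied from the cluster in this thread: the snapshot name is one from the on-disk listing above, and the generation value is illustrative (whatever the current value is, plus one). The exact layout can vary by Rancher version.

```yaml
spec:
  rkeConfig:
    etcdSnapshotRestore:
      # Replace with the name of a snapshot that actually exists on disk.
      name: etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
      # Increment by 1 from the current value to trigger a new restore operation.
      generation: 2
```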
b
Okay, I made changes
error retrieving etcdsnapshot fleet-default/etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800: etcdsnapshots.rke.cattle.io "etcd-snapshot-test-cp-6ecb68b2-v77bv-1706212800" not found
I saw this error message now.
c
hmm for some reason rancher appears to be way out of sync with the actual snapshots on those nodes. Look at the snapshots on disk on that node, and look at the snapshots listed in Rancher, and find one that exists in both.
b
Yes, both exist, but in the UI it is being displayed as 0 B
c
ah yeah, it’s confused.
that is a bug in rancher, should be fixed on newer releases I think
was the one you originally picked to restore also 0 bytes?
b
No, it was not.
c
interesting
if you don’t have any snapshots that are on disk but don’t show as 0 bytes in the UI, you could try picking one of the ones that is not 0 bytes, and restoring that, and just copy one of the files on disk to have the correct name.
b
There is nothing.
On disk, the snapshots are all from within the last 1 - 2 days.
But on the UI side, only the snapshots from 29 - 30 days ago have a non-zero size. All the recent snapshots in the UI, from within the last 2 days, show 0 bytes.
c
yeah it sounds like something was broken that prevented rancher from properly syncing the state of this cluster. The longer you leave it like that, the more stale the info gets.
b
So how can I just remove this node from my rancher cluster to recover my cluster?
c
you can’t delete a node while it’s trying to restore the cluster. You need to finish the first operation that you requested.
b
But right now, I think there is no way to continue my snapshot request.
Is there any way to forcefully terminate my snapshot request?
c
you can try removing that whole restore section from the cluster yaml but I’m not sure what that’ll do.
were you not able to just copy one of the snapshot files on disk to allow the restore to work?
b
Where should I copy that file to?
I am not sure where these files are existing.
c
I would: 1. Copy one of the files in the snapshots dir, into a file with a name that matches the snapshot it was originally trying to restore. 2. edit the cluster yaml and change the snapshot name back to the original name, and increment the revision again. At that point the restore should succeed, as it has a file on disk that matches the snapshot it is trying to restore.
where should you copy it to? just in the same dir with all the other snapshot files.
cp <file that exists> <snapshot name that it was trying to restore>
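Sketched end-to-end, the copy step looks like this. This is a simulation in a temporary directory so it can be run anywhere; on the real node the directory would be /var/lib/rancher/rke2/server/db/snapshots/ and the cp would need to be run as root. The snapshot names are taken from this thread.

```shell
# Stand-in for /var/lib/rancher/rke2/server/db/snapshots/ on the node.
SNAP_DIR=$(mktemp -d)

# A snapshot that exists on disk, and the name the pending restore expects.
EXISTING=etcd-snapshot-test-cp-6ecb68b2-v77bv-1706245203
MISSING=etcd-snapshot-test-cp-6ecb68b2-v77bv-1703707203

# Stand-in content for the existing snapshot file (on the node it already exists).
echo "etcd snapshot data" > "$SNAP_DIR/$EXISTING"

# Copy it to the name the restore plan is looking for, so the plan can find it.
cp "$SNAP_DIR/$EXISTING" "$SNAP_DIR/$MISSING"

ls -la "$SNAP_DIR"
```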
b
I have also tried that, but it doesn’t work.
I think it doesn’t read the file from disk even though I made the file name match the one in the UI.
Also, for the etcdSnapshotRestore name section, I needed to name it like
test-<disk-snapshotname>-local
c
You should have the file on disk before you change the snapshot name and increment the revision, otherwise you’ll see the same error about the file not existing.
b
Yes, I have confirmed. The file exists and I incremented the generation.
In this case, what would happen if I just remove the VM instance of the problematic CP node?
c
I am not sure. There is probably a way to force delete the node from the cluster despite the pending operation but I’m not a rancher dev, I don’t know for sure.
I work on RKE2 and k3s.