https://rancher.com/ logo
#rke2
Title
# rke2
h

hundreds-hairdresser-46043

10/31/2022, 12:53 PM
interesting observation on RKE2: if i restore a VM snapshot of a running RKE2 cluster the masters never recover. not even with a reboot. I have to rebuild the cluster very time so far.
s

sparse-fireman-14239

11/02/2022, 10:35 AM
Are you restoring one master or all masters? Were the snapshots taken at the same time?
h

hundreds-hairdresser-46043

11/02/2022, 11:09 AM
@sparse-fireman-14239 snap shot of every single server in the cluster which for the LAB/Testing is 2 masters, 3 storage nodes and 2 worker nodes. They are all snapshot at the same time. Their were installed using the airgap method (copying the images the the servers vs fetching them from the internet). I have tried a few things like rebooting them immediately after the restore so any date and time issue is resolved, any stale component is reset etc. At some point i though it might be faulty snapshots, so i delete every single VM and recreated them with fresh snapshots. same problem. Then i tried a rolling start (one at a time) until the services are running again. But the masters never come online again. Also the masters have issues with the ETCD database. It never starts. This is weird - since it is a freshly installed environment. So the cluster is clean.
s

sparse-fireman-14239

11/02/2022, 11:24 AM
Not sure if it’ll work with 2 masters. etcd is probably not able to achieve consensus due to how the Raft protocol works.
h

hundreds-hairdresser-46043

11/02/2022, 11:35 AM
@sparse-fireman-14239 I will add a 3rd node for a test. Only reason I went with 2 is because it is test/dev. Thanks for that hint
@sparse-fireman-14239 adding 3 master nodes had no effect. Still unable to restart the cluster after a snapshot restore. I even directly fetched rancher cluster containers vs airgap build this time around to eliminate a possible issue of the problem being the airgap build I retrieved the ETCD logs and i see this mostly:
2022-11-02T16:20:25.401593123+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:25.401Z","caller":"etcdserver/cluster_util.go:288","msg":"failed to reach the peer URL","address":"<https://172.16.39.171:2380/version>","remote-member-id":"79db4e27fd0ed36b","error":"Get \"<https://172.16.39.171:2380/version>\": dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-02T16:20:25.401688272+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:25.401Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"79db4e27fd0ed36b","error":"Get \"<https://172.16.39.171:2380/version>\": dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-02T16:20:25.950086969+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:25.949Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"79db4e27fd0ed36b","rtt":"0s","error":"dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-02T16:20:25.953507356+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:25.953Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"79db4e27fd0ed36b","rtt":"0s","error":"dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-02T16:20:29.40323715+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:29.402Z","caller":"etcdserver/cluster_util.go:288","msg":"failed to reach the peer URL","address":"<https://172.16.39.171:2380/version>","remote-member-id":"79db4e27fd0ed36b","error":"Get \"<https://172.16.39.171:2380/version>\": dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-02T16:20:29.403322405+02:00 stderr F {"level":"warn","ts":"2022-11-02T14:20:29.402Z","caller":"etcdserver/cluster_util.go:155","msg":"failed to get version","remote-member-id":"79db4e27fd0ed36b","error":"Get \"<https://172.16.39.171:2380/version>\": dial tcp 172.16.39.171:2380: connect: connection refused"}
2022-11-03T12:00:32.233886931+02:00 stderr F {"level":"info","ts":"2022-11-03T10:00:32.224Z","caller":"traceutil/trace.go:171","msg":"trace[1881690682] range","detail":"{range_begin:/registry/events/default/zacptprsvmkm97.17240a9719f51ea9; range_end:; }","duration":"16.595999869s","start":"2022-11-03T10:00:15.628Z","end":"2022-11-03T10:00:32.224Z","steps":["trace[1881690682] 'agreement among raft nodes before linearized reading'  (duration: 16.595821866s)"],"step_count":1}
2022-11-03T12:00:32.233898328+02:00 stderr F {"level":"warn","ts":"2022-11-03T10:00:32.224Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-11-03T10:00:15.628Z","time spent":"16.596100488s","remote":"127.0.0.1:36332","response type":"/etcdserverpb.KV/Range","request count":0,"request size":72,"response count":0,"response size":0,"request content":"key:\"/registry/events/default/zacptprsvmkm97.17240a9719f51ea9\" "}
2022-11-03T12:00:32.233913835+02:00 stderr F WARNING: 2022/11/03 10:00:32 [core] grpc: Server.processUnaryRPC failed to write status: connection *error:* desc = "transport is closing"
2022-11-03T12:00:32.510851538+02:00 stderr F {"level":"warn","ts":"2022-11-03T10:00:32.510Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"1.900992659s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/services/specs/\" range_end:\"/registry/services/specs0\" limit:500 ","response":"","error":"context deadline exceeded"}
2022-11-03T12:00:32.510912896+02:00 stderr F {"level":"info","ts":"2022-11-03T10:00:32.510Z","caller":"traceutil/trace.go:171","msg":"trace[1175778600] range","detail":"{range_begin:/registry/services/specs/; range_end:/registry/services/specs0; }","duration":"1.901132255s","start":"2022-11-03T10:00:30.609Z","end":"2022-11-03T10:00:32.510Z","steps":["trace[1175778600] 'agreement among raft nodes before linearized reading'  (duration: 1.900986403s)"],"step_count":1}
2022-11-03T12:00:32.511067295+02:00 stderr F {"level":"warn","ts":"2022-11-03T10:00:32.510Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-11-03T10:00:30.609Z","time spent":"1.901204626s","remote":"127.0.0.1:36490","response type":"/etcdserverpb.KV/Range","request count":0,"request size":57,"response count":0,"response size":0,"request content":"key:\"/registry/services/specs/\" range_end:\"/registry/services/specs0\" limit:500 "}
2022-11-03T12:00:32.562628822+02:00 stderr F {"level":"warn","ts":"2022-11-03T10:00:32.562Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":7871874594921114681,"retry-timeout":"500ms"}
2022-11-03T12:02:57.399955246+02:00 stderr F {"level":"error","ts":"2022-11-03T10:02:57.399Z","caller":"embed/etcd.go:259","msg":"checkInitialHashKV failed","error":"44fe8000a6b18322 found data inconsistency with peers","stacktrace":"<http://go.etcd.io/etcd/server/v3/embed.StartEtcd|go.etcd.io/etcd/server/v3/embed.StartEtcd>\n\t/go/src/go.etcd.io/etcd/server/embed/etcd.go:259\<http://ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd|ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd>\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:228\<http://ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2|ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2>\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:123\<http://ngo.etcd.io/etcd/server/v3/etcdmain.Main|ngo.etcd.io/etcd/server/v3/etcdmain.Main>\n\t/go/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\<http://tgo.etcd.io/etcd/server/main.go:32|tgo.etcd.io/etcd/server/main.go:32>\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
2022-11-03T12:02:57.410100318+02:00 stderr F {"level":"info","ts":"2022-11-03T10:02:57.409Z","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"zacptprsvmkm98-11693588","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["<http://localhost:2380>"],"advertise-client-urls":["<https://172.16.39.170:2379>"]}
2022-11-03T12:02:57.411478016+02:00 stderr F {"level":"info","ts":"2022-11-03T10:02:57.410Z","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"zacptprsvmkm98-11693588","data-dir":"/var/lib/rancher/rke2/server/db/etcd","advertise-peer-urls":["<http://localhost:2380>"],"advertise-client-urls":["<https://172.16.39.170:2379>"]}
2022-11-03T12:02:57.411522255+02:00 stderr F {"level":"fatal","ts":"2022-11-03T10:02:57.410Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"44fe8000a6b18322 found data inconsistency with peers","stacktrace":"<http://go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2|go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2>\n\t/go/src/go.etcd.io/etcd/server/etcdmain/etcd.go:204\<http://ngo.etcd.io/etcd/server/v3/etcdmain.Main|ngo.etcd.io/etcd/server/v3/etcdmain.Main>\n\t/go/src/go.etcd.io/etcd/server/etcdmain/main.go:40\nmain.main\n\<http://tgo.etcd.io/etcd/server/main.go:32|tgo.etcd.io/etcd/server/main.go:32>\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
202 Views