# k3s
c
sounds like the etcd datastore files got corrupted on disk
did you have an unexpected power outage or something?
f
yes, exactly what happened
now this master won't start and the others are complaining
is there a way to revert to older db?
c
you can restore from a snapshot, yes. if you took snapshots.
do you have more than one server?
f
it seems to automatically do so.
root@k8s1:/var/lib/rancher/k3s/server/db/snapshots# ll
total 94328
drwx------ 2 root root     4096 Jan 22 12:00 ./
drwx------ 5 root root     4096 Jan 22 20:02 ../
-rw------- 1 root root 19312672 Jan 20 12:00 etcd-snapshot-k8s1-1705752000
-rw------- 1 root root 19312672 Jan 21 00:00 etcd-snapshot-k8s1-1705795209
-rw------- 1 root root 19312672 Jan 21 12:00 etcd-snapshot-k8s1-1705838401
-rw------- 1 root root 19312672 Jan 22 00:00 etcd-snapshot-k8s1-1705881600
-rw------- 1 root root 19312672 Jan 22 12:00 etcd-snapshot-k8s1-1705924801
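(For reference, restoring one of those snapshots on this server would look roughly like the sketch below, per the k3s snapshot-restore procedure; the unit name `k3s` and the choice of the newest snapshot are assumptions.)

```
# stop k3s on the server being restored
systemctl stop k3s

# reset cluster membership and restore the chosen snapshot
# (path taken from the listing above)
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-k8s1-1705924801

# start the service again once the reset completes
systemctl start k3s
```

After a restore like this, the other servers need their own datastore directories wiped before they rejoin.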
c
if so, you could also just do a cluster-reset on one of the working servers, and then rejoin the one that has bad datastore files.
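(On one of the still-working servers, that reset is roughly the following; `k3s` as the systemd unit name is an assumption.)

```
# collapse etcd membership down to this single healthy server
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s
```

The server with the bad files then rejoins afterwards, as covered below.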
f
yes 2 other masters
c
if the other 2 are both fine, then just delete this node from the Kubernetes cluster, delete the db files from disk, and then rejoin it.
f
they don't start though, unfortunately, complaining about the missing host
c
however, you mentioned that you can’t use kubectl on the other two nodes either, which makes me suspect that they are not in fact fine?
remove the server value from their config, and see if they come up ok.
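(The join address usually lives either in /etc/rancher/k3s/config.yaml or as a --server flag in the systemd unit; which one applies here is an assumption. A sketch for the config.yaml case:)

```
# comment out the server: line so startup doesn't block on the dead server
sudo sed -i 's/^server:/#server:/' /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
```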
what is the specific error
f
in systemd?
sec
Jan 22 20:05:26 k8s6i k3s[29875]: {"level":"warn","ts":"2024-01-22T20:05:26.935Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"8c851274d41473a4","rtt":"0s","error":"dial tcp 192.168.1.151:2380: connect: connection refused"}
.151 is my primary master
this is .156
c
that’s not a fatal error
that’s just telling you that one of the etcd cluster members is not available
it will keep trying to reconnect as long as it's a cluster member
f
Jan 22 20:05:36 k8s6i k3s[29875]: {"level":"warn","ts":"2024-01-22T20:05:36.077Z","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008c8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}
Jan 22 20:05:36 k8s6i k3s[29875]: {"level":"info","ts":"2024-01-22T20:05:36.077Z","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
c
it’s hard to work against just snippets of logs
f
ok let me grab something longer
c
you’re likely omitting other important stuff that provides context on why it’s not starting up
you’re sure you had 3 working etcd nodes before this?
f
AFAIK yes, this cluster is 6 nodes, 3 masters
c
it kinda sounds like maybe you were skating along with 2/3 before this, and now you don’t even have 2 to get quorum.
f
last 1000 lines of the .156 master
Last 1000 lines of .151 master (the one with the db issue)
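(Something along these lines collects those, assuming k3s runs as the `k3s` systemd unit:)

```
# dump the last 1000 lines of the k3s unit log to a file for sharing
journalctl -u k3s -n 1000 --no-pager > k3s-$(hostname).log
```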
c
what about the other node
you need 2 of the 3 nodes online at the same time for the etcd cluster to have quorum
Are the two working ones just taking turns crashing due to lack of quorum, and never trying to come up at the same time?
stop k3s on all three servers, and try to start the two working ones simultaneously
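(Concretely, something like this, assuming the `k3s` systemd unit on each server:)

```
# on all three servers:
systemctl stop k3s

# then, on the two servers with intact datastores, at roughly the same time:
systemctl start k3s
```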
f
ok that one seems to be coming up
yep, that one is in ready status in kubectl
let me try that
yessss. those two are up now
NAME    STATUS                        ROLES                       AGE      VERSION
k8s1    NotReady                      control-plane,etcd,master   3y17d    v1.27.2+k3s1
k8s2    NotReady,SchedulingDisabled   <none>                      2y1d     v1.27.2+k3s1
k8s3    Ready                         <none>                      2y344d   v1.27.2+k3s1
k8s4i   Ready                         control-plane,etcd,master   2y       v1.27.2+k3s1
k8s5i   NotReady                      <none>                      616d     v1.27.2+k3s1
k8s6i   Ready                         control-plane,etcd,master   468d     v1.27.2+k3s1
c
there you go
now just `kubectl delete node` the broken server, delete the etcd files from disk, and then rejoin it to the cluster as if it was coming in new.
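(A minimal sketch of those steps; the node name and db path come from earlier in the thread, and the rejoin flags are discussed just below.)

```
# from one of the working servers: remove the broken node from the cluster
kubectl delete node k8s1

# on the broken server (k8s1): wipe the corrupted etcd state
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db

# start it again with --server pointing at a working node (see below)
systemctl start k3s
```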
f
should I run the k3s uninstall script?
this is amazing help thank you
is this the correct way to make sure traefik and servicelb are disabled?
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.10+k3s1 INSTALL_K3S_CHANNEL=latest K3S_TOKEN="12345" sh -s - server --server https://192.168.1.151:6443 --disable traefik --disable servicelb
c
no you don’t need to uninstall/reinstall. just stop the service, delete the node from the cluster on one of the working nodes, delete the files from disk on that node, then restart it with `--server` pointing at one of the working nodes.
yes that’d work
but just editing the systemd unit file and doing a daemon-reload+restart is probably less fragile
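(Assuming the install-script defaults of /etc/systemd/system/k3s.service and /usr/local/bin/k3s, the edit-and-restart approach looks roughly like this; the .156 address is one of the working servers from earlier in the thread.)

```
# adjust the ExecStart line in the unit, e.g.:
#   ExecStart=/usr/local/bin/k3s server \
#       --server https://192.168.1.156:6443 \
#       --disable traefik --disable servicelb
sudo vi /etc/systemd/system/k3s.service

# pick up the change and restart
sudo systemctl daemon-reload
sudo systemctl restart k3s
```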
f
ok cool! thanks again! Love k3s by the way