# k3s
c
sounds like the etcd datastore files got corrupted on disk
did you have an unexpected power outage or something?
f
yes, exactly what happened
now this master won't start and the others are complaining
is there a way to revert to older db?
c
you can restore from a snapshot, yes. if you took snapshots.
do you have more than one server?
f
it seems to automatically do so.
root@k8s1:/var/lib/rancher/k3s/server/db/snapshots# ll
total 94328
drwx------ 2 root root     4096 Jan 22 12:00 ./
drwx------ 5 root root     4096 Jan 22 20:02 ../
-rw------- 1 root root 19312672 Jan 20 12:00 etcd-snapshot-k8s1-1705752000
-rw------- 1 root root 19312672 Jan 21 00:00 etcd-snapshot-k8s1-1705795209
-rw------- 1 root root 19312672 Jan 21 12:00 etcd-snapshot-k8s1-1705838401
-rw------- 1 root root 19312672 Jan 22 00:00 etcd-snapshot-k8s1-1705881600
-rw------- 1 root root 19312672 Jan 22 12:00 etcd-snapshot-k8s1-1705924801
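(For reference, restoring one of those snapshots on this server would look roughly like the sketch below, per the k3s snapshot-restore procedure; the unit name `k3s` and the choice of the newest snapshot are assumptions.)

```
# stop k3s on the server being restored
systemctl stop k3s

# reset cluster membership and restore the chosen snapshot
# (path taken from the listing above)
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-k8s1-1705924801

# start the service again once the reset completes
systemctl start k3s
```

After a restore like this, the other servers need their own datastore directories wiped before they rejoin.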
c
if so, you could also just do a cluster-reset on one of the working servers, and then rejoin the one that has bad datastore files.
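(On one of the still-working servers, that reset is roughly the following; `k3s` as the systemd unit name is an assumption.)

```
# collapse etcd membership down to this single healthy server
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s
```

The server with the bad files then rejoins afterwards, as covered below.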
f
yes 2 other masters
c
if the other 2 are both fine, then just delete this node from the Kubernetes cluster, delete the db files from disk, and then rejoin it.
f
they don't start though, unfortunately, complaining about the missing host
c
however, you mentioned that you can’t use kubectl on the other two nodes either, which makes me suspect that they are not in fact fine?
remove the server value from their config, and see if they come up ok.
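(The join address usually lives either in /etc/rancher/k3s/config.yaml or as a --server flag in the systemd unit; which one applies here is an assumption. A sketch for the config.yaml case:)

```
# comment out the server: line so startup doesn't block on the dead server
sudo sed -i 's/^server:/#server:/' /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
```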
what is the specific error
f
in systemd?
sec
Jan 22 20:05:26 k8s6i k3s[29875]: {"level":"warn","ts":"2024-01-22T20:05:26.935Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"8c851274d41473a4","rtt":"0s","error":"dial tcp 192.168.1.151:2380: connect: connection refused"}
.151 is my primary master
this is .156
c
that’s not a fatal error
that’s just telling you that one of the etcd cluster members is not available
it will keep trying to reconnect as long as it's a cluster member
f
Jan 22 20:05:36 k8s6i k3s[29875]: {"level":"warn","ts":"2024-01-22T20:05:36.077Z","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008c8540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""}
Jan 22 20:05:36 k8s6i k3s[29875]: {"level":"info","ts":"2024-01-22T20:05:36.077Z","logger":"etcd-client","caller":"v3@v3.5.7-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
c
it’s hard to work against just snippets of logs
f
ok let me grab something longer
c
you’re likely omitting other important stuff that provides context on why it’s not starting up
you’re sure you had 3 working etcd nodes before this?
f
AFAIK yes, this cluster is 6 nodes, 3 masters
c
it kinda sounds like maybe you were skating along with 2/3 before this, and now you don’t even have 2 to get quorum.
f
last 1000 lines of the .156 master
Last 1000 lines of .151 master (the one with the db issue)
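(Something along these lines collects those, assuming k3s runs as the `k3s` systemd unit:)

```
# dump the last 1000 lines of the k3s unit log to a file for sharing
journalctl -u k3s -n 1000 --no-pager > k3s-$(hostname).log
```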
c
what about the other node
you need 2 of the 3 nodes online at the same time for the etcd cluster to have quorum
Are the two working ones just taking turns crashing due to lack of quorum, and never trying to come up at the same time?
stop k3s on all three servers, and try to start the two working ones simultaneously
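(Concretely, something like this, assuming the `k3s` systemd unit on each server:)

```
# on all three servers:
systemctl stop k3s

# then, on the two servers with intact datastores, at roughly the same time:
systemctl start k3s
```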
f
ok that one seems to be coming up
yep, that one is in ready status in kubectl
let me try that
yessss. those two are up now
NAME    STATUS                        ROLES                       AGE      VERSION
k8s1    NotReady                      control-plane,etcd,master   3y17d    v1.27.2+k3s1
k8s2    NotReady,SchedulingDisabled   <none>                      2y1d     v1.27.2+k3s1
k8s3    Ready                         <none>                      2y344d   v1.27.2+k3s1
k8s4i   Ready                         control-plane,etcd,master   2y       v1.27.2+k3s1
k8s5i   NotReady                      <none>                      616d     v1.27.2+k3s1
k8s6i   Ready                         control-plane,etcd,master   468d     v1.27.2+k3s1
c
there you go
now just `kubectl delete node` the broken server, delete the etcd files from disk, and then rejoin it to the cluster as if it was coming in new.
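(A minimal sketch of those steps; the node name and db path come from earlier in the thread, and the rejoin flags are discussed just below.)

```
# from one of the working servers: remove the broken node from the cluster
kubectl delete node k8s1

# on the broken server (k8s1): wipe the corrupted etcd state
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db

# start it again with --server pointing at a working node (see below)
systemctl start k3s
```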
f
should I run the k3s uninstall script?
this is amazing help thank you
is this the correct way to make sure traefik and servicelb are disabled?
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.10+k3s1 INSTALL_K3S_CHANNEL=latest K3S_TOKEN="12345" sh -s - server --server https://192.168.1.151:6443 --disable traefik --disable servicelb
c
no you don’t need to uninstall/reinstall. just stop the service, delete the node from the cluster on one of the working nodes, delete the files from disk on that node, then restart it with `--server` pointing at one of the working nodes.
yes that’d work
but just editing the systemd unit file and doing a daemon-reload+restart is probably less fragile
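(Assuming the install-script defaults of /etc/systemd/system/k3s.service and /usr/local/bin/k3s, the edit-and-restart approach looks roughly like this; the .156 address is one of the working servers from earlier in the thread.)

```
# adjust the ExecStart line in the unit, e.g.:
#   ExecStart=/usr/local/bin/k3s server \
#       --server https://192.168.1.156:6443 \
#       --disable traefik --disable servicelb
sudo vi /etc/systemd/system/k3s.service

# pick up the change and restart
sudo systemctl daemon-reload
sudo systemctl restart k3s
```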
f
ok cool! thanks again! Love k3s by the way