# k3s
c
How much notice do you have that power is going? We have a reboot function in our cluster, and we basically just execute the reboot command (which shuts down services cleanly - we think!)
(these are proper servers running ubuntu though, not an embedded device)
q
I am not certain on that yet. It's still under design, but I am guessing very little, like seconds…
I guess one final fsync to the database would probably keep us safe? I think it's sqlite? I wonder if there is a command I can give it.
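(For reference, sqlite itself can be asked to flush. A minimal sketch, assuming the default k3s sqlite location and that the database is in WAL mode; adjust the path if your install differs:)

```bash
# Force a WAL checkpoint so pending writes land in the main DB file,
# then flush filesystem buffers before power drops.
# Path assumes the k3s default sqlite datastore location.
sqlite3 /var/lib/rancher/k3s/server/db/state.db 'PRAGMA wal_checkpoint(TRUNCATE);'
sync   # push dirty pages to disk
```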
n
If you're running K3s as a service, you can always just stop the service.
There isn't a tool for backing up the sqlite db; your only option is to make a copy of `/var/lib/rancher/k3s/server`. You would need to switch to the embedded etcd db for that functionality. See https://docs.k3s.io/backup-restore#backup-and-restore-with-embedded-etcd-datastore-experimental
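(A minimal sketch of that copy-based backup, assuming a systemd install and the default data dir; the `/backup` destination is just an example:)

```bash
# Stop k3s so the sqlite files are quiescent, copy the server dir, restart.
systemctl stop k3s
cp -a /var/lib/rancher/k3s/server /backup/k3s-server-$(date +%F)
systemctl start k3s
```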
q
Do you see much of a performance difference when you use the etcd version @nutritious-tomato-14686? We are right at the very edge on CPU performance. And do you find one to be less corruptible than the other? We actually don't care about the 'state' of any of our deployments, as long as they come back up on the next restart.
So I guess my only concern is: what can I harm by shutting k3s down 'hard', without notice, like… ten thousand times 🙂. Reading up on sqlite makes me feel more comfortable; its recovery mechanism sounds fine? Do you think I should switch to etcd? k3s and (I am guessing) containerd just seem to take a long time to spin down, and I think they will often (if not always) still be in the middle of that when we lose power. However, k3s comes back up fast and happy on restart… I would like some way to verify it will stay this way? So I am just wondering if I need some safety measure to ensure k3s stays happy? Or what other people do on a tiny single-node production install. Maybe I need education on forcing pods/containers to shut down faster (if it's our fault in design?).
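(On the "shut down faster" point: pod termination time is mostly governed by each pod's grace period. A hedged sketch; `my-app` is a placeholder deployment name and 5s an assumed budget:)

```bash
# Cap how long the kubelet waits for containers to stop before killing them.
kubectl patch deployment my-app -p \
  '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":5}}}}'
```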
n
Performance? The etcd version will use more CPU at scale compared to sqlite.
Corruption? I have not heard of or seen a major difference between the two when it comes to corruption. If you're fine with sqlite and that recovery method, then you should be good to go.
Killing K3s 10K times? Yes, this will likely cause issues over time. If you have an active cluster, with pods writing to the DB, things being scheduled, and traffic coming across, killing K3s hard (via either a `k3s-killall.sh` or a power loss) will cause corruption in your DB. You might get lucky and those lost writes are not needed, but do it a few thousand times and I guarantee you'll run into issues. I'm not sure what others are doing to "ensure k3s stays happy"; there isn't a way AFAIK to verify the integrity of the K3s DB. At the end of the day, this is why High Availability K8s is so valuable: it's very resilient when you have 10s of clusters made up of 3 nodes vs 100s of single-node clusters.
c
We have certainly had sqlite database corruption in places where folks just yanked the power cords out of the backs of machines 😬
(We switched to etcd and implemented a more graceful shutdown!)
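(For anyone following along: embedded etcd is selected at install time with `--cluster-init`. A sketch, assuming a fresh install; migrating existing sqlite state is a separate step, per the backup-restore docs linked above:)

```bash
# Fresh single-node install using embedded etcd instead of sqlite.
curl -sfL https://get.k3s.io | sh -s - server --cluster-init
```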
q
Thank you both for the replies! So our setup is very basic: only a single node with 4 tiny deployments that all work together. Their state never changes, so I am thinking maybe I can nuke the whole setup if anything goes wrong and just re-deploy on startup? Is there a database health check I can run on k3s, like at startup? And if it doesn't show healthy, can I just copy the old original database back in place and re-deploy?
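(A sketch of that restore-from-known-good idea, with k3s stopped first; `/opt/golden` is a hypothetical location for the pristine copy, and it assumes the datastore under `server/db` is all that needs restoring:)

```bash
# Swap in a pristine datastore before starting k3s, then let the
# deployments come back up from the known-good state.
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db
cp -a /opt/golden/k3s-server-db /var/lib/rancher/k3s/server/db
systemctl start k3s
```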
Oh, maybe these guys? https://kubernetes.io/docs/reference/using-api/health-checks/ Like, if I curl that in a cron task and it doesn't report happy, maybe I nuke? Just Slack thinking 😉, if you guys have ideas. This is automotive, so sudden power death is going to happen at times.
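(A sketch of that cron idea; `redeploy.sh` is a hypothetical script that does the nuke-and-redeploy:)

```bash
#!/bin/sh
# Probe the apiserver health endpoint; on failure, trigger recovery.
if ! kubectl get --raw='/livez' >/dev/null 2>&1; then
  /usr/local/bin/redeploy.sh
fi
```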
Hmmm.
```
$ kubectl get --raw='/livez/etcd'
ok
```
That is working. But I am on sqlite. @nutritious-tomato-14686 any chance that is checking sqlite, and I could use it as my health check? If it ever fails, I just nuke (like I said above) and re-deploy?
n
Hey @quiet-memory-19288 I don't think that will help with corruption on the DB. It just reports whether the DB is alive and reachable. AFAIK we don't have any translation support in kine for a `PRAGMA quick_check`.
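(Nothing stops you from running that check against the file directly while k3s is stopped, though. A sketch, assuming the default sqlite path:)

```bash
systemctl stop k3s
# Prints "ok" if the database passes sqlite's integrity checks.
sqlite3 /var/lib/rancher/k3s/server/db/state.db 'PRAGMA quick_check;'
systemctl start k3s
```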
q
Any chance there is a k3s document on, like, IoT deployments? Where we always come up in a stable state? Like, let's say I install k3s and my deployments are now 5+ years old. I don't care about anything that happens (ever) as long as every time it restarts we go back to that one known state? Maybe we upgrade every other year? But besides that, I just want a black-box setup where the end user never knows it's there or has to 'do' anything?
Is there a read-only DB mode? haha, probably not.