# k3s
c
I highly doubt it; what makes you suspect it is at all related?
Do you have journal logs to share from the nodes that are creating/deleting the snapshots?
I'm also confused when you say the apiserver container is crashing; k3s doesn't run the apiserver in a container...
b
I have all the logs in Loki so I can get you whatever you need. I'm running the k3s control plane as an agentless container in a pod on AKS - we briefly talked about it a long time ago when I had issues with the reverse connection to the nodes.
The etcd snapshots would be on the PVC where I store the database
Found this in the log
time="2024-03-23T03:05:40Z" level=error msg="Failed to record snapshots for cluster: nodes \"k3st-control-plane-0\" not found"
this is correct as this is an agentless instance - no node is registered for it
Other than that, no other errors
c
oh yeah. we don’t technically support embedded etcd with --disable-agent. It will get really confused when there is no Node resource for the host where etcd is running.
if you want to open an issue on GH I can take a look, but I am not at all surprised that it freaks out
b
I can imagine; that's why I only run a single control plane pod. Either way, something must have changed, as it worked fine before and I've been doing this for 2+ years. I can roll back to 1.27.7 to double check. And sure, I can open an issue.
c
Yes, we completely redid how snapshots are recorded in the cluster. There is now a CRD type that records what node took the snapshot, alongside other metadata
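From memory, the records show up as ETCDSnapshotFile objects in the k3s.cattle.io API group, so something like this should list them (treat the exact resource name as an approximation on my part):

```
# List the snapshot records the server now writes into the cluster
kubectl get etcdsnapshotfiles.k3s.cattle.io

# Inspect one record to see the node name and other metadata it carries
kubectl get etcdsnapshotfiles.k3s.cattle.io <snapshot-name> -o yaml
```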
When started with the `--disable-agent` flag, servers do not run the kubelet, container runtime, or CNI. They do not register a Node resource in the cluster, and will not appear in `kubectl get nodes` output. Because they do not host a kubelet, they cannot run pods or be managed by operators that rely on enumerating cluster nodes, including the embedded etcd controller and the system upgrade controller.
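So a setup like yours presumably boils down to a server started roughly like this - my guess at the shape of the command, not your actual manifest:

```
# Agentless control plane with embedded etcd: no kubelet, no CNI, and no
# Node object registered for the server itself.
# --cluster-init starts embedded etcd; --disable-agent skips the agent.
k3s server \
  --cluster-init \
  --disable-agent \
  --data-dir /var/lib/rancher/k3s
```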
why are you running a single node with embedded etcd instead of kine?
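kine just means pointing the server at an external SQL datastore instead of embedded etcd, something like this (the endpoint string is only an example):

```
# Single server backed by kine (SQL datastore) instead of embedded etcd
k3s server \
  --disable-agent \
  --datastore-endpoint="postgres://k3s:password@db.example.com:5432/k3s"
```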
b
Well, at first I had 3 control plane nodes at home, so I started off with etcd and just didn't want to start over after I moved this cluster's control plane to AKS. Replicating etcd through a VPN between two locations was too unreliable, and having a single control plane instance in AKS is far more reliable.
So at the time I just moved the etcd database and performed a restore there. Worked fine since.
With a little init container workaround that fixes up the etcd node IP, since it changes every time the pod starts.
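Roughly the idea is something like this - a sketch, not the exact script, and the config path is an assumption about where k3s keeps its etcd config:

```
#!/bin/sh
# Sketch of the init-container idea: rewrite the IP etcd advertises so it
# matches the pod's current IP before k3s starts. The config path below is
# assumed; adjust to wherever k3s writes its etcd config in your data dir.
set -eu

ETCD_CONF=/var/lib/rancher/k3s/server/db/etcd/config
CURRENT_IP="$(hostname -i | awk '{print $1}')"

# Swap whatever IP the previous run recorded for the pod's new IP
# (2380 = etcd peer port, 2379 = etcd client port).
sed -i -E "s#https://[0-9.]+:2380#https://${CURRENT_IP}:2380#g" "$ETCD_CONF"
sed -i -E "s#https://[0-9.]+:2379#https://${CURRENT_IP}:2379#g" "$ETCD_CONF"
```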
Would disabling snapshots resolve it?
c
Most likely. You could also try creating a dummy node object that matches the container hostname, and setting the various labels and annotations that the controller expects.
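For reference, --etcd-disable-snapshots is the server flag that turns snapshots off entirely. If you go the dummy-node route instead, the rough shape would be something like this; the exact labels and annotations the controller looks for are an assumption on my part, so compare against a Node from a normal etcd server first:

```
# Rough sketch of a dummy Node for the agentless server. Label and annotation
# names/values are assumptions; verify against a Node from a regular etcd node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Node
metadata:
  name: k3st-control-plane-0              # must match the container hostname
  labels:
    node-role.kubernetes.io/etcd: "true"
    node-role.kubernetes.io/control-plane: "true"
  annotations:
    etcd.k3s.cattle.io/node-name: k3st-control-plane-0
    etcd.k3s.cattle.io/node-address: 10.0.0.10   # the etcd/pod IP
EOF
```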
Please do open an issue though; I think we could probably improve the handling of this for the April releases, so that you can keep using snapshots without the extra bits of functionality that expect a Node resource to exist for the host that’s running etcd.
this is actually super easy to repro, I made one for you: https://github.com/k3s-io/k3s/issues/9774
b
Thank you, Brandon. I went to bed as it was 4 am and woke up to you having done all the repro work. Luckily it was an easy repro - in the meantime I think I'll step back to a release that doesn't have this issue. Could you let me know which release of 1.27 introduced the new behavior? Looks like 1.27.8+k3s2?
c
first 1.27 release with the new snapshot management in it was v1.27.7+k3s1
b
Thanks for doing the fix so quickly and yeah, that's strange - I rolled back to 1.27.7+k3s2 and I don't have this issue anymore. Snapshots work fine. Let me know if there's a dev build you want me to try.
c
What you’re running into might have been specific to some of the enhancements we added in one of the subsequent patch releases. There was some tinkering with it after the fact.