# general
d
Hey o/ I was hoping someone could give me some clarity on this:

We keep losing contact with the rancher-system-agent on nodes at random. Sometimes it resolves itself; other times it gets stuck and we need to manually restart the agent. We also sometimes notice machine conditions getting stuck (like PlanApplied=False for months). I'm trying to figure out why and how to resolve this.

I think it's related to how we provision the VMs: we host them on Proxmox servers and then register them to Rancher (not importing existing clusters). We have clusters showing:

driver: imported
provider: rke2

The docs say "Additional Features for Registered RKE2 and K3s Clusters" include version management, but also say imported clusters are limited. We're in this weird middle ground:
- We can control the rancher-system-agent
- But we can't use Rancher for VM lifecycle management (obviously)

Our setup:
- Rancher creates the cluster via rancher2_cluster_v2
- Nodes join via registration commands
- The result shows driver: imported but behaves like the documented "Registered RKE2 clusters"

Question: For stuck nodes on fully managed infrastructure (AWS/etc), how are they handled? Are they destroyed and rebuilt automatically? If so, that's what our setup is lacking, and I guess that would resolve the issues we are seeing.

I feel we should move over to HarvesterOS, but the more senior engineer sees this as a big red flag, does not feel comfortable getting closer to the Rancher ecosystem, and would prefer to rip Rancher out and use upstream K8s.

Just looking for further insight into this, which I can hopefully use to argue the case for keeping Rancher and adding Harvester :)
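For spotting the stuck PlanApplied=False conditions mentioned above, here is a minimal Python sketch. It assumes the condition is reported under `status.conditions` of the machine objects in Rancher's management cluster, and that you feed it the parsed JSON of a list of those objects (for example from something like `kubectl get machines.cluster.x-k8s.io -A -o json`; that exact resource is an assumption about this setup). The function name `stuck_plan_applied` and the one-day threshold are hypothetical, not anything Rancher provides.

```python
from datetime import datetime, timedelta, timezone


def stuck_plan_applied(machine_list, max_age=timedelta(days=1), now=None):
    """Return (namespace, name, since) for machines whose PlanApplied
    condition has been False for longer than max_age.

    machine_list is the parsed JSON of a Kubernetes List object.
    The condition layout is an assumption; adjust to what your cluster reports.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for item in machine_list.get("items", []):
        meta = item.get("metadata", {})
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") != "PlanApplied" or cond.get("status") != "False":
                continue
            ts = cond.get("lastTransitionTime")
            if not ts:
                continue
            # Kubernetes timestamps are RFC 3339 with a trailing "Z".
            since = datetime.fromisoformat(ts.replace("Z", "+00:00"))
            if now - since > max_age:
                stuck.append((meta.get("namespace"), meta.get("name"), since))
    return stuck
```

Running this on a schedule and alerting on a non-empty result would at least surface stuck machines within a day rather than after months; it does not fix the underlying cause.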
b
I got this all the time!
For us, it was DHCP. Kubernetes expects node IPs to stay static. We had to either go in and reserve, for each VM, the IP it originally registered with, or STONITH the node and let a new one deploy.
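A rough way to check for the IP drift described above, as a Python sketch: compare the `InternalIP` each node registered with (from the parsed output of `kubectl get nodes -o json`) against the IP the VM currently has. The `current_ips` mapping is a hypothetical input you would build yourself, for example from the hypervisor or the DHCP lease table; `drifted_nodes` is an illustrative name, not an existing tool.

```python
def drifted_nodes(node_list, current_ips):
    """Return {node_name: (registered_ips, current_ip)} for nodes whose
    current IP is not among the InternalIP addresses Kubernetes has recorded.

    node_list: parsed JSON of `kubectl get nodes -o json`.
    current_ips: hypothetical {node_name: ip} mapping supplied by the caller.
    """
    drifted = {}
    for item in node_list.get("items", []):
        name = item.get("metadata", {}).get("name")
        registered = [
            addr["address"]
            for addr in item.get("status", {}).get("addresses", [])
            if addr.get("type") == "InternalIP"
        ]
        actual = current_ips.get(name)
        # Only flag nodes where both sides are known and they disagree.
        if actual is not None and registered and actual not in registered:
            drifted[name] = (registered, actual)
    return drifted
```

Nodes missing from `current_ips` are skipped rather than flagged, so an incomplete mapping gives false negatives, not false alarms.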
d
All our k8s nodes are static. DHCP is only used in a select few situations, and everything in prod is given a static IP. Interesting that you saw this behavior too, though!