Hey o/ I was hoping someone could give me some clarity on this:
We keep randomly losing contact with the rancher-system-agent on nodes. Sometimes it recovers on its own; other times it gets stuck and we have to restart the agent manually. We also see machine conditions getting stuck (e.g. PlanApplied=False for months). I'm trying to figure out why this happens and how to resolve it.
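For reference, this is roughly what we end up doing by hand when it happens (just a sketch; it assumes the agent runs as the rancher-system-agent systemd unit on the node, and that kubectl points at the Rancher management/local cluster, where the machine objects for our clusters live in fleet-default; <machine-name> is a placeholder):

    # On the affected node: check the agent and its recent logs, then restart it
    systemctl status rancher-system-agent
    journalctl -u rancher-system-agent --since "1 hour ago"
    systemctl restart rancher-system-agent

    # Against the Rancher management cluster: inspect machine conditions
    kubectl -n fleet-default get machines.cluster.x-k8s.io
    kubectl -n fleet-default get machines.cluster.x-k8s.io <machine-name> -o jsonpath='{.status.conditions}'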
I think it's related to how we provision the VMs: we host them on Proxmox servers and then register them with Rancher (not importing existing clusters). Our clusters show:
driver: imported
provider: rke2
The docs list version management under "Additional Features for Registered RKE2 and K3s Clusters", but they also say imported clusters are limited.
We're in this weird middle ground:
We can control the rancher-system-agent
But can't use Rancher for VM lifecycle management (obviously)
Our setup:
Rancher creates the cluster via rancher2_cluster_v2
Nodes join via registration commands (sketched below)
The result shows driver: imported, but the cluster behaves like the documented "Registered RKE2 clusters"
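Concretely, the join step is just the system-agent install command from the cluster's registration page, something like this (our URL and token replaced with placeholders; the role flags vary per node):

    curl -fL https://<rancher-url>/system-agent-install.sh | sudo sh -s - \
        --server https://<rancher-url> \
        --token <registration-token> \
        --etcd --controlplane --worker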
Question: on fully managed infrastructure (AWS etc.), how are stuck nodes handled? Are they destroyed and rebuilt automatically?
If so, that's the piece our setup is lacking, and I'd guess adding it would resolve the issues we're seeing.
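From what I can tell, node-driver machine pools have an "unhealthy node timeout" auto-replace option, which (if I understand right) Rancher implements with Cluster API MachineHealthCheck objects on the management cluster, so something like this should show whether any remediation is configured for a cluster (again just a sketch):

    # On the Rancher management cluster: list any remediation configured per cluster
    kubectl -n fleet-default get machinehealthchecks.cluster.x-k8s.io

With our custom/registered nodes there's nothing Rancher could delete and recreate anyway, since it doesn't own the Proxmox VMs.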
I feel we should move over to Harvester, but our more senior engineer sees these issues as a big red flag. He isn't comfortable getting deeper into the Rancher ecosystem and would rather rip Rancher out and run upstream Kubernetes.
Just looking for further insight I can hopefully use to argue the case for keeping Rancher and adding Harvester :)