Hi. Question about Rancher upgrades. I just upgraded Rancher from v2.7.9 to v2.8.2. The upgrade of Rancher itself went really smoothly. I have a number of downstream clusters, a mixture of RKE2 (Rancher created) and Harvester (imported, managed or whatever the right term is...).
One of our Harvester clusters started going haywire. I have about 100 dev VMs on this particular cluster, and the VMs were constantly being paused, restarted etc., with thousands of events about Longhorn and failures to mount volumes.
Eventually it calmed down, and a stop/start of the VMs brought them back.
After some investigation I couldn't find any useful logs, but I think what happened is that while Rancher was upgrading, every downstream cluster, including the Harvester clusters, was triggered to snapshot every volume at the same time. On our downstream k8s clusters on Longhorn 1.5.1 and our smaller Harvester clusters with Longhorn 1.4.3 I saw a massive network spike during the snapshotting, but the clusters handled it and recovered fine. Our cluster with all the VMs, however, seemed to overwhelm our 10G network, and then Longhorn got into a bit of a state.
I guess my question is: I can't find any documentation about this downstream Longhorn snapshotting. Is it a feature of the upgrades? Can it be turned off or toggled? Should we be disconnecting Harvester from Rancher during upgrades? Or have I mistaken correlation for causation, and it was just a big coincidence?