# harvester
b
So we've been having rare but occasional hiccups with networking in our Harvester clusters. Typically all our hosts have 2 SFPs (25G or 100G link speeds depending on the host/cluster), with 1 of them dedicated to the mgmt network and the other dedicated to trunking in all the VM networks/VLANs. When we originally designed the clusters a few years ago, we didn't know (or you couldn't at the time) that you could peel the VM networks off the mgmt interface instead of having to have a separate link for them. For stability/redundancy it seems better to have the two 100G links in a LAG for the mgmt link and use that lagged connection for the VM networks. Our bandwidth, even combined, seems to be well within the limits of a single connection, but the havoc caused by a bad card or SFP failure seems much more detrimental. Here are my questions, if anyone has opinions or guesses:
• Any foreseeable issues with that design?
• What's the best way to reconfigure the networking for each node?
  ◦ Edit the /oem/90_custom.yaml and reboot?
  ◦ Write a new file and reboot? (Are there any validation options or tools that might generate this?)
  ◦ Reinstall?
    ▪︎ Can you leave the node in and just reinstall with the same IP/name, or does it need to be removed first?
  ◦ Live patch and edit/update the yaml?
• Anything else come to mind, or general advice?
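For context, what I'm picturing on the install-config side (the part that ends up rendered into /oem/90_custom.yaml) is roughly the below. This is only a sketch based on the documented Harvester config schema; the interface names, addresses, and bond options are placeholders, not our actual values:

```yaml
# Hypothetical sketch: both SFPs bonded for the management interface.
# Interface names (ens1f0/ens1f1) and bond options are examples only.
install:
  management_interface:
    interfaces:
      - name: ens1f0   # first 100G SFP
      - name: ens1f1   # second 100G SFP
    method: static
    ip: 10.0.0.11
    subnet_mask: 255.255.255.0
    gateway: 10.0.0.1
    bond_options:
      mode: 802.3ad    # LACP; the switch side needs a matching LAG/port-channel
      miimon: 100
```

The VM networks would then just be VLANs on a cluster network riding on that same bond.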
b
i will warn you that your vm network can only have one configuration at a time. i just built new nodes, and i had to add all of them, shut everything down, and change the config to swing everything over to the new nodes.
i'm hoping that the kube-ovn stuff coming in 1.6.0 will make it easier to reconfigure networking. 🤷
b
yeah, I figure that'd likely have to be our plan as well. Or at least do half the nodes at a time and restart the VMs to flip them over to the new network.
b
also just fyi, it appears that live migrations still happen over the management network unless you patch the kubevirt resources to use another network.
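roughly the shape of that patch, for reference. this is just a sketch of the kubevirt CR's dedicated migration network setting; the CR name/namespace and the network name here are assumptions (in harvester the CR is usually `kubevirt` in `harvester-system`):

```yaml
# hypothetical snippet of the KubeVirt CR spec; "migration-net" would be a
# NetworkAttachmentDefinition on the non-mgmt network
spec:
  configuration:
    migrations:
      network: migration-net
```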
b
I think that should be fine. We're essentially going to be reducing the Cluster Network Config down to the mgmt link, but not the VM Network mgmt ... if that makes sense. (confusing because it's named the same in both places.)
b
it matters to me because my nodes now have 1gig for the management network and 10gig for the VM & storage. i just wanted to mention it. 🙂
b
I think storage happens over that by default as well right?
b
over the management network, yeah. i flipped it to use one of the VLANs on my VM network.
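that's the harvester `storage-network` setting, fwiw. rough sketch only; the vlan, cluster network name, and IP range here are made up:

```yaml
# hypothetical storage-network Setting; longhorn gets addresses from "range"
# on the given VLAN of the named cluster network
apiVersion: harvesterhci.io/v1beta1
kind: Setting
metadata:
  name: storage-network
value: '{"vlan":100,"clusterNetwork":"vm","range":"192.168.100.0/24"}'
```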
b
Yeah basically we're going from 100G for each, to two combined 100G.
b
nice. i don't get to have nice hardware like that anymore. 😉
b
Well it's only nice if it works....
Currently it's not working all the time, hence the LAG to try to help shore things up.
Plus we hit a really awful bug with the Broadcom NICs.
b
at $previous_job we had Cisco UCS with lots of uplinks. when we initially put everything in, we had a batch of bad cables. that was nearly impossible to diagnose. i hope for your sake you don't have a problem like that. 🙂
b
It might be. Or the SFPs...
tl;dr: virtio guests got reduced down to dial-up/old DSL speeds.
b
ouch
b
But only the guests, and only for the particular model we had purchased. It got announced on the kernel list, but no one tagged Broadcom.
b
We are looking to do something similar quite soon. Currently each node has 5 NICs: 1 mgmt, 2 compute, 2 storage, and VM migrations use the storage network. For better resiliency we are going to use the 2 compute NICs for the mgmt cluster network and create a new compute virtual machine network from that. Our rough plan is:
1. Label current nodes with oldcompute: true
2. Update the current compute networking configuration to only run on nodes with that label.
3. We use PXE boot to install Harvester, so we will update the management NIC config in the configuration files for that.
4. Reinstall each node with the same IP but a different name (we use node-index-install_datetime as a format)
5. Create a new compute virtual machine network from the mgmt cluster network
6. Move VMs across one by one until the node is full, then repeat with the other nodes
We will lose the ability to only run the new compute network on a subset of nodes, but we don't foresee that being an issue (sketch of the label/nodeSelector bit below).
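For steps 1 and 2, the per-node scoping would look roughly like this as a Harvester VlanConfig. A sketch only; the cluster network name, NIC names, bond options, and the oldcompute label are placeholders for our setup:

```yaml
# hypothetical VlanConfig pinning the existing compute uplink to labeled nodes
# (nodes labeled e.g. via: kubectl label node <node> oldcompute=true)
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: compute-old
spec:
  clusterNetwork: compute          # existing compute cluster network
  nodeSelector:
    oldcompute: "true"             # only nodes still on the old layout
  uplink:
    nics:
      - ens1f0
      - ens1f1
    bondOptions:
      mode: 802.3ad
      miimon: 100
```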
b
When you do 4 - since it has the same IP does it replace the old node, or do you remove it before you do the install?
b
Yep we remove it - I'll add step 3.5 - drain the node and then delete it
For other reasons we have done this numerous times before and never hit an issue reusing the same IP with a different name
If you are also PXE booting, make sure that your node-0 PXE boot config gets changed from CREATE to JOIN! Otherwise you will have a bad time and end up with two clusters.
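For reference, that's the `install.mode` field in the Harvester config served over PXE; roughly like this (the VIP and token are placeholders):

```yaml
# hypothetical join config for a node being reinstalled into the existing cluster;
# mode: create here would bootstrap a brand new cluster instead
scheme_version: 1
server_url: https://<cluster-vip>:443   # existing cluster's management URL
token: <cluster-token>
install:
  mode: join
```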
b
Ah, we're doing PXE but just running through the normal TUI installer.
b
I think it should be possible to do what you said
> Edit the /oem/90_custom.yaml
but we didn't have confidence that something wouldn't come along at some point and overwrite the file with what we specified at install time
so we're opting for the full reinstall of the node
b
Yeah, the switches our boxes are connected to need a full FW update and reboot, so we're looking at a bit of downtime anyway.
b
For our use case the VMs are part of HA deployments, so hopefully we can avoid any downtime if we are in control of when things get shut down and moved (unlike the upgrade process...)