
big-judge-33880

02/22/2023, 11:49 AM
We’re seeing our cluster network bond (802.3ad) for VMs going down on nodes every 10 hours, e.g.:
[Sat Feb 18 17:47:24 2023] device vm-bo left promiscuous mode
[Sun Feb 19 03:47:24 2023] device vm-bo left promiscuous mode
[Sun Feb 19 13:47:24 2023] device vm-bo left promiscuous mode
[Sun Feb 19 23:47:25 2023] device vm-bo left promiscuous mode
[Mon Feb 20 09:47:25 2023] device vm-bo left promiscuous mode
[Mon Feb 20 19:47:25 2023] device vm-bo left promiscuous mode
[Tue Feb 21 05:47:25 2023] device vm-bo left promiscuous mode
[Tue Feb 21 15:47:25 2023] device vm-bo left promiscuous mode
[Wed Feb 22 01:47:25 2023] device vm-bo left promiscuous mode
[Wed Feb 22 11:47:25 2023] device vm-bo left promiscuous mode
Is there anywhere within Harvester that’s useful for figuring out why this happens? The links usually go down for only about a second, and according to the switches we connect to, the physical links never go down. Given the low downtime I’m not too concerned, but since we’re planning to move storage onto this network I’d like to figure out the cause.
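
A few places worth checking on an affected node; the bond name is a placeholder, and the namespace/pod naming of Harvester’s network-controller can differ between versions, so treat this as a rough sketch:

# Kernel-side view around the flap: bonding, bridge and link messages
dmesg -T | grep -Ei 'bond|vm-bo|link is (up|down)'
# Current bond state as the kernel sees it (slave status, LACP info)
cat /proc/net/bonding/<bond>
# Harvester’s VM network bonds are managed by the network-controller pods;
# their logs around the timestamps above should show any reconfiguration
kubectl -n harvester-system get pods | grep -i network
kubectl -n harvester-system logs <network-controller-pod> --timestamps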

happy-cat-90847

02/23/2023, 4:17 AM
Unfortunately, I don’t have a switch to check an 802.3ad config, but I don’t see this with other bond modes. Have you tried another bond mode to see if the issue also happens? Are you certain the switches are running the latest firmware? And what does the switch vendor say about this?
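
Even without switch access, the 802.3ad negotiation can be inspected from the node side; a minimal sketch (device names are placeholders):

# Per-slave LACP state, partner MAC, aggregator IDs and churn state
cat /proc/net/bonding/<bond>
# How often each physical port has actually lost carrier (should stay flat
# if, as the switches claim, the physical links never go down)
cat /sys/class/net/<nic1>/carrier_changes
cat /sys/class/net/<nic2>/carrier_changes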

big-judge-33880

02/23/2023, 3:54 PM
Will get the Juniper switches updated to the latest firmware. I’m required to use 802.3ad for this, but I’ll try to set up another system with a different bonding mode after verifying whether it also suffers from the same issue 👍
@happy-cat-90847 is there a mode you’d recommend for testing? The switch software is fully updated now, but bonds managed by the cluster controller still go down every 10 hours.
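
To pin down the exact flap times, and whether only the bond or also its slaves go down, something like this can be left running on a node (generic Linux tooling, nothing Harvester-specific):

# Timestamp every link-state change reported over netlink, for later grepping
ip monitor link | while read -r line; do
  echo "$(date -Is) $line"
done >> /tmp/link-events.log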

happy-cat-90847

03/29/2023, 12:12 PM
You can use balance-alb - hopefully it will show you whether the same issue continues.
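
If the uplink is defined through a VlanConfig (the Harvester 1.1-era CRD; the field names below are a sketch of that schema, so verify against your version’s docs), switching the bond mode is roughly:

kubectl apply -f - <<'EOF'
apiVersion: network.harvesterhci.io/v1beta1
kind: VlanConfig
metadata:
  name: vm-uplink            # placeholder
spec:
  clusterNetwork: vm-net     # placeholder cluster network name
  uplink:
    nics:
      - eno1                 # placeholder NICs
      - eno2
    bondOptions:
      mode: balance-alb      # previously 802.3ad
      miimon: 100
EOF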

big-judge-33880

03/31/2023, 1:26 PM
Switched to balance-alb, but the problem remains, so I filed https://github.com/harvester/harvester/issues/3744 about it 👍
(that is, I first set up a completely fresh cluster with 802.3ad, then switched)

happy-cat-90847

03/31/2023, 1:27 PM
I’d argue this might be a problem with the switch, or the NICs plus switch. We don’t see this with our hardware. Have you tried installing SLES or another Linux with a similar kernel version on that hardware?
I’ll poke around and see if anyone can comment.
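
When testing SLES (or any other distro) on the same hardware, a bare bond built with iproute2 and left up for a day, with no Harvester components involved, should be enough to separate a NIC/driver/switch problem from something the network-controller does; NIC names are placeholders:

modprobe bonding
ip link add bondtest type bond mode 802.3ad miimon 100
ip link set eno1 down && ip link set eno1 master bondtest
ip link set eno2 down && ip link set eno2 master bondtest
ip link set bondtest up
# ...wait well past the 10-hour mark, then check for flaps:
dmesg -T | grep -i bondtest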

big-judge-33880

03/31/2023, 1:29 PM
Yeah, that’s a possibility, though we see this across two types of switches and two types of NICs - the management bonds on the same machines are unaffected, but they’re also handled differently (by wickedd).
I guess a good test in that case is running management on the bonds we have issues with.
It seems we’re seeing the same behaviour regardless of switch/NIC, as long as it’s a bond managed by the network-controller pods in Harvester.
Updated the issue, hopefully someone will be able to help us experience that eureka moment 🙂
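
Since only the bonds handled by the network-controller are affected, comparing its logs with wicked’s around one of the flap timestamps may show what differs; the pod names and systemd unit names below are assumptions that vary by Harvester/SLE version:

# Network-controller side (manages the VM network bonds/bridges)
kubectl -n harvester-system get pods -o wide | grep -i network-controller
kubectl -n harvester-system logs <network-controller-agent-pod> --timestamps | grep -iC3 bond
# wicked side on the same node (manages the unaffected management bond)
journalctl -u wickedd -u wickedd-nanny --since "2023-02-22 01:40" --until "2023-02-22 02:00"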