# harvester
Anyone feeling generous enough to tell me what I am doing wrong with IP Pools, external DHCP servers, Harvester cluster networks/network configs, and VM networks? I deployed 1.2.1 a while back and got this all working with a VLAN on the switches - but back then it might have been over the same port as management. These days, with the additional NIC requirements, I'm somehow fumbling in the dark. Will wait to hook someone before posting all the details in a thread reply. TIA!
I think it's a requirement for the subnet/gateway to be accessible from the mgmt network (i.e., you can only have pools in the mgmt VLAN, because the pool address attaches to that network interface - mgmt-bo - on the hosts).
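If you want to sanity-check that on a node, something like this should show where the pool/LB addresses actually land (a sketch assuming the stock mgmt-br / mgmt-bo names on the hosts):

# IPv4 addresses bound to the management bridge; LB addresses handed out from
# an IP pool should show up here as additional addresses.
ip -4 addr show dev mgmt-br

# The bond underneath it, and which physical port is currently active.
cat /proc/net/bonding/mgmt-bo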
Interesting, but I'm not fully following the suggested requirement. These are bare metal nodes. For the Network Config I selected all the p1 ports (p0 is mgmt) on the nodes, and those are on switch ports trunked for VLANs 102-200.

My external DHCP server is also on a trunked switch port and has a vlan102 interface over which it is serving a /24. My external 'ingress' node (nginx) is also on a trunked switch port and has a vlan102 interface, plus a vlan200 interface to act as the default gateway for those networks.

The VM network for 102 uses that VLAN ID with the route set to "Auto (DHCP)", since the external DHCP server at 192.168.201.1 hands out the ingress node IP 192.169.102.254 as the default gateway for that network. (This is how DHCP is set up for mgmt, and it works just fine there with the server and gateway being different nodes.)

The VM network for 200 uses an IP Pool: VLAN ID 200, 192.168.200.0/22, gateway 192.168.203.254, start IP 192.168.200.10, end IP 192.168.203.250.

Only VM network 200 shows the "active" route connectivity - so does Harvester not use DHCP to learn the route from external servers when mgmt isn't on that VLAN? Perhaps that is what your quick reply was saying. VMs on either of the two VM networks fail to DHCP an IP and can't reach the default gateway.
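For what it's worth, here's roughly how I've been checking whether DHCP traffic even reaches the node's uplink for the VM network (a rough sketch; the bond name is whatever your cluster network generates - cnet1-bo in my case - and VLAN 102 is assumed):

# On a Harvester node carrying the VM network uplink, as root.
# Confirm the bond for the non-mgmt cluster network actually has an active slave.
cat /proc/net/bonding/cnet1-bo

# Watch for DHCP DISCOVER/OFFER traffic tagged with VLAN 102 while a VM boots.
tcpdump -ni cnet1-bo -e 'vlan 102 and (port 67 or port 68)'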
I opened a ticket with support about attaching IPs to VM networks. Let me dig up what they sent.
๐Ÿ™ 1
Assuming the Harvester nodes' management interfaces are attached to the VLAN 1 network. The user creates an additional cluster network called provider (this implies using a secondary network interface other than the management one on the nodes) and then creates three VM networks, net-1, net-100, and net-200, with VLAN 1, 100, and 200, respectively. The net-1 and net-100 networks are associated with the default mgmt cluster network. The remaining one, net-200, is associated with the provider cluster network. The user then creates three LB IP pools for the three VM networks, called pool-1, pool-100, and pool-200.

Case 1: The VM is attached to the net-1 network, and an LB is created from the pool-1 IP pool. This configuration is straightforward, and it works out of the box.

Case 2: The VM is attached to the net-1 network, and an LB is created from the pool-100 IP pool. This configuration doesn't work because the LB IP address is currently always bound to the mgmt-br interface. Since the management interface is attached to the VLAN 1 network, VLAN 100 traffic won't reach the mgmt-br interface.

Case 3: The VM is attached to the net-1 network, and an LB is created from the pool-200 IP pool. This configuration doesn't work for a similar reason as case 2.

Case 4: The VM is attached to the net-100 network, and the LB is created from the pool-1 IP pool. This configuration works as the LB IP address, which is bound to the mgmt-br interface, is in the same VLAN network as the node's management interface. Traffic will be DNAT'd and routed by the default gateway to the backend VM once it reaches the mgmt-br interface.

Case 5: The VM is attached to the net-100 network, and the LB is created from the pool-100 IP pool. This configuration doesn't work for a similar reason as case 2.

Case 6: The VM is attached to the net-100 network, and the LB is created from the pool-200 IP pool. This configuration doesn't work for a similar reason as case 2.

Case 7: The VM is attached to the net-200 network, and the LB is created from the pool-1 IP pool. This configuration works for a similar reason as case 4. The only difference is the DNAT'd and routed traffic goes through the provider-br bridge but not mgmt-br.

Case 8: The VM is attached to the net-200 network, and the LB is created from the pool-100 IP pool. This configuration doesn't work for a similar reason as case 2.

Case 9: The VM is attached to the net-200 network, and the LB is created from the pool-200 IP pool. This configuration doesn't work for a similar reason as case 2.
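(For mapping those cases onto an actual cluster, this is roughly how I list the VM networks and LB IP pools to see which pairing applies. It's a sketch - the API group and short names are as I recall them, so verify with kubectl api-resources first, and net-200 / the default namespace are just the example names from above.)

# VM networks are NetworkAttachmentDefinitions; spec.config shows the bridge
# (i.e. which cluster network) and the VLAN ID each one uses.
kubectl get net-attach-def -A
kubectl get net-attach-def -n default net-200 -o jsonpath='{.spec.config}'

# Load balancer IP pools and LBs.
kubectl get ippools.loadbalancer.harvesterhci.io
kubectl get loadbalancers.loadbalancer.harvesterhci.io -A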
OK - I was definitely doing case 9 for the VLAN 200 network. I didn't think I'd need an LB for the 102 (external DHCP) network to get those plumbed - how does Harvester manage adding the VLANs to the 'provider' interface when it isn't going to have an address for them at the OS layer? I'm also not sure I understand what it means to construct an LB from the mgmt IP pool for the VLAN 200 network. This feels really odd - mixing networks this way, especially as Case 7 describes the DNAT'd and routed traffic going through the separate physical port. I guess I'll poke at some stuff with this clue. Were some of the non-working cases in this breakdown meant to be fixed in the future, or is this how it stays going forward?
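I'll start by poking at the bridge VLAN tables, since as far as I can tell that's how the VLANs get attached to the non-mgmt bridge without any OS-level address (a sketch; the provider-br name is from the support example above - mine will be named after my cluster network):

# Bridge VLAN filtering: which VLAN IDs are allowed on each bridge port.
# The uplink bond and the VM tap devices should both list the VM network's VLAN.
bridge vlan show

# The non-mgmt bridge itself normally carries no IP address at the OS layer.
ip -d link show provider-br
ip -4 addr show dev provider-br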
I have a feature request in with the ticket I created, because attaching the LB IP to different interfaces would potentially "fix" all of these cases.
It has not yet been accepted.
You chiming in with SUSE (not here) might help move that along, though.
Ah. Super helpful write-up. Right, I haven't been on the openSUSE Bugzilla in a while, but when I was last there I didn't see anything for Rancher/Harvester. Yeah - nothing for Rancher/Harvester on either the openSUSE or SUSE Bugzillas. Where is that hosted for Rancher/Harvester - GitHub only?
Hey @bland-article-62755 -- I have one node in my 12-node Harvester cluster that is properly plumbed for my external DHCP server - when VMs get deployed there, the vlan102 interface gets an IP from my server. So far it is the only node I have found that works properly, and I know I did nothing to make it so. Sigh.
I'm going to create scheduled VMs on all 12 nodes and see what the hit/miss ratio is.
Answer: exactly one, and surprisingly not the node with the VIP. ou31r is the working one.

diff -r ou22c-compute/cnet1-bo.txt ou31r-compute/cnet1-bo.txt
5,6c5,6
< Currently Active Slave: None
< MII Status: down
---
> Currently Active Slave: enp59s0f0np0
> MII Status: up
10a11,18
>
> Slave Interface: enp59s0f0np0
> MII Status: up
> Speed: 25000 Mbps
> Duplex: full
> Link Failure Count: 0
> Permanent HW addr: 7c:fe:90:cb:73:62
> Slave queue ID: 0
Somehow the Network Config is failing to bring up the bond on all but one node.
Harvester Network Cluster == bond is not working as expected :D In my deployment of 1.5.1 it is super inconsistent. The 1:1 setup seems to be failing, and sometimes even the cleanup of the /proc/net/bonding *-bo file after a cluster network has been deleted doesn't happen.

I tagged the nodes where I added PCIe NICs post-Harvester-install with 'haspcienic: true' and use that label to filter the nodes into the cluster network's Network Config. I added the cards after installing Harvester and modified /oem/90_custom.yaml to describe the new NIC ports, and they show up fine in the UI. I add only a single port for the active-backup bond (this doesn't seem to be an issue for the mgmt cluster network).

I have created and deleted these cluster networks many times for these ports and have only once seen 1 of 11 nodes properly show an active slave. Every other -bo file has 'Currently Active Slave: None' (see the thread for the bond file compare of non-working vs. working, if interested).

Anyone know how this gets policed by the network cluster manager? Any suggestions other than dmesg and /proc/net/bonding to help me collect "status" or "desired state" before submitting a GitHub issue?
for example
ou26l-compute:
----------------
mgmt-bo

ou26r-compute:
----------------
cnet1-bo
cnet2-bo
mgmt-bo

ou31c-compute:
----------------
mgmt-bo
should I clean those cnet*-bo files by hand?
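For reference, here's roughly what I plan to collect before opening the GitHub issue (a sketch - the CRD names are the Harvester network-controller resources as I understand them, so confirm with kubectl api-resources, and the pod name below is a placeholder; a support bundle from the UI should capture most of this too):

# Desired state: the cluster networks and the per-node VLAN/bond configs.
kubectl get clusternetworks.network.harvesterhci.io -o yaml
kubectl get vlanconfigs.network.harvesterhci.io -o yaml

# Reported state: per-node status objects plus the network controller logs.
kubectl get vlanstatuses.network.harvesterhci.io -o yaml
kubectl -n harvester-system get pods | grep -i network
kubectl -n harvester-system logs <network-controller-pod-from-above> --tail=200

# OS-level state on each node, to compare against the above.
grep -H 'Currently Active Slave' /proc/net/bonding/*-bo
dmesg | grep -i bond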
Reinstalled all the hardware - including only the devices with the PCIe NICs in the new cluster - using 1.6.0 rc6. I know. Too many changes to finger point. Everything works as expected with VLAN on port and external DHCP server. Hooray.
Hi @brave-garden-49376, congratulations! I guess you ended up with DHCP/LB on the same interface port as the mgmt network, right?
From what I see, this port should be the fastest one if there are multiple ports on the node (and inter-VM / cluster-network traffic is not as busy as the traffic from the LB) - is that a correct assumption?
For default configs, yes. That may not be true if you reconfigure the defaults to have other backend traffic (like longhorn) use different interfaces.
๐Ÿ‘ 1
@brash-petabyte-67855 I have one cluster network for mgmt (by default) mapped to the physical port described in the YAML for the PXE install, and another cluster network mapped by a network config to a different physical port configured as type VLAN. On top of that there's a VM network set to use VLAN 102 and get its route via DHCP. That VLAN is on the trunked switch ports and is served DHCP by a completely external DHCP server.

This wasn't working earlier with the 1.5.1 deployment: VMs configured to use the VLAN 102 VM network were not getting DHCP addresses because the physical ports were not in the bonding configuration for the cluster network. The 102 network should be quiet and for VMs only (other than the DHCP server and the external nginx + default gateway box).

As for speed, iperf3 showed 17-18 Gbit/s for these 25 Gbit ports between two VMs, so it doesn't suck 😄 but it is ~5 Gbit/s slower than I typically get for the OS on the physical nodes.
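(For reference, the iperf3 runs behind those numbers were along these lines - the server VM IP here is just a placeholder.)

# On the 'server' VM:
iperf3 -s

# On the 'client' VM: single stream, then a few parallel streams, then reverse.
iperf3 -c 192.168.102.50 -t 30
iperf3 -c 192.168.102.50 -t 30 -P 4
iperf3 -c 192.168.102.50 -t 30 -R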
๐Ÿ‘ 1