# harvester
t
Hi, I have a 3-node cluster with some Mellanox ConnectX-3 cards (those are plug-and-play driver-wise). I was looking to upgrade to cards with dual SFP+ ports, but if I go the newer ConnectX route I have to install the driver with a script provided by Nvidia (maybe there is a simpler way), which is not ideal. Or maybe I should look at other models; is there a "recommended" list of cards that are known to work well with Harvester? Edit: maybe it's as simple as adding the mlx5_core module in the Harvester config
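For reference, a minimal sketch of what that could look like in the Harvester install config, assuming the os.modules key (a list of kernel modules to load at boot) is supported by your Harvester version and that mlx5_core is the right module for the newer ConnectX cards:
# hypothetical fragment of the install config.yaml
os:
  modules:
    - mlx5_core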
b
Hello. Partially sighted leading the partially sighted here, but I think you're right. You shouldn't have to install any drivers for Mellanox ConnectX cards. I have some dual-port ones that load the typical Linux modules:
rancher@c2g-ou24r-compute:~> lsmod | grep mlx
mlx5_ib               401408  0
ib_uverbs             172032  2 rdma_ucm,mlx5_ib
ib_core               430080  10 rdma_cm,ib_ipoib,rpcrdma,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core            1847296  1 mlx5_ib
mlxfw                  36864  1 mlx5_core
psample                20480  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core
tls                   110592  5 cxgb4,bonding,mlx5_core
How you modify the /oem files to make this happen and introduce the new NIC ports might be the real challenge. Looks like /oem/90_custom.yaml might be one place to start, though mine only has these modprobe lines:
name: Harvester Configuration
stages:
    initramfs:
        - commands:
            - modprobe kvm
            - modprobe vhost_net
            - modprobe kvm
            - modprobe vhost_net
That's the file that has the ifcfg files and bonding config too. Finally, I can say that I installed with single-port mlx NICs and then swapped those out for dual-port ones, and when I configured the cluster networks (or whatever is supposed to map to the bonds) they frequently failed to incorporate the NIC ports into the bond configuration - so basically nothing worked. A fresh install with the dual-port NIC in place fixed all of the errant behavior. So perhaps a rolling fresh install is the safest and least troublesome path? I can't speak to how to do this with my limited experience, but it seems like it should be possible.
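If loading the driver is all that's needed, a minimal sketch (reusing the stage format from the snippet above, and assuming mlx5_core is the right module for the card) would be to add one more modprobe line alongside the existing ones:
name: Harvester Configuration
stages:
    initramfs:
        - commands:
            - modprobe kvm
            - modprobe vhost_net
            - modprobe mlx5_core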
t
I have a QSFP ConnectX-4 in another server; I can swap it in and try when I have time. Since I have three nodes, it's the perfect opportunity to try the resiliency/maintenance features 😁
😂 1
🤞 1
a
You can use a CloudInit CR on your Harvester nodes?
t
I think I was initially asking myself the question because the recommended way to install MLNX_OFED was to get the ISO from Nvidia, mount it, and run the install script. But I haven't checked whether I just need to enable mlx5_core or something like that.
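A quick way to check that, assuming modinfo and lspci are available on the node, is to see whether the in-tree mlx5 driver ships with the kernel and whether it is already bound to the card:
# is the in-tree driver present in the kernel?
modinfo mlx5_core | head -n 3
# which kernel driver (if any) is bound to the Mellanox device?
lspci -nnk | grep -iA3 mellanox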
a
I'm doing a similar thing on my nodes to prevent crashes caused by faulty kernel modules for Intel I219-LM NICs. You can use the CloudInit CR to script the install or configuration of drivers.
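A sketch of what that could look like, with field names taken from memory of the Harvester CloudInit CRD docs (double-check apiVersion and spec fields against your version), and the resource name and filename being made-up examples:
apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: load-mlx5              # hypothetical name
spec:
  matchSelector: {}            # apply to all nodes
  filename: 99-mlx5.yaml       # written under /oem on each matching node
  contents: |
    stages:
      initramfs:
        - commands:
            - modprobe mlx5_core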
t
I am not sure what you mean by CR, but yeah, running a script from cloud-init.
a
t
Ahh right I see I see