# harvester
m
Hello, after upgrading the Harvester cluster from 1.4.0 to 1.4.1, the nvidia-driver-runtime pods had to pull the driver image from the endpoint again, and on two of the four nodes the error shown below appears. There are no vGPUs/SR-IOV GPUs or PCI Devices enabled from the GPUs on those nodes. Is there a known solution for this error? Or is there a way to reset the kernel on the Harvester host? I have tried reapplying the nvidia-runtime and pcidevice-controller multiple times. Thank you!
ERROR: Unable to load the kernel module 'nvidia-vgpu-vfio.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
Kernel module compilation complete.
Kernel module load error: No such file or directory
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Kernel messages:
[172690.894725] nvidia_vgpu_vfio: Unknown symbol mdev_register_driver (err -2)
[172690.894764] nvidia_vgpu_vfio: Unknown symbol mtype_get_type_group_id (err -2)
[172691.119872] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
[173024.822566] nvidia: externally supported module, setting X kernel taint flag.
[173024.823988] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[173024.823997] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.127.06  Release Build  (dvs-builder@U16-I1-N07-18-8)  Wed Oct  9 12:29:53 UTC 2024
[173024.842728] nvidia_vgpu_vfio: Unknown symbol mtype_get_parent_dev (err -2)
[173024.842829] nvidia_vgpu_vfio: Unknown symbol mdev_parent_dev (err -2)
[173024.842861] nvidia_vgpu_vfio: Unknown symbol mdev_unregister_driver (err -2)
[173024.842996] nvidia_vgpu_vfio: Unknown symbol mdev_register_device (err -2)
[173024.843032] nvidia_vgpu_vfio: Unknown symbol mdev_unregister_device (err -2)
[173024.843062] nvidia_vgpu_vfio: Unknown symbol mdev_register_driver (err -2)
[173024.843089] nvidia_vgpu_vfio: Unknown symbol mtype_get_type_group_id (err -2)
[173025.065669] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
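(For anyone hitting the same error, a few basic checks on the affected host before reinstalling anything; the commands are standard Linux tooling and the log path comes from the error output above, so this is only a diagnostic sketch:)
sudo tail -n 50 /var/log/nvidia-installer.log       # the actual module load failure is logged here
lsmod | grep -E 'nouveau|nvidia|mdev'               # which related kernel modules are currently loaded
modinfo mdev 2>/dev/null || echo "mdev not found"   # is the mdev module even available for this kernel
uname -r                                            # running kernel version the driver was built against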
r
I'm also having issues with that. Scheduling fails, apparently because pcidevices is not registering the devices... Insufficient nvidia.com/NVIDIA_RTXA5000-24Q
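(One way to confirm whether the node is actually advertising the vGPU resource, using standard kubectl; <node-name> is a placeholder:)
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'    # should list nvidia.com/NVIDIA_RTXA5000-24Q with a non-zero count
kubectl describe node <node-name> | grep -i nvidia                  # quick look at the Capacity/Allocatable entries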
n
cc:@great-bear-19718
m
Solved it using the following on the affected hosts:
:~ # lsmod | grep mdev    (--> should be activated; here it returned nothing)
:~ # sudo modprobe mdev
:~ # lsmod | grep mdev
mdev                   20480  0
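(If the module also disappears after reboots, a persistent variant could look like the sketch below. systemd-modules-load and /etc/modules-load.d are standard, but this is untested here, and Harvester's host OS is partly immutable, so verify on your version whether changes under /etc survive upgrades:)
echo mdev | sudo tee /etc/modules-load.d/mdev.conf   # ask systemd-modules-load to load mdev at every boot
sudo modprobe mdev                                   # load it now without rebooting
lsmod | grep mdev                                    # confirm it is present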
What do you think caused mdev not to be loaded?
n
@most-airport-92897 is this mdev module related to the nvidia-driver-toolkit add-on? There is a similar issue with a PCIe module getting unloaded after a few hours of not being used: https://github.com/harvester/harvester/issues/7815 The fix there addresses it by loading the module again whenever PCI passthrough is enabled. cc: @clean-cpu-90380 @prehistoric-balloon-31801
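(Very roughly, the idea behind that kind of fix is just "make sure the module is there before it is needed"; this is NOT the code from the linked issue, only a one-line sketch of the concept:)
lsmod | grep -q '^mdev ' || sudo modprobe mdev    # reload mdev right before vGPU / PCI passthrough is (re)enabled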