adamant-kite-43734
07/22/2024, 8:46 PM
great-bear-19718
07/22/2024, 11:18 PM
great-bear-19718
07/22/2024, 11:19 PM
nvidia-smi output
numerous-angle-77908
07/22/2024, 11:21 PM
great-bear-19718
07/22/2024, 11:21 PM
numerous-angle-77908
07/22/2024, 11:22 PM
great-bear-19718
07/22/2024, 11:22 PM
great-bear-19718
07/22/2024, 11:22 PM
numerous-angle-77908
07/22/2024, 11:22 PM
great-bear-19718
07/22/2024, 11:23 PM
numerous-angle-77908
07/22/2024, 11:23 PM
numerous-angle-77908
07/22/2024, 11:23 PM
great-bear-19718
07/22/2024, 11:23 PM
numerous-angle-77908
07/22/2024, 11:23 PM
great-bear-19718
07/22/2024, 11:24 PM
great-bear-19718
07/22/2024, 11:24 PM
numerous-angle-77908
07/22/2024, 11:24 PM
great-bear-19718
07/22/2024, 11:24 PM
numerous-angle-77908
07/22/2024, 11:25 PM
great-bear-19718
07/22/2024, 11:25 PM
numerous-angle-77908
07/22/2024, 11:26 PM
numerous-angle-77908
07/22/2024, 11:29 PM
great-bear-19718
07/22/2024, 11:30 PM
numerous-angle-77908
07/22/2024, 11:30 PM
numerous-angle-77908
07/22/2024, 11:31 PM
numerous-angle-77908
07/22/2024, 11:32 PM
numerous-angle-77908
07/22/2024, 11:34 PM
great-bear-19718
07/22/2024, 11:47 PM
nvidia-driver-runtime 0 0 0 0 0 sriovgpu.harvesterhci.io/driver-needed=true 10d
driver is not loaded because none of the nodes have been labelled by the pcidevices controller
great-bear-19718
07/22/2024, 11:47 PM
great-bear-19718
07/22/2024, 11:48 PM
numerous-angle-77908
07/22/2024, 11:48 PM
great-bear-19718
07/22/2024, 11:49 PM
great-bear-19718
07/22/2024, 11:49 PM
sriovgpu.harvesterhci.io/driver-needed=true
that will force deployment of the driver
great-bear-19718
07/22/2024, 11:49 PM
great-bear-19718
07/22/2024, 11:50 PM
numerous-angle-77908
07/22/2024, 11:51 PM
Post-install sanity check passed.
2024-07-22T22:57:42.791341063Z
2024-07-22T22:57:42.791359988Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.791374576Z
Mon, Jul 22 2024 4:57:42 pm
running nvidia vgpud
2024-07-22T22:57:42.898278808Z Creating '/dev/char' directory
great-bear-19718
07/22/2024, 11:51 PM
numerous-angle-77908
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:52 PM
numerous-angle-77908
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:52 PM
great-bear-19718
07/22/2024, 11:53 PM
sriov_vf_device is the file we check
great-bear-19718
07/22/2024, 11:53 PM
great-bear-19718
07/22/2024, 11:53 PM
/sys/bus/pci/devices/ADDRESS/
numerous-angle-77908
07/22/2024, 11:56 PM
great-bear-19718
07/22/2024, 11:57 PM
great-bear-19718
07/22/2024, 11:57 PM
great-bear-19718
07/22/2024, 11:57 PM
numerous-angle-77908
07/22/2024, 11:58 PM
Post-install sanity check passed.
2024-07-22T22:57:42.825717826Z
2024-07-22T22:57:42.825728506Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.825737192Z
running nvidia vgpud
2024-07-22T22:57:42.924907965Z Creating '/dev/char' directory
great-bear-19718
07/22/2024, 11:58 PM
great-bear-19718
07/22/2024, 11:58 PM
great-bear-19718
07/22/2024, 11:59 PM
vgpud daemon and waits for subsequent requests from the pcidevices controller when you configure vGPUs
numerous-angle-77908
07/23/2024, 12:02 AM
numerous-angle-77908
07/23/2024, 12:03 AM
numerous-angle-77908
07/23/2024, 12:03 AM
great-bear-19718
07/23/2024, 12:08 AM
great-bear-19718
07/23/2024, 12:09 AM
great-bear-19718
07/23/2024, 12:09 AM
numerous-angle-77908
07/23/2024, 12:09 AM
numerous-angle-77908
07/23/2024, 12:10 AM
great-bear-19718
07/23/2024, 12:10 AM
numerous-angle-77908
07/23/2024, 12:10 AM
great-bear-19718
07/23/2024, 12:10 AM
great-bear-19718
07/23/2024, 12:10 AM
numerous-angle-77908
07/23/2024, 12:10 AM
numerous-angle-77908
07/23/2024, 12:13 AM
great-bear-19718
07/23/2024, 12:14 AM
nvidia-smi and see what the GPU says
great-bear-19718
07/23/2024, 12:14 AM
sriov-manage -e ALL will enable SR-IOV GPUs if they are supported and will tell us what we need to know
numerous-angle-77908
07/23/2024, 12:14 AM
great-bear-19718
07/23/2024, 12:18 AM
great-bear-19718
07/23/2024, 12:18 AM
lspci?
great-bear-19718
07/23/2024, 12:18 AM
numerous-angle-77908
07/23/2024, 12:19 AM
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
great-bear-19718
07/23/2024, 12:19 AM
great-bear-19718
07/23/2024, 12:20 AM
/sys/bus/pci/devices/0000:0b:00.0 do you see the sriov_vf_device file?
numerous-angle-77908
07/23/2024, 12:21 AM
numerous-angle-77908
07/23/2024, 12:22 AM
great-bear-19718
07/23/2024, 12:23 AM
numerous-angle-77908
07/23/2024, 12:31 AM
numerous-angle-77908
07/23/2024, 12:35 AM
great-bear-19718
07/23/2024, 12:41 AM
nvidia-smi -q should see the GPUs
numerous-angle-77908
07/23/2024, 12:47 AM
numerous-angle-77908
07/23/2024, 12:47 AM
great-bear-19718
07/23/2024, 12:50 AM
great-bear-19718
07/23/2024, 12:50 AM
numerous-angle-77908
07/23/2024, 12:50 AM
great-bear-19718
07/23/2024, 12:52 AM
great-bear-19718
07/23/2024, 12:52 AM
dmesg from the host
great-bear-19718
07/23/2024, 12:52 AM
numerous-angle-77908
07/23/2024, 12:55 AM
great-bear-19718
07/23/2024, 12:56 AM
great-bear-19718
07/23/2024, 12:56 AM
ls -lart /sys/bus/pci/devices/0000:0b:00.0/ would be nice
great-bear-19718
07/23/2024, 12:57 AM
[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
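Error -2 in the "Direct firmware load" line is ENOENT: the kernel could not find nvidia/550.90.05/gsp_ga10x.bin in its firmware search path. A minimal sketch for checking this from a shell follows. It assumes the kernel's default search directories under /lib/firmware (a custom firmware_class.path and compressed firmware files are not covered). `find_firmware` and its optional root argument are hypothetical names, not part of the driver or of Harvester.

```sh
# Look for a firmware file in the kernel's default firmware directories.
# $1: path as printed in dmesg, e.g. nvidia/550.90.05/gsp_ga10x.bin
# $2: optional filesystem root (hypothetical parameter, handy for chroots or tests)
find_firmware() {
    fw_rel=$1
    fw_root=${2:-}
    kver=$(uname -r)
    for dir in \
        "/lib/firmware/updates/$kver" \
        "/lib/firmware/updates" \
        "/lib/firmware/$kver" \
        "/lib/firmware"
    do
        if [ -f "$fw_root$dir/$fw_rel" ]; then
            # Print the first location that holds the file.
            printf '%s\n' "$fw_root$dir/$fw_rel"
            return 0
        fi
    done
    echo "missing: $fw_rel" >&2
    return 1
}
```

Running `find_firmware nvidia/550.90.05/gsp_ga10x.bin` both inside the driver pod and on the host may help narrow down where the file is absent.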
numerous-angle-77908
07/23/2024, 12:59 AM
nvidia-driver-runtime-jdv44:/ # ls -lart /sys/bus/pci/devices/0000:0b:00.0/
total 0
-r--r--r-- 1 root root 4096 Jul 22 22:54 vendor
-rw-r--r-- 1 root root 4096 Jul 22 22:54 uevent
-r--r--r-- 1 root root 4096 Jul 22 22:54 subsystem_device
lrwxrwxrwx 1 root root 0 Jul 22 22:54 subsystem -> ../../../../bus/pci
-rw------- 1 root root 524288 Jul 22 22:54 rom
-r--r--r-- 1 root root 4096 Jul 22 22:54 revision
-rw------- 1 root root 128 Jul 22 22:54 resource5
-rw------- 1 root root 33554432 Jul 22 22:54 resource3_wc
-rw------- 1 root root 33554432 Jul 22 22:54 resource3
-rw------- 1 root root 268435456 Jul 22 22:54 resource1_wc
-rw------- 1 root root 268435456 Jul 22 22:54 resource1
-rw------- 1 root root 16777216 Jul 22 22:54 resource0
-r--r--r-- 1 root root 4096 Jul 22 22:54 resource
-rw-r--r-- 1 root root 4096 Jul 22 22:54 reset_method
--w------- 1 root root 4096 Jul 22 22:54 reset
--w------- 1 root root 4096 Jul 22 22:54 rescan
--w--w---- 1 root root 4096 Jul 22 22:54 remove
-r--r--r-- 1 root root 4096 Jul 22 22:54 power_state
drwxr-xr-x 2 root root 0 Jul 22 22:54 power
-rw-r--r-- 1 root root 4096 Jul 22 22:54 numa_node
-rw-r--r-- 1 root root 4096 Jul 22 22:54 msi_bus
-r--r--r-- 1 root root 4096 Jul 22 22:54 modalias
-r--r--r-- 1 root root 4096 Jul 22 22:54 max_link_width
-r--r--r-- 1 root root 4096 Jul 22 22:54 max_link_speed
-r--r--r-- 1 root root 4096 Jul 22 22:54 local_cpus
-r--r--r-- 1 root root 4096 Jul 22 22:54 local_cpulist
drwxr-xr-x 2 root root 0 Jul 22 22:54 link
-r--r--r-- 1 root root 4096 Jul 22 22:54 irq
lrwxrwxrwx 1 root root 0 Jul 22 22:54 iommu_group -> ../../../../kernel/iommu_groups/19
lrwxrwxrwx 1 root root 0 Jul 22 22:54 iommu -> ../../0000:00:00.2/iommu/ivhd0
-rw-r--r-- 1 root root 4096 Jul 22 22:54 enable
-rw-r--r-- 1 root root 4096 Jul 22 22:54 driver_override
-r--r--r-- 1 root root 4096 Jul 22 22:54 dma_mask_bits
-r--r--r-- 1 root root 4096 Jul 22 22:54 device
-rw-r--r-- 1 root root 4096 Jul 22 22:54 d3cold_allowed
-r--r--r-- 1 root root 4096 Jul 22 22:54 current_link_width
-r--r--r-- 1 root root 4096 Jul 22 22:54 current_link_speed
lrwxrwxrwx 1 root root 0 Jul 22 22:54 consumer:pci:0000:0b:00.1 -> ../../../virtual/devlink/pci:0000:0b:00.0--pci:0000:0b:00.1
-r--r--r-- 1 root root 4096 Jul 22 22:54 consistent_dma_mask_bits
-rw-r--r-- 1 root root 4096 Jul 22 22:54 config
-r--r--r-- 1 root root 4096 Jul 22 22:54 class
-rw-r--r-- 1 root root 4096 Jul 22 22:54 broken_parity_status
-r--r--r-- 1 root root 4096 Jul 22 22:54 boot_vga
-r--r--r-- 1 root root 4096 Jul 22 22:54 ari_enabled
-r--r--r-- 1 root root 4096 Jul 22 22:54 aer_dev_nonfatal
-r--r--r-- 1 root root 4096 Jul 22 22:54 aer_dev_fatal
-r--r--r-- 1 root root 4096 Jul 22 22:54 aer_dev_correctable
drwxr-xr-x 13 root root 0 Jul 22 22:54 ..
drwxr-xr-x 4 root root 0 Jul 22 22:54 .
-r--r--r-- 1 root root 4096 Jul 22 22:57 subsystem_vendor
lrwxrwxrwx 1 root root 0 Jul 23 00:44 driver -> ../../../../bus/pci/drivers/nvidia
great-bear-19718
07/23/2024, 1:00 AM
great-bear-19718
07/23/2024, 1:00 AM
numerous-angle-77908
07/23/2024, 1:01 AM
great-bear-19718
07/23/2024, 1:01 AM
numerous-angle-77908
07/23/2024, 1:04 AM
numerous-angle-77908
07/23/2024, 1:06 AM
great-bear-19718
07/23/2024, 1:07 AM
numerous-angle-77908
07/23/2024, 1:14 AM
numerous-angle-77908
07/23/2024, 1:16 AM
numerous-angle-77908
07/23/2024, 1:21 AM
numerous-angle-77908
07/23/2024, 1:26 AM
numerous-angle-77908
07/23/2024, 7:01 PM
numerous-angle-77908
07/23/2024, 7:31 PM
numerous-angle-77908
07/23/2024, 7:33 PM
addons:
  harvester_vm_import_controller:
    enabled: false
    values_content: ""
  harvester_pcidevices_controller:
    enabled: false
    values_content: ""
  rancher_monitoring:
    enabled: false
    values_content: ""
  rancher_logging:
    enabled: false
    values_content: ""
  harvester_seeder:
    enabled: false
    values_content: ""
  nvidia_driver_toolkit:
    enabled: true
    values_content: ""
numerous-angle-77908
07/23/2024, 8:50 PM
great-bear-19718
07/23/2024, 11:01 PM
[ 183.294677] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 183.298169] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 183.308629] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 183.311716] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[ 225.816298] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
has to be this..
great-bear-19718
07/23/2024, 11:02 PM
numerous-angle-77908
07/24/2024, 5:34 PM
numerous-angle-77908
07/24/2024, 6:04 PM
numerous-angle-77908
07/24/2024, 9:11 PM
numerous-angle-77908
07/24/2024, 9:41 PM
great-bear-19718
07/24/2024, 11:03 PM
great-bear-19718
07/24/2024, 11:03 PM
numerous-angle-77908
07/24/2024, 11:11 PM
numerous-angle-77908
07/26/2024, 4:56 PM
numerous-angle-77908
07/26/2024, 8:32 PM
sudo bash -c 'echo "options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
So I guess the question becomes: how do I get this nvidia.conf file injected into the Pod before the driver install?
numerous-angle-77908
07/26/2024, 8:58 PM
sudo /usr/bin/nvidia-uninstall
sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force
sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
sudo /tmp/NVIDIA.run
nvidia-smi
The next step is: how do I get the vGPUs to show up in Harvester?
numerous-angle-77908
07/26/2024, 9:49 PM
numerous-angle-77908
07/26/2024, 10:01 PM
vGPU Device Capability
    Fractional Multi-vGPU : Supported
    Heterogeneous Time-Slice Profiles : Supported
GPU Virtualization Mode
    Virtualization Mode : Host VGPU
    Host VGPU Mode : SR-IOV
https://pastebin.com/raw/wuVMyCCR
great-bear-19718
07/28/2024, 11:43 PM
great-bear-19718
07/28/2024, 11:44 PM
/usr/local/nvidia/sriov-manage -e 00000000:41:00.0 and see what happens
numerous-angle-77908
07/29/2024, 2:29 PM
nvidia-driver-runtime-9dz9v:/usr/local # /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
nvidia-driver-runtime-9dz9v:/usr/local # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
Is there a log to look at to see why they are not being created?
Also, I still have the pcidevices-controller addon disabled, so I cannot see anything in the UI yet. I tried enabling it, but when I do that it deletes the nvidia-driver-runtime pods that are created from the DaemonSets. What does the pcidevices-controller look for to determine whether the pod needs to be created or deleted?
nvidia-driver-runtime-9dz9v:/ # lsmod | grep vfio
nvidia_vgpu_vfio 69632 0
mdev 28672 1 nvidia_vgpu_vfio
vfio_iommu_type1 40960 0
vfio 45056 3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
kvm 1056768 2 kvm_amd,nvidia_vgpu_vfio
irqbypass 16384 2 nvidia_vgpu_vfio,kvm
nvidia-driver-runtime-9dz9v:/ # modinfo nvidia_vgpu_vfio
filename: /lib/modules/5.14.21-150400.24.119-default/kernel/drivers/video/nvidia-vgpu-vfio.ko
softdep: pre: nvidia
import_ns: IOMMUFD
version: 535.183.04
supported: external
license: Dual MIT/GPL
suserelease: SLE15-SP4
srcversion: 81DFBEABED825A6138217B0
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: vfio,mdev,irqbypass,kvm
retpoline: Y
name: nvidia_vgpu_vfio
vermagic: 5.14.21-150400.24.119-default SMP preempt mod_unload modversions
great-bear-19718
07/29/2024, 11:04 PM
great-bear-19718
07/29/2024, 11:04 PM
sriov-manage -e pciaddress
do you see VFs created in /sys/bus/pci/devices/pciaddress?
great-bear-19718
07/29/2024, 11:07 PM
numerous-angle-77908
07/30/2024, 3:37 PM
numerous-angle-77908
07/30/2024, 3:39 PM
nvidia-driver-runtime-9dz9v:/ # lspci | grep NVIDIA
41:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
41:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
nvidia-driver-runtime-9dz9v:/ # sudo /usr/lib/nvidia/sriov-manage -e 00:41:0000.0
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
nvidia-driver-runtime-9dz9v:/ #
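Note that the address passed in the last attempt, 00:41:0000.0, is not in the domain:bus:device.function form that sysfs uses (0000:41:00.0). The sketch below, assuming only a POSIX shell and grep, validates an address in the sysfs form and counts the virtfn* links under it. `is_pci_address`, `count_vfs`, and the optional sysfs root argument are hypothetical helpers, not part of sriov-manage.

```sh
# Succeed only for a sysfs-style PCI address: DDDD:BB:DD.F (hex digits, function 0-7).
is_pci_address() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
}

# Print the number of virtfn* links under a physical function.
# $1: PCI address; $2: optional sysfs root (hypothetical parameter, defaults to /sys)
count_vfs() {
    addr=$1
    sysfs=${2:-/sys}
    if ! is_pci_address "$addr"; then
        echo "malformed PCI address: $addr" >&2
        return 2
    fi
    dev="$sysfs/bus/pci/devices/$addr"
    if [ ! -d "$dev" ]; then
        echo "no such device: $addr" >&2
        return 1
    fi
    n=0
    for vf in "$dev"/virtfn*; do
        # The glob stays literal when nothing matches, so check that the entry exists.
        if [ -e "$vf" ] || [ -L "$vf" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

`count_vfs 0000:41:00.0` printing 0 after sriov-manage has run would confirm that no VFs were created for that device.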
numerous-angle-77908
07/30/2024, 3:41 PM
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/
total 0
-r--r--r-- 1 root root 4096 Jul 27 12:34 aer_dev_correctable
-r--r--r-- 1 root root 4096 Jul 27 12:34 aer_dev_fatal
-r--r--r-- 1 root root 4096 Jul 27 12:34 aer_dev_nonfatal
-r--r--r-- 1 root root 4096 Jul 27 12:34 ari_enabled
-r--r--r-- 1 root root 4096 Jul 27 12:34 boot_vga
-rw-r--r-- 1 root root 4096 Jul 27 12:34 broken_parity_status
-r--r--r-- 1 root root 4096 Jul 27 12:33 class
-rw-r--r-- 1 root root 4096 Jul 27 12:33 config
-r--r--r-- 1 root root 4096 Jul 27 12:34 consistent_dma_mask_bits
lrwxrwxrwx 1 root root 0 Jul 27 12:34 consumer:pci:0000:41:00.1 -> ../../../virtual/devlink/pci:0000:41:00.0--pci:0000:41:00.1
-r--r--r-- 1 root root 4096 Jul 27 12:34 current_link_speed
-r--r--r-- 1 root root 4096 Jul 27 12:34 current_link_width
-rw-r--r-- 1 root root 4096 Jul 27 12:34 d3cold_allowed
-r--r--r-- 1 root root 4096 Jul 27 12:33 device
-r--r--r-- 1 root root 4096 Jul 27 12:34 dma_mask_bits
lrwxrwxrwx 1 root root 0 Jul 27 12:34 driver -> ../../../../bus/pci/drivers/nvidia
-rw-r--r-- 1 root root 4096 Jul 27 12:34 driver_override
-rw-r--r-- 1 root root 4096 Jul 27 12:34 enable
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-10
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-5
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-6
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-7
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-8
drwxr-xr-x 3 root root 0 Jul 27 12:33 i2c-9
lrwxrwxrwx 1 root root 0 Jul 27 12:34 iommu -> ../../0000:40:00.2/iommu/ivhd2
lrwxrwxrwx 1 root root 0 Jul 27 12:34 iommu_group -> ../../../../kernel/iommu_groups/44
-r--r--r-- 1 root root 4096 Jul 27 12:33 irq
drwxr-xr-x 2 root root 0 Jul 27 12:34 link
-r--r--r-- 1 root root 4096 Jul 27 12:34 local_cpulist
-r--r--r-- 1 root root 4096 Jul 27 12:34 local_cpus
-r--r--r-- 1 root root 4096 Jul 27 12:34 max_link_speed
-r--r--r-- 1 root root 4096 Jul 27 12:34 max_link_width
-r--r--r-- 1 root root 4096 Jul 27 12:34 modalias
-rw-r--r-- 1 root root 4096 Jul 27 12:34 msi_bus
drwxr-xr-x 2 root root 0 Jul 27 12:34 msi_irqs
-rw-r--r-- 1 root root 4096 Jul 27 12:34 numa_node
drwxr-xr-x 2 root root 0 Jul 27 12:34 power
-r--r--r-- 1 root root 4096 Jul 27 12:34 power_state
--w--w---- 1 root root 4096 Jul 27 12:34 remove
--w------- 1 root root 4096 Jul 27 12:34 rescan
--w------- 1 root root 4096 Jul 27 12:34 reset
-rw-r--r-- 1 root root 4096 Jul 27 12:34 reset_method
-r--r--r-- 1 root root 4096 Jul 27 12:33 resource
-rw------- 1 root root 16777216 Jul 27 12:34 resource0
-rw------- 1 root root 268435456 Jul 27 12:34 resource1
-rw------- 1 root root 268435456 Jul 27 12:34 resource1_wc
-rw------- 1 root root 33554432 Jul 27 12:34 resource3
-rw------- 1 root root 33554432 Jul 27 12:34 resource3_wc
-rw------- 1 root root 128 Jul 27 12:34 resource5
-r--r--r-- 1 root root 4096 Jul 27 12:33 revision
-rw------- 1 root root 524288 Jul 27 12:34 rom
lrwxrwxrwx 1 root root 0 Jul 27 12:31 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root 4096 Jul 27 12:33 subsystem_device
-r--r--r-- 1 root root 4096 Jul 27 12:33 subsystem_vendor
-rw-r--r-- 1 root root 4096 Jul 27 12:31 uevent
-r--r--r-- 1 root root 4096 Jul 27 12:31 vendor
great-bear-19718
07/30/2024, 9:58 PM
great-bear-19718
07/30/2024, 9:59 PM
numerous-angle-77908
07/31/2024, 4:22 PM
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
Runs and creates the devices. So now I am moving that card over to the Harvester node and I will proceed with testing there.
numerous-angle-77908
07/31/2024, 4:35 PM
nvidia-driver-runtime-mv584:/ # sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
nvidia-driver-runtime-mv584:/ # ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn0 -> ../0000:0b:00.4
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn1 -> ../0000:0b:00.5
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn10 -> ../0000:0b:01.6
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn11 -> ../0000:0b:01.7
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn12 -> ../0000:0b:02.0
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn13 -> ../0000:0b:02.1
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn14 -> ../0000:0b:02.2
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn15 -> ../0000:0b:02.3
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn16 -> ../0000:0b:02.4
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn17 -> ../0000:0b:02.5
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn18 -> ../0000:0b:02.6
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn19 -> ../0000:0b:02.7
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn2 -> ../0000:0b:00.6
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn20 -> ../0000:0b:03.0
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn21 -> ../0000:0b:03.1
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn22 -> ../0000:0b:03.2
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn23 -> ../0000:0b:03.3
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn24 -> ../0000:0b:03.4
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn25 -> ../0000:0b:03.5
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn26 -> ../0000:0b:03.6
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn27 -> ../0000:0b:03.7
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn28 -> ../0000:0b:04.0
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn29 -> ../0000:0b:04.1
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn3 -> ../0000:0b:00.7
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn30 -> ../0000:0b:04.2
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn31 -> ../0000:0b:04.3
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn4 -> ../0000:0b:01.0
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn5 -> ../0000:0b:01.1
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn6 -> ../0000:0b:01.2
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn7 -> ../0000:0b:01.3
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn8 -> ../0000:0b:01.4
lrwxrwxrwx 1 root root 0 Jul 31 16:26 virtfn9 -> ../0000:0b:01.5
great-bear-19718
07/31/2024, 10:59 PM
great-bear-19718
07/31/2024, 10:59 PM
numerous-angle-77908
08/01/2024, 1:51 AM
great-bear-19718
08/01/2024, 1:58 AM
great-bear-19718
08/01/2024, 1:59 AM
numerous-angle-77908
08/01/2024, 2:49 AM
numerous-angle-77908
08/01/2024, 2:49 AM
great-bear-19718
08/01/2024, 2:51 AM
great-bear-19718
08/01/2024, 2:51 AM
numerous-angle-77908
08/01/2024, 3:29 AM
numerous-angle-77908
08/01/2024, 3:29 AM
numerous-angle-77908
08/01/2024, 4:43 AM
numerous-angle-77908
08/01/2024, 5:05 AM
Failed creating server [fleet-default/gputest-gpu4-b6c0da6d-glzm7] of kind (HarvesterMachine) for machine gputest-gpu4-554fffb4fbxz9xx4-lvh65 in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Creating machine... Error creating machine: Error in driver during machine creation: Too many retries waiting for machine to be Running. Last error: Maximum number of retries (120) exceeded
numerous-angle-77908
08/01/2024, 2:17 PM
Virt-launcher pod has not yet been scheduled
numerous-angle-77908
08/01/2024, 3:53 PM
{"component":"virt-controller","kind":"","level":"error","msg":"Updating the VirtualMachine status failed.","name":"testrocky","namespace":"koatprod","pos":"vm.go:387","reason":"Operation cannot be fulfilled on virtualmachines.kubevirt.io \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827179Z","uid":"4ff39231-46d3-40bb-9b2a-ce9ce1d6f500"}
2024-08-01T15:51:44.827308116Z {"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine koatprod/testrocky","pos":"vm.go:281","reason":"Operation cannot be fulfilled on virtualmachines.kubevirt.io \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827228Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.861151Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.931519Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.040344Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.234341Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.586475Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:46.250799Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:47.564281Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:50.160744Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:55.319318Z"}
{"component":"virt-controller","level":"info","msg":"Updating VMIs phase metrics","pos":"collector.go:245","timestamp":"2024-08-01T15:52:04.474730Z"}
{"component":"virt-controller","level":"info","msg":"phase map[{Phase:pending OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev2}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev7}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev8}:1]","pos":"collector.go:247","timestamp":"2024-08-01T15:52:04.474862Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU nvidia.com/NVIDIA_RTX5000-Ada-4Q is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:52:05.590622Z"}
{"component":"virt-controller","level":"info","msg":"TSC Freqency node update status: 0 updated, 8 skipped, 0 errors","pos":"nodetopologyupdater.go:44","timestamp":"2024-08-01T15:52:10.898343Z"}
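The repeated "reenqueuing" lines all carry the same reason, so it can help to reduce the log to the distinct resource names being rejected. A rough sketch with sed, assuming the message wording shown in the log above; `rejected_gpus` is a hypothetical helper name.

```sh
# Read virt-controller log lines on stdin and print each distinct GPU
# resource name that was rejected by permittedHostDevices, one per line.
rejected_gpus() {
    sed -n 's/.*GPU \([^ ]*\) is not permitted in permittedHostDevices.*/\1/p' | sort -u
}
```

Usage would be something like `kubectl logs -n harvester-system deploy/virt-controller | rejected_gpus`, where the namespace and deployment name are assumptions about a default Harvester install.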
great-bear-19718
08/04/2024, 11:45 PM
great-bear-19718
08/04/2024, 11:46 PM
numerous-angle-77908
08/06/2024, 2:51 AM
numerous-angle-77908
08/06/2024, 3:02 AM
great-bear-19718
08/06/2024, 3:04 AM
great-bear-19718
08/06/2024, 3:04 AM
great-bear-19718
08/06/2024, 3:05 AM
great-bear-19718
08/06/2024, 3:06 AM
numerous-angle-77908
08/06/2024, 12:38 PM
numerous-angle-77908
08/06/2024, 1:32 PM
numerous-angle-77908
08/06/2024, 2:21 PM
great-bear-19718
08/07/2024, 5:33 AM
great-bear-19718
08/07/2024, 5:33 AM
great-bear-19718
08/07/2024, 5:33 AM
nvidia.com/NVIDIA_RTX5000-ADA-8Q
great-bear-19718
08/07/2024, 5:33 AM
nvidia.com/NVIDIA_RTX5000-Ada-8Q
which is not the same, as it is case sensitive
numerous-angle-77908
08/07/2024, 1:37 PM
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # ls
nvidia-1028 nvidia-1029 nvidia-1030 nvidia-1031 nvidia-1032 nvidia-1033 nvidia-1034 nvidia-1035 nvidia-1036 nvidia-1037 nvidia-1038 nvidia-1039 nvidia-1040 nvidia-1041
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # cat nvidia-1033/name
NVIDIA RTX5000-Ada-8Q
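To see every profile name at once instead of reading one name file at a time, a small loop over mdev_supported_types is enough. This is only a sketch: `list_mdev_types` is a hypothetical helper, and it assumes each type directory exposes a name file as nvidia-1033 does above.

```sh
# Print "<type-id> <name>" for every mdev type a device advertises.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:0b:00.0/virtfn0
list_mdev_types() {
    types_dir="$1/mdev_supported_types"
    if [ ! -d "$types_dir" ]; then
        echo "no mdev_supported_types under $1" >&2
        return 1
    fi
    for t in "$types_dir"/*; do
        # Skip entries without a name file (also covers an unmatched glob).
        if [ -f "$t/name" ]; then
            printf '%s %s\n' "$(basename "$t")" "$(cat "$t/name")"
        fi
    done
}
```

For example: `list_mdev_types /sys/bus/pci/devices/0000:0b:00.0/virtfn0`.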
numerous-angle-77908
08/07/2024, 2:25 PM
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/mdev/devices
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls
475c556b-4efb-4714-974c-e080715f8a6d 5bbbece9-d755-4e38-a34f-cae18bfeda0b 6f520761-e091-4a9d-b012-dab7ac476f4d 77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls -l
total 0
lrwxrwxrwx 1 root root 0 Aug 6 13:48 475c556b-4efb-4714-974c-e080715f8a6d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.5/475c556b-4efb-4714-974c-e080715f8a6d
lrwxrwxrwx 1 root root 0 Aug 6 13:48 5bbbece9-d755-4e38-a34f-cae18bfeda0b -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.6/5bbbece9-d755-4e38-a34f-cae18bfeda0b
lrwxrwxrwx 1 root root 0 Aug 6 13:48 6f520761-e091-4a9d-b012-dab7ac476f4d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.4/6f520761-e091-4a9d-b012-dab7ac476f4d
lrwxrwxrwx 1 root root 0 Aug 6 13:48 77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37 -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.7/77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # cd 475c556b-4efb-4714-974c-e080715f8a6d
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # ls
driver iommu_group mdev_type nvidia power remove subsystem uevent
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # cd mdev_type
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # ls
available_instances create description device_api devices name
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # cat name
NVIDIA RTX5000-Ada-8Q
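Since the match is case sensitive, the name reported by sysfs can be compared with the deviceName in the VM spec. The sketch assumes the resource name is the mdev type name with spaces replaced by underscores and a nvidia.com/ prefix, which matches the names seen in this thread but is not confirmed as the controller's actual rule. `mdev_resource_name` and `compare_resource_names` are hypothetical helpers.

```sh
# Derive a resource name from an mdev type name (assumed convention:
# spaces become underscores, prefixed with nvidia.com/).
mdev_resource_name() {
    printf 'nvidia.com/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

# Compare two resource names.
# Returns 0 on an exact match, 1 if they differ only in case, 2 otherwise.
compare_resource_names() {
    if [ "$1" = "$2" ]; then
        return 0
    fi
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    if [ "$a" = "$b" ]; then
        echo "case mismatch: '$1' vs '$2'" >&2
        return 1
    fi
    echo "different names: '$1' vs '$2'" >&2
    return 2
}
```

For example, `compare_resource_names "$(mdev_resource_name "$(cat mdev_type/name)")" nvidia.com/NVIDIA_RTX5000-ADA-8Q` would report a case mismatch for the device shown above.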
great-bear-19718
08/07/2024, 10:41 PM
great-bear-19718
08/07/2024, 10:44 PM
great-bear-19718
08/07/2024, 10:44 PM
great-bear-19718
08/07/2024, 10:44 PM
great-bear-19718
08/07/2024, 11:50 PM
gmehta3/pcidevices:vgpu-type-fix
great-bear-19718
08/07/2024, 11:51 PM
great-bear-19718
08/08/2024, 12:09 AM
numerous-angle-77908
08/08/2024, 8:37 PM
"Failed creating server [fleet-default/gputest-gpu-6eb49156-dvm6n] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-rq7q6 in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)""
AND
"Failed deleting server [fleet-default/gputest-gpu-6eb49156-bj4p5] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-4zbjp in infrastructure provider: DeleteError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped error loading host gputest-gpu-6eb49156-bj4p5: Docker machine "gputest-gpu-6eb49156-bj4p5" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one."
When I try to use the vGPU in a VM, it sits at:
Virt-launcher pod has not yet been scheduled
The YAML for the VM still uses lowercase; is that intended?
gpus:
  - deviceName: nvidia.com/NVIDIA_RTX5000-Ada-8Q
    name: harvesterdev7-00000b007
I have also tried a manual update to the YAML config:
gpus:
  - deviceName: nvidia.com/NVIDIA_RTX5000-ADA-8Q
    name: harvesterdev7-00000b007
It looks like it got further; see the logs here: https://pastebin.com/raw/wg1Ktrpe
Here is another support package.
I initially tried your instructions: disable the vgpudevices / sriovgpudevice, then re-enable the GPU and the new vgpudevices.
Then I tried completely disabling the addon and re-enabling everything.
numerous-angle-77908
08/12/2024, 8:02 PM