# harvester
g
the driver is loaded by the toolkit pod in harvester-system namespace
it will be scheduled to nodes which have the GPUs and will load the driver and nvidia utils.. you can exec into the pod and check the
nvidia-smi
output
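For reference, a minimal sketch of that check, assuming the pods carry the nvidia-driver-runtime name seen later in this thread (replace the pod suffix with the real one):
Copy code
# find the toolkit pod running on the GPU node, then exec into it
kubectl -n harvester-system get pods -o wide | grep nvidia-driver-runtime
kubectl -n harvester-system exec -it nvidia-driver-runtime-xxxxx -- nvidia-smi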
n
Thx for responding @great-bear-19718 I haven't got the devices.harvesterhci.io.sriovgpudevice showing yet after the nvidia-driver-runtime runs
g
the pcidevices controller creates those.. i assume pcidevices controller is enabled?
n
Yes pcidevices controller is enabled. But I have not enabled passthrough on the PCI Devices yet for the 2 GPU's
g
you do not need to enable passthrough for GPUs
that is only for passing a full GPU through to a VM
n
Ok good to know
g
a support bundle would be nice to figure out what is going on.. a5000 is based on ampere arch so should be supported
n
So I guess the last thing to figure out is how to get them to show in SR-IOV GPU Devices
No these are the A5000 ADA
g
pcidevices scans the /sys tree
n
Lovelace
g
i will need to check.. likely then
do they support sriov vgpu or mig based vgpu only?
n
sriov vgpu
g
there is a check in pcidevices to look for the same and skip gpu's which do not support sriov vgpu
n
The MOBO has SR-IOV and IOMMU enabled
g
i would need to see a support bundle in that case
n
Generating now
g
nvidia's documentation is confusing at best.. it says vgpu.. but then i know vgpu is via mig and sriov based mechanisms
n
ya I can't afford the mig cards they were a bit too much $$,$$$
I did call PNY and got confirmation from them today that these do have SR-IOV support
Also I am using the latest kvm.run driver 17.3
The PCI devices that show for these cards are labeled as description:VGA compatible controller: NVIDIA Corporation AD102GL [RTX 5000 Ada Generation]
g
Copy code
nvidia-driver-runtime             0         0         0       0            0           sriovgpu.harvesterhci.io/driver-needed=true   10d
driver is not loaded because none of the nodes have been labelled by pcidevices controller
which of your 2 nodes contain the gpu?
i see 8 nodes in the cluster
n
harvesterdev7 and harvesterdev8
g
so easy way.. is to disable pcidevices addon..
label the nodes 7 & 8 with extra label..
sriovgpu.harvesterhci.io/driver-needed=true
that will force deployment of driver
and we can check what nvidia-smi returns about the gpu
pcidevices checks existence of file in the /sys tree under the gpu pci address which i am trying to find
n
I've done that with the label sriovgpu.harvesterhci.io/driver-needed=true and things do install.
Copy code
Post-install sanity check passed.
2024-07-22T22:57:42.791341063Z 
2024-07-22T22:57:42.791359988Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.791374576Z 
running nvidia vgpud
2024-07-22T22:57:42.898278808Z Creating '/dev/char' directory
g
ok.. can you exec into the pod on node 7 then?
n
nvidia-smi when I shell into the node does not run
g
it needs to be in the pod
n
Ok
g
it is not installed on the host
based on your support bundle no pod for nvidia driver is running
because pcidevices controller will remove the label from node if gpu is not compatible
which is why i asked to disable pcidevices controller addon
sriov_vf_device
is the file we check
existence of this file is used to identify if GPU is sriov vgpu capable
/sys/bus/pci/devices/ADDRESS/
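A quick host-side version of that check, as a sketch; ADDRESS is the GPU's PCI address from lspci (e.g. 0000:0b:00.0 later in this thread):
Copy code
# run on the GPU node; sriov_vf_device only exists when the device advertises SR-IOV vGPU support
ADDRESS=0000:0b:00.0
ls /sys/bus/pci/devices/$ADDRESS/ | grep sriov || echo "no sriov_* entries for $ADDRESS"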
n
correct, there is no pod; after the daemonset runs the nvidia-driver-runtime install, the pod deletes itself
g
pod should not be deleted
the pod is needed to manage vgpus subsequently
pcidevices controller removes the label from node.. and that will cause k8s to remove pods since the scheduling criteria is not met
n
it deletes itself after the install
Copy code
Post-install sanity check passed.
2024-07-22T22:57:42.825717826Z 
2024-07-22T22:57:42.825728506Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.825737192Z 
running nvidia vgpud
2024-07-22T22:57:42.924907965Z Creating '/dev/char' directory
g
it should not.. it just waits
i would check the label on the nodes if they are still there
the pod runs the
vgpud
daemon and waits for subsequent requests from pcidevices controller when you configure vgpus
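A sketch of checking whether the label survived, using the node names from above:
Copy code
# shows the label value per node (an empty column means the label was removed)
kubectl get nodes harvesterdev7 harvesterdev8 -L sriovgpu.harvesterhci.io/driver-needed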
n
No that label no longer exists on the node 7 or 8
to force the install I have it in my config.yaml for the PXE install. labels: sriovgpu.harvesterhci.io/driver-needed: true
just for node 7 and 8
g
pcidevices will reconcile and remove that label since it manages it
which is why i asked for pcidevices to be disabled
and then the label added for debugging
n
they are disabled.
I can add the label again so it tries a reinstall no problem
g
the addon itself needs to be disabled
n
Ahhh
g
pcidevices addon runs the pcidevices controller
that manages the lifecycle of this label on nodes
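For reference, a sketch of toggling that addon from the CLI; the exact addon name and namespace are assumptions here, so confirm them with the first command:
Copy code
kubectl get addons.harvesterhci.io -A
# assuming the addon is named pcidevices-controller in the harvester-system namespace
kubectl -n harvester-system patch addons.harvesterhci.io pcidevices-controller \
  --type=merge -p '{"spec":{"enabled":false}}'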
n
K I disabled pcidevices-controller and re-added the labels. kubectl label nodes harvesterdev7 sriovgpu.harvesterhci.io/driver-needed=true kubectl label nodes harvesterdev8 sriovgpu.harvesterhci.io/driver-needed=true
g
once the pods are running you can exec into them and run
nvidia-smi
and see what the gpu says
sriov-manage -e ALL
will enable sriov gpus if they are supported and will tell us what we need to know
n
No Devices Found.
g
well that could be it..
lspci
?
does it show the gpu?
n
Copy code
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
g
wrong driver version?
if you shell to the node and check
/sys/bus/pci/devices/0000:0b:00.0
do you see the sriov_vf_device file?
n
I grabbed them from the nvidia licensing server and used NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run
using rancher/harvester-nvidia-driver-toolkit:v1.3-20240613
g
that is fine.. if the issue was the toolkit, the driver install would have failed
n
I'm going to download 16.7 driver and test
same issue with the 16.7
g
that is strange.. i would have thought
nvidia-smi -q
should see the gpus
n
went back to 13.12 and that driver errored on trying to install and noted it was incompatible. So I reinstalled 17.3
there is also an Ubuntu KVM version of the drivers, should I try that?
g
i dont think that is the one
the driver has to be the nvidia kvm generic one i think
n
I just downloaded the vgpuDriverCatalog and confirmed that 17.3 is the right driver
g
another support bundle now would help
or from the node 6 output of
dmesg
from the host
it may shed some light on why the gpu may not be visible
n
g
did you enable a pci passthrough for the gpus?
ls -lart /sys/bus/pci/devices/0000:0b:00.0/
would be nice
Copy code
[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
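Error -2 is ENOENT, so a sketch of a follow-up check is whether the GSP firmware files exist where the kernel looks for them:
Copy code
# run on the GPU node; the kernel's firmware loader searches /lib/firmware for nvidia/<version>/gsp_*.bin
ls -l /lib/firmware/nvidia/550.90.05/ 2>/dev/null || echo "no GSP firmware directory for this driver version"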
n
did you enable a pci passthrough for the gpus? No pcidevices-controller is still disabled.
Copy code
nvidia-driver-runtime-jdv44:/ # ls -lart /sys/bus/pci/devices/0000:0b:00.0/
total 0
-r--r--r--  1 root root      4096 Jul 22 22:54 vendor
-rw-r--r--  1 root root      4096 Jul 22 22:54 uevent
-r--r--r--  1 root root      4096 Jul 22 22:54 subsystem_device
lrwxrwxrwx  1 root root         0 Jul 22 22:54 subsystem -> ../../../../bus/pci
-rw-------  1 root root    524288 Jul 22 22:54 rom
-r--r--r--  1 root root      4096 Jul 22 22:54 revision
-rw-------  1 root root       128 Jul 22 22:54 resource5
-rw-------  1 root root  33554432 Jul 22 22:54 resource3_wc
-rw-------  1 root root  33554432 Jul 22 22:54 resource3
-rw-------  1 root root 268435456 Jul 22 22:54 resource1_wc
-rw-------  1 root root 268435456 Jul 22 22:54 resource1
-rw-------  1 root root  16777216 Jul 22 22:54 resource0
-r--r--r--  1 root root      4096 Jul 22 22:54 resource
-rw-r--r--  1 root root      4096 Jul 22 22:54 reset_method
--w-------  1 root root      4096 Jul 22 22:54 reset
--w-------  1 root root      4096 Jul 22 22:54 rescan
--w--w----  1 root root      4096 Jul 22 22:54 remove
-r--r--r--  1 root root      4096 Jul 22 22:54 power_state
drwxr-xr-x  2 root root         0 Jul 22 22:54 power
-rw-r--r--  1 root root      4096 Jul 22 22:54 numa_node
-rw-r--r--  1 root root      4096 Jul 22 22:54 msi_bus
-r--r--r--  1 root root      4096 Jul 22 22:54 modalias
-r--r--r--  1 root root      4096 Jul 22 22:54 max_link_width
-r--r--r--  1 root root      4096 Jul 22 22:54 max_link_speed
-r--r--r--  1 root root      4096 Jul 22 22:54 local_cpus
-r--r--r--  1 root root      4096 Jul 22 22:54 local_cpulist
drwxr-xr-x  2 root root         0 Jul 22 22:54 link
-r--r--r--  1 root root      4096 Jul 22 22:54 irq
lrwxrwxrwx  1 root root         0 Jul 22 22:54 iommu_group -> ../../../../kernel/iommu_groups/19
lrwxrwxrwx  1 root root         0 Jul 22 22:54 iommu -> ../../0000:00:00.2/iommu/ivhd0
-rw-r--r--  1 root root      4096 Jul 22 22:54 enable
-rw-r--r--  1 root root      4096 Jul 22 22:54 driver_override
-r--r--r--  1 root root      4096 Jul 22 22:54 dma_mask_bits
-r--r--r--  1 root root      4096 Jul 22 22:54 device
-rw-r--r--  1 root root      4096 Jul 22 22:54 d3cold_allowed
-r--r--r--  1 root root      4096 Jul 22 22:54 current_link_width
-r--r--r--  1 root root      4096 Jul 22 22:54 current_link_speed
lrwxrwxrwx  1 root root         0 Jul 22 22:54 consumer:pci:0000:0b:00.1 -> ../../../virtual/devlink/pci:0000:0b:00.0--pci:0000:0b:00.1
-r--r--r--  1 root root      4096 Jul 22 22:54 consistent_dma_mask_bits
-rw-r--r--  1 root root      4096 Jul 22 22:54 config
-r--r--r--  1 root root      4096 Jul 22 22:54 class
-rw-r--r--  1 root root      4096 Jul 22 22:54 broken_parity_status
-r--r--r--  1 root root      4096 Jul 22 22:54 boot_vga
-r--r--r--  1 root root      4096 Jul 22 22:54 ari_enabled
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_nonfatal
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_fatal
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_correctable
drwxr-xr-x 13 root root         0 Jul 22 22:54 ..
drwxr-xr-x  4 root root         0 Jul 22 22:54 .
-r--r--r--  1 root root      4096 Jul 22 22:57 subsystem_vendor
lrwxrwxrwx  1 root root         0 Jul 23 00:44 driver -> ../../../../bus/pci/drivers/nvidia
g
the sriov file is not there, which is why the gpu is not detected to begin with
can you reboot the node and see if it helps?
n
Kk
g
n
If it doesn't work after a restart lets come back to this another time with fresh eyes. I'll do some poking around on forums as well. I appreciate all the work you have already put into this and you have provided me some valuable checks to make along the way
Well the MOBO I have has a VGA out so I won't have any issues with switching the mode if needed
g
i think i had seen an issue in the past with a GPU when a VGA device was connected to the GPU
n
I have to create an account and ask for the mode selector tool to change the card to not use the video out
the restart didn't change anything with the card; nvidia-smi still results in No Devices were found
I am reaching out to my Nvidia rep on this to see if I can get some tech support from them on this as well. They have been pretty responsive.
Thx for your help tonight. I will see what comes back from Nvidia and this Mode selector tool
I've posted in the Forums and my Nvidia rep said they would respond there. https://forums.developer.nvidia.com/t/getting-vgpu-working-on-rtx-5000-ada/300866 Do you have any other ideas since yesterday?
I am going to do a full cluster reinstall and ensure I don't have the pcidevices controller enabled from the start while I wait.
I'll run with these addons
Copy code
addons:
    harvester_vm_import_controller:
      enabled: false
      values_content: ""
    harvester_pcidevices_controller:
      enabled: false
      values_content: ""
    rancher_monitoring:
      enabled: false
      values_content: ""
    rancher_logging:
      enabled: false
      values_content: ""
    harvester_seeder:
      enabled: false
      values_content: ""
    nvidia_driver_toolkit:
      enabled: true
      values_content: ""
I did a fresh install of harvester, this time only adding 2 nodes: node 1 (the default) and node 7 (a gpu node). Commands in the pod: https://pastebin.com/raw/TjBfDVq4
g
Copy code
[  183.294677] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.298169] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.308629] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.311716] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  225.816298] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
has to be this..
i do not have this specific GPU in our lab else would have already tried it out
n
@great-bear-19718 I am seeing someone post about running the install with sudo ./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run -m=kernel-open, then sudo update-initramfs -u and sudo reboot. Is there a way I can manually install the file within the nvidia-driver-runtime pod?
I figured it out: sudo /tmp/NVIDIA.run -m=kernel-open after uninstalling the driver that gets installed on pod load. This still didn't fix the kernel issue though
So there are some special instructions around the kernel modules that I found with the base Linux driver install, specifically for SUSE: https://www.nvidia.com/download/driverResults.aspx/228542/en-us/ https://en.opensuse.org/SDB:NVIDIA_drivers
g
i do not think this is needed.. i have seen another issue from an end user about A5000 vgpu passthrough working but then they have a windows guest
n
Ya the difference is the RTX A5000 is Ampere and the RTX 5000 ADA is Lovelace. Still just fighting to get nvidia-smi to recognize the GPU. I have installed the GPU into a Windows machine with the base driver and it does work, so I am not dealing with a hardware issue. Still waiting to hear from someone at nvidia
@great-bear-19718 I managed to get a copy of the nvidia-bug-report.log.gz if that helps diagnose these issues.
@great-bear-19718 I think I am ready to come back to you on this now. So I managed to get nvidia-smi to respond but I needed to add some options to the nvidia.conf
Copy code
sudo bash -c 'echo "options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
So I guess the question becomes how do I get this nvidia.conf file injected into the Pod before the Driver install.
How I get it working after the pod is up is I exec into the pod and run through these commands.
Copy code
sudo /usr/bin/nvidia-uninstall

sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force

sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'

sudo /tmp/NVIDIA.run

nvidia-smi
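As a sketch, assuming those NVreg parameters are exported by the loaded module, you can verify they actually took effect after a reinstall like the above:
Copy code
# confirm the nvidia module picked up the options from /etc/modprobe.d/nvidia.conf
cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpus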
The next step is how do I get the vGPUs to show up in Harvester
Ok one problem down. I created a custom image with my nvidia.conf added and it installed and works: registry.gitlab.com/koat-public/koat-nvidia-driver-toolkit. Now I just need to get the vGPUs going, which it would be great to have your help again on.
You will see that the card does have
Copy code
vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
https://pastebin.com/raw/wuVMyCCR
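For anyone following along, a sketch of how such a custom toolkit image can be built; the registry/tag below are placeholders, not the actual koat image:
Copy code
cat > Dockerfile <<'EOF'
FROM rancher/harvester-nvidia-driver-toolkit:v1.3-20240613
# bake the modprobe options in so they are present before the driver install runs
RUN printf 'options nvidia NVreg_EnableGpuFirmware=0\noptions nvidia NVreg_OpenRmEnableUnsupportedGpus=1\n' > /etc/modprobe.d/nvidia.conf
EOF
docker build -t registry.example.com/custom-nvidia-driver-toolkit:latest .
docker push registry.example.com/custom-nvidia-driver-toolkit:latest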
g
does the GPU now show up in harvester UI?
also can you please try
Copy code
/usr/local/nvidia/sriov-manage -e 00000000:41:00.0
and see what happens
n
@great-bear-19718 that script is in /usr/lib/nvidia, not /usr/local/nvidia, and when I run it nothing happens.
Copy code
nvidia-driver-runtime-9dz9v:/usr/local # /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
nvidia-driver-runtime-9dz9v:/usr/local # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
Is there a log to look at to see why they are not being created? Also, I still have the pcidevices-controller addon disabled so I cannot see anything in the UI yet. I tried enabling it, but when I do that it deletes the nvidia-driver-runtime pods that are created from the DaemonSets. What does the pcidevices-controller look for to determine if the pod needs to be created or deleted?
Copy code
nvidia-driver-runtime-9dz9v:/ # lsmod | grep vfio
nvidia_vgpu_vfio       69632  0
mdev                   28672  1 nvidia_vgpu_vfio
vfio_iommu_type1       40960  0
vfio                   45056  3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
kvm                  1056768  2 kvm_amd,nvidia_vgpu_vfio
irqbypass              16384  2 nvidia_vgpu_vfio,kvm
nvidia-driver-runtime-9dz9v:/ # modinfo nvidia_vgpu_vfio
filename:       /lib/modules/5.14.21-150400.24.119-default/kernel/drivers/video/nvidia-vgpu-vfio.ko
softdep:        pre: nvidia
import_ns:      IOMMUFD
version:        535.183.04
supported:      external
license:        Dual MIT/GPL
suserelease:    SLE15-SP4
srcversion:     81DFBEABED825A6138217B0
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        vfio,mdev,irqbypass,kvm
retpoline:      Y
name:           nvidia_vgpu_vfio
vermagic:       5.14.21-150400.24.119-default SMP preempt mod_unload modversions
g
yeah that is likely because of the file used to identify if GPU can do sriov vgpus
when you did the
sriov-manage -e pciaddress
do you see vf's created in /sys/bus/pci/devices/pciaddress ?
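Alongside grepping for virtfn links, the generic SR-IOV counters under the same path are a useful sketch of a check (same PCI address as above):
Copy code
# totalvfs > 0 means the PF supports SR-IOV; numvfs shows how many VFs are currently enabled
cat /sys/bus/pci/devices/0000:41:00.0/sriov_totalvfs 2>/dev/null
cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs 2>/dev/null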
n
No. nvidia-driver-runtime-9dz9v:/usr/local # /usr/lib/nvidia/sriov-manage -e 00004100.0 gives no error, and nvidia-driver-runtime-9dz9v:/usr/local # ls -l /sys/bus/pci/devices/00004100.0/ | grep virtfn doesn't show any new devices created. There must be an error somewhere or a log entry. Would you know where to look for either an error or a confirmation after running /usr/lib/nvidia/sriov-manage -e 00004100.0? I am also on with PNY, the manufacturer of the card, and they are saying it must be some OS issue. They instructed me to turn on the ECC memory of the card as a long shot, but something to try.
Copy code
nvidia-driver-runtime-9dz9v:/ # lspci | grep NVIDIA
41:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
41:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
nvidia-driver-runtime-9dz9v:/ # sudo /usr/lib/nvidia/sriov-manage -e 00:41:0000.0
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
nvidia-driver-runtime-9dz9v:/ #
Copy code
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/              
total 0
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_correctable
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_fatal
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_nonfatal
-r--r--r-- 1 root root      4096 Jul 27 12:34 ari_enabled
-r--r--r-- 1 root root      4096 Jul 27 12:34 boot_vga
-rw-r--r-- 1 root root      4096 Jul 27 12:34 broken_parity_status
-r--r--r-- 1 root root      4096 Jul 27 12:33 class
-rw-r--r-- 1 root root      4096 Jul 27 12:33 config
-r--r--r-- 1 root root      4096 Jul 27 12:34 consistent_dma_mask_bits
lrwxrwxrwx 1 root root         0 Jul 27 12:34 consumer:pci:0000:41:00.1 -> ../../../virtual/devlink/pci:0000:41:00.0--pci:0000:41:00.1
-r--r--r-- 1 root root      4096 Jul 27 12:34 current_link_speed
-r--r--r-- 1 root root      4096 Jul 27 12:34 current_link_width
-rw-r--r-- 1 root root      4096 Jul 27 12:34 d3cold_allowed
-r--r--r-- 1 root root      4096 Jul 27 12:33 device
-r--r--r-- 1 root root      4096 Jul 27 12:34 dma_mask_bits
lrwxrwxrwx 1 root root         0 Jul 27 12:34 driver -> ../../../../bus/pci/drivers/nvidia
-rw-r--r-- 1 root root      4096 Jul 27 12:34 driver_override
-rw-r--r-- 1 root root      4096 Jul 27 12:34 enable
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-10
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-5
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-6
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-7
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-8
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-9
lrwxrwxrwx 1 root root         0 Jul 27 12:34 iommu -> ../../0000:40:00.2/iommu/ivhd2
lrwxrwxrwx 1 root root         0 Jul 27 12:34 iommu_group -> ../../../../kernel/iommu_groups/44
-r--r--r-- 1 root root      4096 Jul 27 12:33 irq
drwxr-xr-x 2 root root         0 Jul 27 12:34 link
-r--r--r-- 1 root root      4096 Jul 27 12:34 local_cpulist
-r--r--r-- 1 root root      4096 Jul 27 12:34 local_cpus
-r--r--r-- 1 root root      4096 Jul 27 12:34 max_link_speed
-r--r--r-- 1 root root      4096 Jul 27 12:34 max_link_width
-r--r--r-- 1 root root      4096 Jul 27 12:34 modalias
-rw-r--r-- 1 root root      4096 Jul 27 12:34 msi_bus
drwxr-xr-x 2 root root         0 Jul 27 12:34 msi_irqs
-rw-r--r-- 1 root root      4096 Jul 27 12:34 numa_node
drwxr-xr-x 2 root root         0 Jul 27 12:34 power
-r--r--r-- 1 root root      4096 Jul 27 12:34 power_state
--w--w---- 1 root root      4096 Jul 27 12:34 remove
--w------- 1 root root      4096 Jul 27 12:34 rescan
--w------- 1 root root      4096 Jul 27 12:34 reset
-rw-r--r-- 1 root root      4096 Jul 27 12:34 reset_method
-r--r--r-- 1 root root      4096 Jul 27 12:33 resource
-rw------- 1 root root  16777216 Jul 27 12:34 resource0
-rw------- 1 root root 268435456 Jul 27 12:34 resource1
-rw------- 1 root root 268435456 Jul 27 12:34 resource1_wc
-rw------- 1 root root  33554432 Jul 27 12:34 resource3
-rw------- 1 root root  33554432 Jul 27 12:34 resource3_wc
-rw------- 1 root root       128 Jul 27 12:34 resource5
-r--r--r-- 1 root root      4096 Jul 27 12:33 revision
-rw------- 1 root root    524288 Jul 27 12:34 rom
lrwxrwxrwx 1 root root         0 Jul 27 12:31 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root      4096 Jul 27 12:33 subsystem_device
-r--r--r-- 1 root root      4096 Jul 27 12:33 subsystem_vendor
-rw-r--r-- 1 root root      4096 Jul 27 12:31 uevent
-r--r--r-- 1 root root      4096 Jul 27 12:31 vendor
g
if the vendor thinks it is the OS.. you can try it on a different OS, install the kvm drivers and run sriov-manage
I have a feeling it is neither of those two, but it will help rule them out
n
I'm getting closer. So now I have run .\displaymodeselector.exe --gpumode to disable the display even though it said it was already disabled. Scared myself after that firmware process because the MOBO didn't POST afterwards with the GPU installed. Found that I needed to enable "Above 4G Decoding" on my MOBO and now it POSTs. In Ubuntu now
Copy code
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
Runs and creates the devices. So now I am moving that card over to the Harvester node and I will proceed with testing there.
@great-bear-19718 success I was able to run the below and it created the devices. I re-enabled the pcidevices-controller and it is now finding those devices.
Copy code
nvidia-driver-runtime-mv584:/ # sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
nvidia-driver-runtime-mv584:/ # ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn0 -> ../0000:0b:00.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn1 -> ../0000:0b:00.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn10 -> ../0000:0b:01.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn11 -> ../0000:0b:01.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn12 -> ../0000:0b:02.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn13 -> ../0000:0b:02.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn14 -> ../0000:0b:02.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn15 -> ../0000:0b:02.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn16 -> ../0000:0b:02.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn17 -> ../0000:0b:02.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn18 -> ../0000:0b:02.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn19 -> ../0000:0b:02.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn2 -> ../0000:0b:00.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn20 -> ../0000:0b:03.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn21 -> ../0000:0b:03.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn22 -> ../0000:0b:03.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn23 -> ../0000:0b:03.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn24 -> ../0000:0b:03.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn25 -> ../0000:0b:03.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn26 -> ../0000:0b:03.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn27 -> ../0000:0b:03.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn28 -> ../0000:0b:04.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn29 -> ../0000:0b:04.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn3 -> ../0000:0b:00.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn30 -> ../0000:0b:04.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn31 -> ../0000:0b:04.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn4 -> ../0000:0b:01.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn5 -> ../0000:0b:01.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn6 -> ../0000:0b:01.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn7 -> ../0000:0b:01.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn8 -> ../0000:0b:01.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn9 -> ../0000:0b:01.5
g
woohoo
enjoy 😄
n
@great-bear-19718 I just did a fresh install and things are working ok for the setup. One question: after I enable the SR-IOV GPU and it creates the devices, how long before I see them in the vGPU Devices? They are already in PCI Devices..
g
they can take up to a minute or so..
i forgot, but pcidevices controller runs a rescan of the entire device tree every minute or so
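A sketch of watching for them from the CLI while waiting, assuming the usual pcidevices CRD plural names:
Copy code
kubectl get sriovgpudevices.devices.harvesterhci.io
kubectl get vgpudevices.devices.harvesterhci.io -w   # -w keeps watching and prints new devices as they appear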
n
So the default image was the issue. nvidia-smi still would not recognize the gpus when run within the pod. I had to go back to my image that creates the nvidia.conf, registry.gitlab.com/koat-public/koat-nvidia-driver-toolkit:latest, against the NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run driver
vGPU devices got populated shortly after switching back to my image
g
i will have to read up on the extra settings
i have never had to use them.. but i have only tested against the A2, A30 and A100s that I have available
n
Must be nice 🙂
Wish I had some a100's
I just set up my 1st cluster and am testing allocation of the vGPUs. I have 2 GPUs on different nodes. Can I not have one pool with a vGPU split of 4gb each and another pool with a vGPU split of 8gb each? Since they are different GPUs I was expecting that I could use one for heavier loads and one for lighter.
I'm not able to allocate the vGPUs to pools or to VMs.. Pools without vGPUs provision with no issues. When I add a vGPU to the pool I'm getting this error on the node being created.
Copy code
Failed creating server [fleet-default/gputest-gpu4-b6c0da6d-glzm7] of kind (HarvesterMachine) for machine gputest-gpu4-554fffb4fbxz9xx4-lvh65 in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Creating machine... Error creating machine: Error in driver during machine creation: Too many retries waiting for machine to be Running. Last error: Maximum number of retries (120) exceeded
When I add a basic VM and pass in a vGPU I just get the Pod hung with
Copy code
Virt-launcher pod has not yet been scheduled
Here is the virt-controller log showing the is not permitted error.
Copy code
{"component":"virt-controller","kind":"","level":"error","msg":"Updating the VirtualMachine status failed.","name":"testrocky","namespace":"koatprod","pos":"vm.go:387","reason":"Operation cannot be fulfilled on <http://virtualmachines.kubevirt.io|virtualmachines.kubevirt.io> \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827179Z","uid":"4ff39231-46d3-40bb-9b2a-ce9ce1d6f500"}
2024-08-01T15:51:44.827308116Z {"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine koatprod/testrocky","pos":"vm.go:281","reason":"Operation cannot be fulfilled on <http://virtualmachines.kubevirt.io|virtualmachines.kubevirt.io> \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827228Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.861151Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.931519Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.040344Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.234341Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.586475Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:46.250799Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:47.564281Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:50.160744Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:55.319318Z"}
{"component":"virt-controller","level":"info","msg":"Updating VMIs phase metrics","pos":"collector.go:245","timestamp":"2024-08-01T15:52:04.474730Z"}
{"component":"virt-controller","level":"info","msg":"phase map[{Phase:pending OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev2}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev7}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev8}:1]","pos":"collector.go:247","timestamp":"2024-08-01T15:52:04.474862Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:52:05.590622Z"}
{"component":"virt-controller","level":"info","msg":"TSC Freqency node update status: 0 updated, 8 skipped, 0 errors","pos":"nodetopologyupdater.go:44","timestamp":"2024-08-01T15:52:10.898343Z"}
g
a new support bundle would be nice.. ideally pcidevices controller patches the kubevirt crd with list of permitted host devices
i would need a bundle to check what happened there
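For reference, a sketch of inspecting that list directly; in Harvester the KubeVirt CR is normally named kubevirt in the harvester-system namespace, so treat those names as assumptions to confirm:
Copy code
# prints the mediated (vGPU) device names KubeVirt currently permits
kubectl -n harvester-system get kubevirts.kubevirt.io kubevirt \
  -o jsonpath='{.spec.configuration.permittedHostDevices.mediatedDevices}'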
n
Here is a support package. I created a fresh install. Steps: Created nodes 1, 2, 3 (non GPU). Created node 7 with GPU and enabled "NVIDIA RTX5000-Ada-8Q", 4 vGPU devices. Logged into Rancher to create a new cluster against harvester: 3 machines in the primary non-GPU pool, 1 machine in a 1-GPU pool. Image = ubuntu-20.04-minimal-cloudimg-amd64.img. Result: the 3 primary nodes have no issues and are Running. The 1 GPU node waits while the primary nodes boot: "[INFO ] configuring worker node(s) gputest-gpu1-c9f949cddxjdg96-nxhdx: creating server [fleet-default/gputest-gpu1-1df5240c-m69vl] of kind (HarvesterMachine) for machine gputest-gpu1-c9f949cddxjdg96-nxhdx in infrastructure provider, waiting for agent to check in and apply initial plan". 10 min later the 3 primary machines are online. Node 7 fails on the 11th min: "[INFO ] configuring worker node(s) gputest-gpu1-c9f949cddxjdg96-nxhdx: failed creating server [fleet-default/gputest-gpu1-1df5240c-m69vl] of kind (HarvesterMachine) for machine gputest-gpu1-c9f949cddxjdg96-nxhdx in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Creating machine... Error creating machine: Error in driver during machine creation: Too many retries waiting for machine to be Running. Last error: Maximum number of retries (120) exceeded, waiting for agent to check in and apply initial plan". Then it deletes and tries to create the node again. I will leave it like this overnight; let me know what you want me to try @great-bear-19718. Thanks
Another question is should I not be able to add multiple profiles to the VGPU of a pool?
g
no you cannot.. since we only allow 1 vgpu profile per machine
and pool serves as a template for machine
the actual provisioning error is from rancher
what is the version of rancher you are using?
n
2.8.5 @great-bear-19718 I am also having issues with just creating a VM with PCI vGPU passthrough, so it isn't just Rancher cluster VMs
ok so since my profile is a split of the GPU into 4 x 8gb I could deploy up to 4 pods from that profile.
here is a new bundle with the basic ubuntu vm and vgpu that won't start
g
so issue is GPU name
how did you launch the VM?
gpu name should be
nvidia.com/NVIDIA_RTX5000-ADA-8Q
the vm spec specifies..
nvidia.com/NVIDIA_RTX5000-Ada-8Q
which is not the same as it is case sensitive
n
I feel we are getting close to finding a final resolution here. I created the VM using the GUI; I did not write the config. That is the name within the profiles though.
Copy code
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # ls
nvidia-1028  nvidia-1029  nvidia-1030  nvidia-1031  nvidia-1032  nvidia-1033  nvidia-1034  nvidia-1035  nvidia-1036  nvidia-1037  nvidia-1038  nvidia-1039  nvidia-1040  nvidia-1041
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # cat nvidia-1033/name
NVIDIA RTX5000-Ada-8Q
Here is some more Queries of the devices.
Copy code
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/mdev/devices
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls
475c556b-4efb-4714-974c-e080715f8a6d  5bbbece9-d755-4e38-a34f-cae18bfeda0b  6f520761-e091-4a9d-b012-dab7ac476f4d  77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls -l
total 0
lrwxrwxrwx 1 root root 0 Aug  6 13:48 475c556b-4efb-4714-974c-e080715f8a6d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.5/475c556b-4efb-4714-974c-e080715f8a6d
lrwxrwxrwx 1 root root 0 Aug  6 13:48 5bbbece9-d755-4e38-a34f-cae18bfeda0b -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.6/5bbbece9-d755-4e38-a34f-cae18bfeda0b
lrwxrwxrwx 1 root root 0 Aug  6 13:48 6f520761-e091-4a9d-b012-dab7ac476f4d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.4/6f520761-e091-4a9d-b012-dab7ac476f4d
lrwxrwxrwx 1 root root 0 Aug  6 13:48 77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37 -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.7/77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # cd 475c556b-4efb-4714-974c-e080715f8a6d
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # ls
driver  iommu_group  mdev_type  nvidia  power  remove  subsystem  uevent
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # cd mdev_type
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # ls
available_instances  create  description  device_api  devices  name
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # cat name
NVIDIA RTX5000-Ada-8Q
g
i will check the logic.. why it is doing that.. but for now you just need to add the name as i mentioned
i think i know the reason.. most of the A series GPU's have always returned the profile in upper case.. the plugin always uses upper case and that is causing the issue.. with a mismatch
i can fix this and give you a dev build with the fixes
and i will try and get them into next release
are you able to please update the pcidevices controller ds to use this dev image..
Copy code
gmehta3/pcidevices:vgpu-type-fix
once the image is rolled out, you just need to disable the vgpudevices / sriovgpudevice and re-enable the gpu, and new vgpudevices with correct naming will be created
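A sketch of that image swap from the CLI; the daemonset and container names below are assumptions, so confirm them with the first command:
Copy code
kubectl -n harvester-system get ds
# assuming the daemonset and its container are both named pcidevices-controller
kubectl -n harvester-system set image daemonset/pcidevices-controller \
  pcidevices-controller=gmehta3/pcidevices:vgpu-type-fix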
n
@great-bear-19718 I do see that the profiles are now in fact created in all caps, as you can see in the screenshot. Doesn't look like there has been a change in the outcome though. Via Cluster Manager, adding a machine with a vGPU, I get an error.
Copy code
"Failed creating server [fleet-default/gputest-gpu-6eb49156-dvm6n] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-rq7q6 in infrastructure provider: CreateError: Downloading driver from <https://k8s.koat.ai/assets/docker-machine-driver-harvester> Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get <http://settings.harvesterhci.io|settings.harvesterhci.io> server-version)""
AND
"Failed deleting server [fleet-default/gputest-gpu-6eb49156-bj4p5] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-4zbjp in infrastructure provider: DeleteError: Downloading driver from <https://k8s.koat.ai/assets/docker-machine-driver-harvester> Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped error loading host gputest-gpu-6eb49156-bj4p5: Docker machine "gputest-gpu-6eb49156-bj4p5" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one."
Trying to use the vGPU in a VM it sits at.
Copy code
Virt-launcher pod has not yet been scheduled
The YAML for the VM still uses the lowercase name; is that intended?
Copy code
gpus:
            - deviceName: nvidia.com/NVIDIA_RTX5000-Ada-8Q
              name: harvesterdev7-00000b007
I have also tried with a manual update to the yaml config
Copy code
gpus:
            - deviceName: nvidia.com/NVIDIA_RTX5000-ADA-8Q
              name: harvesterdev7-00000b007
It looks like it got further; see the logs here: https://pastebin.com/raw/wg1Ktrpe. Here is another support package. I initially tried following your instructions to disable the vgpudevices / sriovgpudevice and re-enable the gpu so new vgpudevices get created; then I tried a complete disable of the addon and re-enable of everything.
@great-bear-19718 hope you had a great weekend. Looking forward to picking this up this week and getting things working.