# harvester
g
the driver is loaded by the toolkit pod in harvester-system namespace
it will be scheduled to nodes which have the GPUs and will load the driver and nvidia utils.. you can exec into the pod and check the
nvidia-smi
output
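For reference, a minimal sketch of that check, assuming the pods carry the nvidia-driver-runtime name seen later in this thread (replace the pod suffix with the real one):
Copy code
# find the toolkit pod running on the GPU node, then exec into it
kubectl -n harvester-system get pods -o wide | grep nvidia-driver-runtime
kubectl -n harvester-system exec -it nvidia-driver-runtime-xxxxx -- nvidia-smi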
n
Thx for responding @great-bear-19718 I haven't got the devices.harvesterhci.io.sriovgpudevice showing yet after the nvidia-driver-runtime runs
g
the pcidevices controller creates those.. i assume pcidevices controller is enabled?
n
Yes pcidevices controller is enabled. But I have not enabled passthrough on the PCI Devices yet for the 2 GPU's
g
you do not need to enable passthrough for GPUs
that is only for passing a full GPU through to a VM
n
Ok good to know
g
a support bundle would be nice to figure out what is going on.. a5000 is based on ampere arch so should be supported
n
So I guess the last thing to figure out is how to get them to show in SR-IOV GPU Devices
No these are the A5000 ADA
g
pcidevices scans the /sys tree
n
Lovelace
g
i will need to check.. likely then
do they support sriov vgpu or mig based vgpu only?
n
sriov vgpu
g
there is a check in pcidevices to look for the same and skip gpu's which do not support sriov vgpu
n
The MOBO has SR-IOV and IOMMU enabled
g
i would need to see a support bundle in that case
n
Generating now
g
nvidia's documentation is confusing at best.. it says vgpu.. but then i know vgpu is via mig and sriov based mechanisms
n
ya I can't afford the mig cards they were a bit too much $$,$$$
I did call PNY and got confirmation from them today that these do have SR-IOV support
Also I am using the latest kvm.run driver 17.3
The PCI devices that show for these cards are labeled as description:VGA compatible controller: NVIDIA Corporation AD102GL [RTX 5000 Ada Generation]
g
Copy code
nvidia-driver-runtime             0         0         0       0            0           sriovgpu.harvesterhci.io/driver-needed=true   10d
driver is not loaded because none of the nodes have been labelled by pcidevices controller
which of your 2 nodes contain the gpu?
i see 8 nodes in the cluster
n
harvesterdev7 and harvesterdev8
g
so easy way.. is to disable pcidevices addon..
label the nodes 7 & 8 with extra label..
sriovgpu.harvesterhci.io/driver-needed=true
that will force deployment of driver
and we can check what nvidia-smi returns about the gpu
pcidevices checks existence of file in the /sys tree under the gpu pci address which i am trying to find
n
I've done that with the label sriovgpu.harvesterhci.io/driver-needed=true and things do install.
Copy code
Post-install sanity check passed.
2024-07-22T22:57:42.791341063Z 
2024-07-22T22:57:42.791359988Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.791374576Z 
running nvidia vgpud
2024-07-22T22:57:42.898278808Z Creating '/dev/char' directory
g
ok.. can you exec into the pod on node 7 then?
n
nvidia-smi when I shell into the node does not run
g
it needs to be in the pod
n
Ok
g
it is not installed on the host
based on your support bundle no pod for nvidia driver is running
because pcidevices controller will remove the label from node if gpu is not compatible
which is why i asked to disable pcidevices controller addon
sriov_vf_device
is the file we check
existence of this file is used to identify if GPU is sriov vgpu capable
/sys/bus/pci/devices/ADDRESS/
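A quick host-side version of that check, as a sketch; ADDRESS is the GPU's PCI address from lspci (e.g. 0000:0b:00.0 later in this thread):
Copy code
# run on the GPU node; sriov_vf_device only exists when the device advertises SR-IOV vGPU support
ADDRESS=0000:0b:00.0
ls /sys/bus/pci/devices/$ADDRESS/ | grep sriov || echo "no sriov_* entries for $ADDRESS"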
n
correct, there is no pod; after the daemonset runs the nvidia-driver-runtime install, the pod deletes itself
g
pod should not be deleted
the pod is needed to manage vgpus subsequently
pcidevices controller removes the label from node.. and that will cause k8s to remove pods since the scheduling criteria is not met
n
it deletes itself after the install
Copy code
Post-install sanity check passed.
2024-07-22T22:57:42.825717826Z 
2024-07-22T22:57:42.825728506Z Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.90.05) is now complete.
2024-07-22T22:57:42.825737192Z 
running nvidia vgpud
2024-07-22T22:57:42.924907965Z Creating '/dev/char' directory
g
it should not.. it just waits
i would check the label on the nodes if they are still there
the pod runs the
vgpud
daemon and waits for subsequent requests from pcidevices controller when you configure vgpus
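A sketch of checking whether the label survived, using the node names from above:
Copy code
# shows the label value per node (an empty column means the label was removed)
kubectl get nodes harvesterdev7 harvesterdev8 -L sriovgpu.harvesterhci.io/driver-needed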
n
No that label no longer exists on the node 7 or 8
to force the install I have it in my config.yaml for the PXE install. labels: sriovgpu.harvesterhci.io/driver-needed: true
just for node 7 and 8
g
pcidevices will reconcile and remove that label since it manages it
which is why i asked for pcidevices to be disabled
and then the label added for debugging
n
they are disabled.
I can add the label again so it tries a reinstall no problem
g
the addon itself needs to be disabled
n
Ahhh
g
pcidevices addon runs the pcidevices controller
that manages the lifecycle of this label on nodes
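For reference, a sketch of toggling that addon from the CLI; the exact addon name and namespace are assumptions here, so confirm them with the first command:
Copy code
kubectl get addons.harvesterhci.io -A
# assuming the addon is named pcidevices-controller in the harvester-system namespace
kubectl -n harvester-system patch addons.harvesterhci.io pcidevices-controller \
  --type=merge -p '{"spec":{"enabled":false}}'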
n
K I disabled pcidevices-controller and re-added the labels. kubectl label nodes harvesterdev7 sriovgpu.harvesterhci.io/driver-needed=true kubectl label nodes harvesterdev8 sriovgpu.harvesterhci.io/driver-needed=true
g
once the pods are running you can exec into them and run
nvidia-smi
and see what the gpu says
sriov-manage -e ALL
will enable sriov gpus if they are supported and will tell us what we need to know
n
No Devices Found.
g
well that could be it..
lspci
?
does it show the gpu?
n
Copy code
0b:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
0b:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
g
wrong driver version?
if you shell to the node and check
/sys/bus/pci/devices/0000:0b:00.0
do you see the sriov_vf_device file?
n
I grabbed them from the nvidia licensing server and used NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run
using rancher/harvester-nvidia-driver-toolkit:v1.3-20240613
g
that is fine.. if the issue was the toolkit, the driver install would have failed
n
I'm going to download 16.7 driver and test
same issue with the 16.7
g
that is strange.. i would have thought
nvidia-smi -q
should see the gpus
n
went back to 13.12 and that driver errored on trying to install and noted it was incompatible. So I reinstalled 17.3
there is also an Ubuntu KVM version of the drivers, should I try that?
g
i dont think that is the one
the driver has to be the nvidia kvm generic one i think
n
I just downloaded the vgpuDriverCatalog and confirmed that 17.3 is the right driver
g
another support bundle now would help
or from the node 6 output of
dmesg
from the host
it may shed some light on why the gpu may not be visible
n
g
did you enable a pci passthrough for the gpus?
ls -lart /sys/bus/pci/devices/0000:0b:00.0/
would be nice
Copy code
[ 4696.401058] NVRM: RmFetchGspRmImages: No firmware image found
[ 4696.401066] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x61:0x56:1697)
[ 4696.401618] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 4696.404266] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
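Error -2 is ENOENT, so a sketch of a follow-up check is whether the GSP firmware files exist where the kernel looks for them:
Copy code
# run on the GPU node; the kernel's firmware loader searches /lib/firmware for nvidia/<version>/gsp_*.bin
ls -l /lib/firmware/nvidia/550.90.05/ 2>/dev/null || echo "no GSP firmware directory for this driver version"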
n
did you enable a pci passthrough for the gpus? No pcidevices-controller is still disabled.
Copy code
nvidia-driver-runtime-jdv44:/ # ls -lart /sys/bus/pci/devices/0000:0b:00.0/
total 0
-r--r--r--  1 root root      4096 Jul 22 22:54 vendor
-rw-r--r--  1 root root      4096 Jul 22 22:54 uevent
-r--r--r--  1 root root      4096 Jul 22 22:54 subsystem_device
lrwxrwxrwx  1 root root         0 Jul 22 22:54 subsystem -> ../../../../bus/pci
-rw-------  1 root root    524288 Jul 22 22:54 rom
-r--r--r--  1 root root      4096 Jul 22 22:54 revision
-rw-------  1 root root       128 Jul 22 22:54 resource5
-rw-------  1 root root  33554432 Jul 22 22:54 resource3_wc
-rw-------  1 root root  33554432 Jul 22 22:54 resource3
-rw-------  1 root root 268435456 Jul 22 22:54 resource1_wc
-rw-------  1 root root 268435456 Jul 22 22:54 resource1
-rw-------  1 root root  16777216 Jul 22 22:54 resource0
-r--r--r--  1 root root      4096 Jul 22 22:54 resource
-rw-r--r--  1 root root      4096 Jul 22 22:54 reset_method
--w-------  1 root root      4096 Jul 22 22:54 reset
--w-------  1 root root      4096 Jul 22 22:54 rescan
--w--w----  1 root root      4096 Jul 22 22:54 remove
-r--r--r--  1 root root      4096 Jul 22 22:54 power_state
drwxr-xr-x  2 root root         0 Jul 22 22:54 power
-rw-r--r--  1 root root      4096 Jul 22 22:54 numa_node
-rw-r--r--  1 root root      4096 Jul 22 22:54 msi_bus
-r--r--r--  1 root root      4096 Jul 22 22:54 modalias
-r--r--r--  1 root root      4096 Jul 22 22:54 max_link_width
-r--r--r--  1 root root      4096 Jul 22 22:54 max_link_speed
-r--r--r--  1 root root      4096 Jul 22 22:54 local_cpus
-r--r--r--  1 root root      4096 Jul 22 22:54 local_cpulist
drwxr-xr-x  2 root root         0 Jul 22 22:54 link
-r--r--r--  1 root root      4096 Jul 22 22:54 irq
lrwxrwxrwx  1 root root         0 Jul 22 22:54 iommu_group -> ../../../../kernel/iommu_groups/19
lrwxrwxrwx  1 root root         0 Jul 22 22:54 iommu -> ../../0000:00:00.2/iommu/ivhd0
-rw-r--r--  1 root root      4096 Jul 22 22:54 enable
-rw-r--r--  1 root root      4096 Jul 22 22:54 driver_override
-r--r--r--  1 root root      4096 Jul 22 22:54 dma_mask_bits
-r--r--r--  1 root root      4096 Jul 22 22:54 device
-rw-r--r--  1 root root      4096 Jul 22 22:54 d3cold_allowed
-r--r--r--  1 root root      4096 Jul 22 22:54 current_link_width
-r--r--r--  1 root root      4096 Jul 22 22:54 current_link_speed
lrwxrwxrwx  1 root root         0 Jul 22 22:54 consumer:pci:0000:0b:00.1 -> ../../../virtual/devlink/pci:0000:0b:00.0--pci:0000:0b:00.1
-r--r--r--  1 root root      4096 Jul 22 22:54 consistent_dma_mask_bits
-rw-r--r--  1 root root      4096 Jul 22 22:54 config
-r--r--r--  1 root root      4096 Jul 22 22:54 class
-rw-r--r--  1 root root      4096 Jul 22 22:54 broken_parity_status
-r--r--r--  1 root root      4096 Jul 22 22:54 boot_vga
-r--r--r--  1 root root      4096 Jul 22 22:54 ari_enabled
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_nonfatal
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_fatal
-r--r--r--  1 root root      4096 Jul 22 22:54 aer_dev_correctable
drwxr-xr-x 13 root root         0 Jul 22 22:54 ..
drwxr-xr-x  4 root root         0 Jul 22 22:54 .
-r--r--r--  1 root root      4096 Jul 22 22:57 subsystem_vendor
lrwxrwxrwx  1 root root         0 Jul 23 00:44 driver -> ../../../../bus/pci/drivers/nvidia
g
the sriov file is not there, which is why the gpu is not detected to begin with
can you reboot the node and see if it helps?
n
Kk
g
n
If it doesn't work after a restart lets come back to this another time with fresh eyes. I'll do some poking around on forums as well. I appreciate all the work you have already put into this and you have provided me some valuable checks to make along the way
Well the MOBO I have has a VGA out so I won't have any issues with switching the mode if needed
g
i think i had seen an issue in the past with a GPU when a VGA device was connected to the GPU
n
I have to create an account and ask for the mode selector tool to change the card to not use the video out
the restart didn't change anything with the card; nvidia-smi still results in No Devices were found
I am reaching out to my Nvidia rep on this to see if I can get some tech support from them on this as well. They have been pretty responsive.
Thx for your help tonight. I will see what comes back from Nvidia and this Mode selector tool
I've posted in the Forums and my Nvidia rep said they would respond there. https://forums.developer.nvidia.com/t/getting-vgpu-working-on-rtx-5000-ada/300866 Do you have any other ideas since yesterday?
I am going to do a full cluster reinstall and ensure I don't have the pcidevices controller enabled from the start while I wait.
I'll run with these addons
Copy code
addons:
    harvester_vm_import_controller:
      enabled: false
      values_content: ""
    harvester_pcidevices_controller:
      enabled: false
      values_content: ""
    rancher_monitoring:
      enabled: false
      values_content: ""
    rancher_logging:
      enabled: false
      values_content: ""
    harvester_seeder:
      enabled: false
      values_content: ""
    nvidia_driver_toolkit:
      enabled: true
      values_content: ""
I did a fresh install of harvester, this time only adding 2 nodes: node 1 (the default) and node 7 (a gpu node). Commands in the pod: https://pastebin.com/raw/TjBfDVq4
g
Copy code
[  183.294677] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.298169] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.308629] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  183.311716] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
[  225.816298] nvidia 0000:0b:00.0: Direct firmware load for nvidia/550.90.05/gsp_ga10x.bin failed with error -2
has to be this..
i do not have this specific GPU in our lab else would have already tried it out
n
@great-bear-19718 I am seeing someone post about running the install with sudo ./NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run -m=kernel-open, then sudo update-initramfs -u and sudo reboot. Is there a way I can manually install the file within the nvidia-driver-runtime pod?
I figured it out: sudo /tmp/NVIDIA.run -m=kernel-open after uninstalling the driver that gets installed on pod load. This still didn't fix the kernel issue though
So there are some special instructions around the kernel modules that I found with the base Linux driver install, specifically for SUSE: https://www.nvidia.com/download/driverResults.aspx/228542/en-us/ https://en.opensuse.org/SDB:NVIDIA_drivers
g
i do not think this is needed.. i have seen another issue from an end user about A5000 vgpu passthrough working but then they have a windows guest
n
Ya the difference is the RTX A5000 is Ampere and the RTX 5000 ADA is Lovelace. Still just fighting to get nvidia-smi to recognize the GPU. I have installed the GPU into a Windows machine with the base driver and it does work, so I am not dealing with a hardware issue. Still waiting to hear from someone at nvidia
@great-bear-19718 I managed to get a copy of the nvidia-bug-report.log.gz if that helps diagnose these issues.
@great-bear-19718 I think I am ready to come back to you on this now. So I managed to get nvidia-smi to respond but I needed to add some options to the nvidia.conf
Copy code
sudo bash -c 'echo "options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'
So I guess the question becomes how do I get this nvidia.conf file injected into the Pod before the Driver install.
How I get it working after the pod is up is I exec into the pod and run through these commands.
Copy code
sudo /usr/bin/nvidia-uninstall

sudo rm -rf /etc/modprobe.d/nvidia.conf
sudo rm -rf /etc/dracut.conf.d/nvidia.conf
sudo find /lib/modules/$(uname -r) -name 'nvidia*' -exec rm -rf {} +
sudo dracut --force

sudo bash -c 'echo "options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_EnableGpuFirmware=0
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" > /etc/modprobe.d/nvidia.conf'

sudo /tmp/NVIDIA.run

nvidia-smi
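As a sketch, assuming those NVreg parameters are exported by the loaded module, you can verify they actually took effect after a reinstall like the above:
Copy code
# confirm the nvidia module picked up the options from /etc/modprobe.d/nvidia.conf
cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpus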
The next step is how do I get the vGPUs to show up in Harvester
Ok one problem down. I created a custom image with my nvidia.conf added and it installed and works: registry.gitlab.com/koat-public/koat-nvidia-driver-toolkit. Now I just need to get the vGPUs going, which it would be great to have your help again on.
You will see that the card does have
Copy code
vGPU Device Capability
        Fractional Multi-vGPU             : Supported
        Heterogeneous Time-Slice Profiles : Supported
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
https://pastebin.com/raw/wuVMyCCR
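For anyone following along, a sketch of how such a custom toolkit image can be built; the registry/tag below are placeholders, not the actual koat image:
Copy code
cat > Dockerfile <<'EOF'
FROM rancher/harvester-nvidia-driver-toolkit:v1.3-20240613
# bake the modprobe options in so they are present before the driver install runs
RUN printf 'options nvidia NVreg_EnableGpuFirmware=0\noptions nvidia NVreg_OpenRmEnableUnsupportedGpus=1\n' > /etc/modprobe.d/nvidia.conf
EOF
docker build -t registry.example.com/custom-nvidia-driver-toolkit:latest .
docker push registry.example.com/custom-nvidia-driver-toolkit:latest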
g
does the GPU now show up in harvester UI?
also can you please try
Copy code
/usr/local/nvidia/sriov-manage -e 00000000:41:00.0
and see what happens
n
@great-bear-19718 that script is in /usr/lib/nvidia, not /usr/local/nvidia, and when I run it nothing happens.
Copy code
nvidia-driver-runtime-9dz9v:/usr/local # /usr/lib/nvidia/sriov-manage -e 0000:41:00.0
nvidia-driver-runtime-9dz9v:/usr/local # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
Is there a log to look at to see why they are not being created? Also, I still have the pcidevices-controller addon disabled so I cannot see anything in the UI yet. I tried enabling it, but when I do that it deletes the nvidia-driver-runtime pods that are created from the DaemonSets. What does the pcidevices-controller look for to determine if the pod needs to be created or deleted?
Copy code
nvidia-driver-runtime-9dz9v:/ # lsmod | grep vfio
nvidia_vgpu_vfio       69632  0
mdev                   28672  1 nvidia_vgpu_vfio
vfio_iommu_type1       40960  0
vfio                   45056  3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
kvm                  1056768  2 kvm_amd,nvidia_vgpu_vfio
irqbypass              16384  2 nvidia_vgpu_vfio,kvm
nvidia-driver-runtime-9dz9v:/ # modinfo nvidia_vgpu_vfio
filename:       /lib/modules/5.14.21-150400.24.119-default/kernel/drivers/video/nvidia-vgpu-vfio.ko
softdep:        pre: nvidia
import_ns:      IOMMUFD
version:        535.183.04
supported:      external
license:        Dual MIT/GPL
suserelease:    SLE15-SP4
srcversion:     81DFBEABED825A6138217B0
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        vfio,mdev,irqbypass,kvm
retpoline:      Y
name:           nvidia_vgpu_vfio
vermagic:       5.14.21-150400.24.119-default SMP preempt mod_unload modversions
g
yeah that is likely because of the file used to identify if GPU can do sriov vgpus
when you did the
sriov-manage -e pciaddress
do you see vf's created in /sys/bus/pci/devices/pciaddress ?
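Alongside grepping for virtfn links, the generic SR-IOV counters under the same path are a useful sketch of a check (same PCI address as above):
Copy code
# totalvfs > 0 means the PF supports SR-IOV; numvfs shows how many VFs are currently enabled
cat /sys/bus/pci/devices/0000:41:00.0/sriov_totalvfs 2>/dev/null
cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs 2>/dev/null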
n
No. nvidia-driver-runtime-9dz9v:/usr/local # /usr/lib/nvidia/sriov-manage -e 00004100.0 gives no error, and nvidia-driver-runtime-9dz9v:/usr/local # ls -l /sys/bus/pci/devices/00004100.0/ | grep virtfn doesn't show any new devices created. There must be an error somewhere or a log entry. Would you know where to look for either an error or a confirmation after running /usr/lib/nvidia/sriov-manage -e 00004100.0? I am also on with PNY, the manufacturer of the card, and they are saying it must be some OS issue. They instructed me to turn on the ECC memory of the card as a long shot, but something to try.
Copy code
nvidia-driver-runtime-9dz9v:/ # lspci | grep NVIDIA
41:00.0 VGA compatible controller: NVIDIA Corporation Device 26b2 (rev a1)
41:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
nvidia-driver-runtime-9dz9v:/ # sudo /usr/lib/nvidia/sriov-manage -e 00:41:0000.0
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
nvidia-driver-runtime-9dz9v:/ #
Copy code
nvidia-driver-runtime-9dz9v:/ # ls -l /sys/bus/pci/devices/0000:41:00.0/              
total 0
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_correctable
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_fatal
-r--r--r-- 1 root root      4096 Jul 27 12:34 aer_dev_nonfatal
-r--r--r-- 1 root root      4096 Jul 27 12:34 ari_enabled
-r--r--r-- 1 root root      4096 Jul 27 12:34 boot_vga
-rw-r--r-- 1 root root      4096 Jul 27 12:34 broken_parity_status
-r--r--r-- 1 root root      4096 Jul 27 12:33 class
-rw-r--r-- 1 root root      4096 Jul 27 12:33 config
-r--r--r-- 1 root root      4096 Jul 27 12:34 consistent_dma_mask_bits
lrwxrwxrwx 1 root root         0 Jul 27 12:34 consumer:pci:0000:41:00.1 -> ../../../virtual/devlink/pci:0000:41:00.0--pci:0000:41:00.1
-r--r--r-- 1 root root      4096 Jul 27 12:34 current_link_speed
-r--r--r-- 1 root root      4096 Jul 27 12:34 current_link_width
-rw-r--r-- 1 root root      4096 Jul 27 12:34 d3cold_allowed
-r--r--r-- 1 root root      4096 Jul 27 12:33 device
-r--r--r-- 1 root root      4096 Jul 27 12:34 dma_mask_bits
lrwxrwxrwx 1 root root         0 Jul 27 12:34 driver -> ../../../../bus/pci/drivers/nvidia
-rw-r--r-- 1 root root      4096 Jul 27 12:34 driver_override
-rw-r--r-- 1 root root      4096 Jul 27 12:34 enable
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-10
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-5
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-6
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-7
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-8
drwxr-xr-x 3 root root         0 Jul 27 12:33 i2c-9
lrwxrwxrwx 1 root root         0 Jul 27 12:34 iommu -> ../../0000:40:00.2/iommu/ivhd2
lrwxrwxrwx 1 root root         0 Jul 27 12:34 iommu_group -> ../../../../kernel/iommu_groups/44
-r--r--r-- 1 root root      4096 Jul 27 12:33 irq
drwxr-xr-x 2 root root         0 Jul 27 12:34 link
-r--r--r-- 1 root root      4096 Jul 27 12:34 local_cpulist
-r--r--r-- 1 root root      4096 Jul 27 12:34 local_cpus
-r--r--r-- 1 root root      4096 Jul 27 12:34 max_link_speed
-r--r--r-- 1 root root      4096 Jul 27 12:34 max_link_width
-r--r--r-- 1 root root      4096 Jul 27 12:34 modalias
-rw-r--r-- 1 root root      4096 Jul 27 12:34 msi_bus
drwxr-xr-x 2 root root         0 Jul 27 12:34 msi_irqs
-rw-r--r-- 1 root root      4096 Jul 27 12:34 numa_node
drwxr-xr-x 2 root root         0 Jul 27 12:34 power
-r--r--r-- 1 root root      4096 Jul 27 12:34 power_state
--w--w---- 1 root root      4096 Jul 27 12:34 remove
--w------- 1 root root      4096 Jul 27 12:34 rescan
--w------- 1 root root      4096 Jul 27 12:34 reset
-rw-r--r-- 1 root root      4096 Jul 27 12:34 reset_method
-r--r--r-- 1 root root      4096 Jul 27 12:33 resource
-rw------- 1 root root  16777216 Jul 27 12:34 resource0
-rw------- 1 root root 268435456 Jul 27 12:34 resource1
-rw------- 1 root root 268435456 Jul 27 12:34 resource1_wc
-rw------- 1 root root  33554432 Jul 27 12:34 resource3
-rw------- 1 root root  33554432 Jul 27 12:34 resource3_wc
-rw------- 1 root root       128 Jul 27 12:34 resource5
-r--r--r-- 1 root root      4096 Jul 27 12:33 revision
-rw------- 1 root root    524288 Jul 27 12:34 rom
lrwxrwxrwx 1 root root         0 Jul 27 12:31 subsystem -> ../../../../bus/pci
-r--r--r-- 1 root root      4096 Jul 27 12:33 subsystem_device
-r--r--r-- 1 root root      4096 Jul 27 12:33 subsystem_vendor
-rw-r--r-- 1 root root      4096 Jul 27 12:31 uevent
-r--r--r-- 1 root root      4096 Jul 27 12:31 vendor
g
if the vendor thinks it is the OS.. you can try it on a different OS, install the kvm drivers and run sriov-manage
I have a feeling it is neither of those two, but it will help rule them out
n
I'm getting closer. So now I have run .\displaymodeselector.exe --gpumode to disable the display even though it said it was already disabled. Scared myself after that firmware process because the MOBO didn't POST afterwards with the GPU installed. Found that I needed to enable "Above 4G Decoding" on my MOBO and now it POSTs. In Ubuntu now
Copy code
sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
Runs and creates the devices. So now I am moving that card over to the Harvester node and I will proceed with testing there.
@great-bear-19718 success I was able to run the below and it created the devices. I re-enabled the pcidevices-controller and it is now finding those devices.
Copy code
nvidia-driver-runtime-mv584:/ # sudo /usr/lib/nvidia/sriov-manage -e 0000:0b:00.0
Enabling VFs on 0000:0b:00.0
nvidia-driver-runtime-mv584:/ # ls -l /sys/bus/pci/devices/0000:0b:00.0/ | grep virtfn
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn0 -> ../0000:0b:00.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn1 -> ../0000:0b:00.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn10 -> ../0000:0b:01.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn11 -> ../0000:0b:01.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn12 -> ../0000:0b:02.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn13 -> ../0000:0b:02.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn14 -> ../0000:0b:02.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn15 -> ../0000:0b:02.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn16 -> ../0000:0b:02.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn17 -> ../0000:0b:02.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn18 -> ../0000:0b:02.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn19 -> ../0000:0b:02.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn2 -> ../0000:0b:00.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn20 -> ../0000:0b:03.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn21 -> ../0000:0b:03.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn22 -> ../0000:0b:03.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn23 -> ../0000:0b:03.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn24 -> ../0000:0b:03.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn25 -> ../0000:0b:03.5
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn26 -> ../0000:0b:03.6
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn27 -> ../0000:0b:03.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn28 -> ../0000:0b:04.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn29 -> ../0000:0b:04.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn3 -> ../0000:0b:00.7
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn30 -> ../0000:0b:04.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn31 -> ../0000:0b:04.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn4 -> ../0000:0b:01.0
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn5 -> ../0000:0b:01.1
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn6 -> ../0000:0b:01.2
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn7 -> ../0000:0b:01.3
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn8 -> ../0000:0b:01.4
lrwxrwxrwx 1 root root           0 Jul 31 16:26 virtfn9 -> ../0000:0b:01.5
g
woohoo
enjoy 😄
n
@great-bear-19718 I just did a fresh install and things are working ok for the setup. One question: after I enable the SR-IOV GPU and it creates the devices, how long before I see them in the vGPU Devices? They are already in PCI Devices..
g
they can take up to a minute or so..
i forgot, but pcidevices controller runs a rescan of the entire device tree every minute or so
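A sketch of watching for them from the CLI while waiting, assuming the usual pcidevices CRD plural names:
Copy code
kubectl get sriovgpudevices.devices.harvesterhci.io
kubectl get vgpudevices.devices.harvesterhci.io -w   # -w keeps watching and prints new devices as they appear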
n
So the default image was the issue. nvidia-smi still would not recognize the gpus when run within the pod. I had to go back to my image that creates the nvidia.conf, registry.gitlab.com/koat-public/koat-nvidia-driver-toolkit:latest, against the NVIDIA-Linux-x86_64-535.183.04-vgpu-kvm.run driver
vGPU devices got populated shortly after switching back to my image
g
i will have to read up on the extra settings
i have never had to use them.. but i have only tested against the A2, A30 and A100s that I have available
n
Must be nice 🙂
Wish I had some a100's
I just set up my 1st cluster and am testing allocation of the vGPUs. I have 2 GPUs on different nodes. Can I not have one pool with a vGPU split of 4gb each and another pool with a vGPU split of 8gb each? Since they are different GPUs I was expecting that I could use one for heavier loads and one for lighter.
I'm not able to allocate the vGPUs to pools or to VMs.. Pools without vGPUs provision with no issues. When I add a vGPU to the pool I'm getting this error on the node being created.
Copy code
Failed creating server [fleet-default/gputest-gpu4-b6c0da6d-glzm7] of kind (HarvesterMachine) for machine gputest-gpu4-554fffb4fbxz9xx4-lvh65 in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Creating machine... Error creating machine: Error in driver during machine creation: Too many retries waiting for machine to be Running. Last error: Maximum number of retries (120) exceeded
When I add a basic VM and pass in a vGPU I just get the Pod hung with
Copy code
Virt-launcher pod has not yet been scheduled
Here is the virt-controller log showing the is not permitted error.
Copy code
{"component":"virt-controller","kind":"","level":"error","msg":"Updating the VirtualMachine status failed.","name":"testrocky","namespace":"koatprod","pos":"vm.go:387","reason":"Operation cannot be fulfilled on <http://virtualmachines.kubevirt.io|virtualmachines.kubevirt.io> \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827179Z","uid":"4ff39231-46d3-40bb-9b2a-ce9ce1d6f500"}
2024-08-01T15:51:44.827308116Z {"component":"virt-controller","level":"info","msg":"re-enqueuing VirtualMachine koatprod/testrocky","pos":"vm.go:281","reason":"Operation cannot be fulfilled on <http://virtualmachines.kubevirt.io|virtualmachines.kubevirt.io> \"testrocky\": the object has been modified; please apply your changes to the latest version and try again","timestamp":"2024-08-01T15:51:44.827228Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.861151Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:44.931519Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.040344Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.234341Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:45.586475Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:46.250799Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:47.564281Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:50.160744Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:51:55.319318Z"}
{"component":"virt-controller","level":"info","msg":"Updating VMIs phase metrics","pos":"collector.go:245","timestamp":"2024-08-01T15:52:04.474730Z"}
{"component":"virt-controller","level":"info","msg":"phase map[{Phase:pending OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev2}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev7}:1 {Phase:running OS:<none> Workload:<none> Flavor:<none> InstanceType:<none> Preference:<none> NodeName:harvesterdev8}:1]","pos":"collector.go:247","timestamp":"2024-08-01T15:52:04.474862Z"}
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance koatprod/testrocky","pos":"vmi.go:322","reason":"failed to render launch manifest: GPU <http://nvidia.com/NVIDIA_RTX5000-Ada-4Q|nvidia.com/NVIDIA_RTX5000-Ada-4Q> is not permitted in permittedHostDevices configuration","timestamp":"2024-08-01T15:52:05.590622Z"}
{"component":"virt-controller","level":"info","msg":"TSC Freqency node update status: 0 updated, 8 skipped, 0 errors","pos":"nodetopologyupdater.go:44","timestamp":"2024-08-01T15:52:10.898343Z"}
g
a new support bundle would be nice.. ideally pcidevices controller patches the kubevirt crd with list of permitted host devices
i would need a bundle to check what happened there
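For reference, a sketch of inspecting that list directly; in Harvester the KubeVirt CR is normally named kubevirt in the harvester-system namespace, so treat those names as assumptions to confirm:
Copy code
# prints the mediated (vGPU) device names KubeVirt currently permits
kubectl -n harvester-system get kubevirts.kubevirt.io kubevirt \
  -o jsonpath='{.spec.configuration.permittedHostDevices.mediatedDevices}'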
n
Here is a support package. I created a fresh install. Steps: Created nodes 1, 2, 3 (non GPU). Created node 7 with GPU and enabled "NVIDIA RTX5000-Ada-8Q", 4 vGPU devices. Logged into Rancher to create a new cluster against harvester: 3 machines in the primary non-GPU pool, 1 machine in a 1-GPU pool. Image = ubuntu-20.04-minimal-cloudimg-amd64.img. Result: the 3 primary nodes have no issues and are Running. The 1 GPU node waits while the primary nodes boot: "[INFO ] configuring worker node(s) gputest-gpu1-c9f949cddxjdg96-nxhdx: creating server [fleet-default/gputest-gpu1-1df5240c-m69vl] of kind (HarvesterMachine) for machine gputest-gpu1-c9f949cddxjdg96-nxhdx in infrastructure provider, waiting for agent to check in and apply initial plan". 10 min later the 3 primary machines are online. Node 7 fails on the 11th min: "[INFO ] configuring worker node(s) gputest-gpu1-c9f949cddxjdg96-nxhdx: failed creating server [fleet-default/gputest-gpu1-1df5240c-m69vl] of kind (HarvesterMachine) for machine gputest-gpu1-c9f949cddxjdg96-nxhdx in infrastructure provider: CreateError: Downloading driver from https://k8s.koat.ai/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Creating machine... Error creating machine: Error in driver during machine creation: Too many retries waiting for machine to be Running. Last error: Maximum number of retries (120) exceeded, waiting for agent to check in and apply initial plan". Then it deletes and tries to create the node again. I will leave it like this overnight; let me know what you want me to try @great-bear-19718. Thanks
Another question is should I not be able to add multiple profiles to the VGPU of a pool?
g
no you cannot.. since we only allow 1 vgpu profile per machine
and pool serves as a template for machine
the actual provisioning error is from rancher
what is the version of rancher you are using?
n
2.8.5 @great-bear-19718 I am also having issues with just creating a VM with PCI vGPU passthrough, so it isn't just Rancher cluster VMs
ok so since my profile is a split of the GPU into 4 x 8gb I could deploy up to 4 pods from that profile.
here is a new bundle with the basic ubuntu vm and vgpu that won't start
g
so issue is GPU name
how did you launch the VM?
gpu name should be
nvidia.com/NVIDIA_RTX5000-ADA-8Q
the vm spec specifies..
nvidia.com/NVIDIA_RTX5000-Ada-8Q
which is not the same as it is case sensitive
n
I feel we are getting close to finding a final resolution here. I created the VM using the GUI; I did not write the config. That is the name within the profiles though.
Copy code
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # ls
nvidia-1028  nvidia-1029  nvidia-1030  nvidia-1031  nvidia-1032  nvidia-1033  nvidia-1034  nvidia-1035  nvidia-1036  nvidia-1037  nvidia-1038  nvidia-1039  nvidia-1040  nvidia-1041
nvidia-driver-runtime-lgzr2:/sys/bus/pci/devices/0000:0b:00.0/virtfn0/mdev_supported_types # cat nvidia-1033/name
NVIDIA RTX5000-Ada-8Q
Here is some more Queries of the devices.
Copy code
nvidia-driver-runtime-lgzr2:/ # cd /sys/bus/mdev/devices
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls
475c556b-4efb-4714-974c-e080715f8a6d  5bbbece9-d755-4e38-a34f-cae18bfeda0b  6f520761-e091-4a9d-b012-dab7ac476f4d  77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # ls -l
total 0
lrwxrwxrwx 1 root root 0 Aug  6 13:48 475c556b-4efb-4714-974c-e080715f8a6d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.5/475c556b-4efb-4714-974c-e080715f8a6d
lrwxrwxrwx 1 root root 0 Aug  6 13:48 5bbbece9-d755-4e38-a34f-cae18bfeda0b -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.6/5bbbece9-d755-4e38-a34f-cae18bfeda0b
lrwxrwxrwx 1 root root 0 Aug  6 13:48 6f520761-e091-4a9d-b012-dab7ac476f4d -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.4/6f520761-e091-4a9d-b012-dab7ac476f4d
lrwxrwxrwx 1 root root 0 Aug  6 13:48 77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37 -> ../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.7/77e13f64-dd22-42f2-a8c8-ab3f3c4a0e37
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices # cd 475c556b-4efb-4714-974c-e080715f8a6d
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # ls
driver  iommu_group  mdev_type  nvidia  power  remove  subsystem  uevent
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d # cd mdev_type
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # ls
available_instances  create  description  device_api  devices  name
nvidia-driver-runtime-lgzr2:/sys/bus/mdev/devices/475c556b-4efb-4714-974c-e080715f8a6d/mdev_type # cat name
NVIDIA RTX5000-Ada-8Q
g
i will check the logic.. why it is doing that.. but for now you just need to add the name as i mentioned
i think i know the reason.. most of the A series GPU's have always returned the profile in upper case.. the plugin always uses upper case and that is causing the issue.. with a mismatch
i can fix this and give you a dev build with the fixes
and i will try and get them into next release
are you able to please update the pcidevices controller ds to use this dev image..
Copy code
gmehta3/pcidevices:vgpu-type-fix
once the image is rolled out, you just need to disable the vgpudevices / sriovgpudevice and re-enable the gpu, and new vgpudevices with correct naming will be created
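A sketch of that image swap from the CLI; the daemonset and container names below are assumptions, so confirm them with the first command:
Copy code
kubectl -n harvester-system get ds
# assuming the daemonset and its container are both named pcidevices-controller
kubectl -n harvester-system set image daemonset/pcidevices-controller \
  pcidevices-controller=gmehta3/pcidevices:vgpu-type-fix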
n
@great-bear-19718 I do see that the profiles are now in fact created in all caps, as you can see in the screenshot. Doesn't look like there has been a change in the outcome though. Via Cluster Manager, adding a machine with a vGPU, I get an error.
Copy code
"Failed creating server [fleet-default/gputest-gpu-6eb49156-dvm6n] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-rq7q6 in infrastructure provider: CreateError: Downloading driver from <https://k8s.koat.ai/assets/docker-machine-driver-harvester> Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get <http://settings.harvesterhci.io|settings.harvesterhci.io> server-version)""
AND
"Failed deleting server [fleet-default/gputest-gpu-6eb49156-bj4p5] of kind (HarvesterMachine) for machine gputest-gpu-5476f67bfbxd9jg9-4zbjp in infrastructure provider: DeleteError: Downloading driver from <https://k8s.koat.ai/assets/docker-machine-driver-harvester> Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped error loading host gputest-gpu-6eb49156-bj4p5: Docker machine "gputest-gpu-6eb49156-bj4p5" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one."
Trying to use the vGPU in a VM it sits at.
Copy code
Virt-launcher pod has not yet been scheduled
The YAML for the VM still uses the lowercase name; is that intended?
Copy code
gpus:
            - deviceName: nvidia.com/NVIDIA_RTX5000-Ada-8Q
              name: harvesterdev7-00000b007
I have also tried with a manual update to the yaml config
Copy code
gpus:
            - deviceName: nvidia.com/NVIDIA_RTX5000-ADA-8Q
              name: harvesterdev7-00000b007
It looks like it got further; see the logs here: https://pastebin.com/raw/wg1Ktrpe. Here is another support package. I initially tried following your instructions to disable the vgpudevices / sriovgpudevice and re-enable the gpu so new vgpudevices get created; then I tried a complete disable of the addon and re-enable of everything.
@great-bear-19718 hope you had a great weekend. Looking forward to picking this up this week and getting things working.