# harvester
g
Can you please check what image the nvidia-driver-runtime ds in the harvester-system namespace is using?
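A quick way to check this (a sketch, assuming kubectl access to the cluster):
Copy code
# print the image(s) configured on the nvidia-driver-runtime DaemonSet
kubectl -n harvester-system get ds nvidia-driver-runtime \
  -o jsonpath='{.spec.template.spec.containers[*].image}'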
g
Hi! I got it working by manually editing the ds. I did a docker login and pulled the image down from our internal artifactory instance.
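Roughly, that workflow looks like the following sketch (the registry and image path are taken from the pod events below; credentials and exact paths will vary):
Copy code
# authenticate against the internal registry, then pull the driver toolkit image
docker login artifactory.domain.net
docker pull artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613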
Copy code
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       63s                default-scheduler  Successfully assigned harvester-system/nvidia-driver-runtime-vl9kk to hrn0
  Normal   AddedInterface  63s                multus             Add eth0 [10.52.0.112/32] from k8s-pod-network
  Normal   Pulled          63s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 28.826693ms (28.83368ms including waiting)
  Normal   Pulled          60s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 24.007639ms (24.014945ms including waiting)
  Normal   Pulled          44s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 27.234193ms (27.242451ms including waiting)
  Normal   Pulling         16s (x4 over 63s)  kubelet            Pulling image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613"
  Normal   Created         16s (x4 over 63s)  kubelet            Created container nvidia-driver-ctr
  Normal   Started         16s (x4 over 63s)  kubelet            Started container nvidia-driver-ctr
  Normal   Pulled          16s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 26.040731ms (26.046675ms including waiting)
  Warning  BackOff         2s (x5 over 58s)   kubelet            Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-runtime-vl9kk_harvester-system(4fe3a0f7-03ab-4aea-88e6-f7b9656c07e5)
However, now there is a crash loop. I can still pass vGPU devices into the VMs.
Copy code
nvidia-driver-runtime-vl9kk                            0/1     CrashLoopBackOff   5 (79s ago)     4m27
Here are the container logs:
Copy code
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.16......................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: An NVIDIA kernel module 'nvidia-vgpu-vfio' appears to be already loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Some of the sanity checks that nvidia-installer performs to detect potential installation problems are not possible while an NVIDIA kernel module is running.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.


Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 72 CPUs online; setting concurrency level to 32.
Scanning the initramfs with lsinitrd...
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
Initramfs scan failed.
Would you like to continue installation and skip the sanity checks? If not, please abort the installation, then close any programs which may be using the NVIDIA GPU(s), and attempt installation again. (Answer: Abort installation)
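The installer error above points at /var/log/nvidia-installer.log; a sketch for pulling that file out of the failing pod (this assumes the container stays up long enough to exec into between restarts):
Copy code
# read the nvidia-installer log from inside the driver pod
kubectl -n harvester-system exec nvidia-driver-runtime-vl9kk -- \
  cat /var/log/nvidia-installer.log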
Inside the Windows VMs, I am getting a code 12 error.
g
You can reboot the node to force the driver to be reloaded, and the error should go away in the driver pod.
I cannot comment on the error in the Windows VM.
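A quick sanity check after the reboot (a sketch, run directly on the node) to see which NVIDIA kernel modules are loaded before the installer runs again:
Copy code
# list loaded NVIDIA modules; nvidia_vgpu_vfio showing up here is what tripped the installer
lsmod | grep nvidia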
g
Thank you for your help so far, I really appreciate it! I restarted like you mentioned and manually edited the nvidia-driver-runtime ds with kubectl:
Copy code
kubectl edit ds nvidia-driver-runtime -n harvester-system
I did this since the UI doesn't seem to change the image name.
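A non-interactive equivalent is sketched below (it assumes the container name nvidia-driver-ctr and the image path shown in the events above):
Copy code
# point the DaemonSet's driver container at the internal registry image
kubectl -n harvester-system set image ds/nvidia-driver-runtime \
  nvidia-driver-ctr=artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613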
That worked, but I am still getting a code 12 in the Windows VM. I saw this in the Harvester VM logs:
Copy code
{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_NVIDIA_A2-4A not set for resource <http://nvidia.com/NVIDIA_A2-4A%22,%22pos%22:%22addresspool.go:51%22,%22timestamp%22:%222024-07-24T13:10:43.952455Z%22}|nvidia.com/NVIDIA_A2-4A","pos":"addresspool.go:51","timestamp":"2024-07-24T13:10:43.952455Z"}>
{"component":"virt-launcher","level":"info","msg":"host-devices created: [040781e4-a40c-4f7d-8657-5d65d2979ef4]","pos":"hostdev.go:98","timestamp":"2024-07-24T13:10:43.952525Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"wintest","namespace":"default","pos":"server.go:208","timestamp":"2024-07-24T13:10:43.953553Z","uid":"0b1180b8-4f0d-4a62-80eb-6cad8b5ecb3d"}
2024-07-24T13:10:
What does this line mean?
Copy code
{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_NVIDIA_A2-4A not set for resource <http://nvidia.com/NVIDIA_A2-4A%22,%22pos%22:%22addresspool.go:51%22,%22timestamp%22:%222024-07-24T13:10:43.952455Z%22}|nvidia.com/NVIDIA_A2-4A","pos":"addresspool.go:51","timestamp":"2024-07-24T13:10:43.952455Z"}>
g
The device has been passed through; the error is inside the Windows VM, so that would be the right place to check.
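For completeness, one way to confirm from the Kubernetes side that the vGPU resource was handed to the VM (a sketch; the virt-launcher pod name below is hypothetical, and the compute container name is the KubeVirt default):
Copy code
# find the virt-launcher pod backing the wintest VM (the suffix will differ)
kubectl -n default get pods | grep virt-launcher-wintest
# check whether the NVIDIA PCI resource environment variables were injected
kubectl -n default exec virt-launcher-wintest-abcde -c compute -- env | grep -i nvidia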