# harvester
g
Can you please check what image the nvidia-driver-runtime ds in the harvester-system namespace is using?
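A quick way to check this (a sketch, assuming kubectl access to the cluster):
Copy code
# print the image(s) configured on the nvidia-driver-runtime DaemonSet
kubectl -n harvester-system get ds nvidia-driver-runtime \
  -o jsonpath='{.spec.template.spec.containers[*].image}'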
g
Hi! I got it working by manually editing the ds. I did a docker login and pulled the image down from our internal artifactory instance.
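Roughly, that workflow looks like the following sketch (the registry and image path are taken from the pod events below; credentials and exact paths will vary):
Copy code
# authenticate against the internal registry, then pull the driver toolkit image
docker login artifactory.domain.net
docker pull artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613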
Copy code
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       63s                default-scheduler  Successfully assigned harvester-system/nvidia-driver-runtime-vl9kk to hrn0
  Normal   AddedInterface  63s                multus             Add eth0 [10.52.0.112/32] from k8s-pod-network
  Normal   Pulled          63s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 28.826693ms (28.83368ms including waiting)
  Normal   Pulled          60s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 24.007639ms (24.014945ms including waiting)
  Normal   Pulled          44s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 27.234193ms (27.242451ms including waiting)
  Normal   Pulling         16s (x4 over 63s)  kubelet            Pulling image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613"
  Normal   Created         16s (x4 over 63s)  kubelet            Created container nvidia-driver-ctr
  Normal   Started         16s (x4 over 63s)  kubelet            Started container nvidia-driver-ctr
  Normal   Pulled          16s                kubelet            Successfully pulled image "artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613" in 26.040731ms (26.046675ms including waiting)
  Warning  BackOff         2s (x5 over 58s)   kubelet            Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-runtime-vl9kk_harvester-system(4fe3a0f7-03ab-4aea-88e6-f7b9656c07e5)
However, now there is a crash loop. I can still pass vGPU devices into the VMs.
Copy code
nvidia-driver-runtime-vl9kk                            0/1     CrashLoopBackOff   5 (79s ago)     4m27
Here are the container logs:
Copy code
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.16......................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: An NVIDIA kernel module 'nvidia-vgpu-vfio' appears to be already loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Some of the sanity checks that nvidia-installer performs to detect potential installation problems are not possible while an NVIDIA kernel module is running.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.


Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 72 CPUs online; setting concurrency level to 32.
Scanning the initramfs with lsinitrd...
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
/usr/bin/lsinitrd requires a file path argument, but none was given.
Initramfs scan failed.
Would you like to continue installation and skip the sanity checks? If not, please abort the installation, then close any programs which may be using the NVIDIA GPU(s), and attempt installation again. (Answer: Abort installation)
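The installer error above points at /var/log/nvidia-installer.log; a sketch for pulling that file out of the failing pod (this assumes the container stays up long enough to exec into between restarts):
Copy code
# read the nvidia-installer log from inside the driver pod
kubectl -n harvester-system exec nvidia-driver-runtime-vl9kk -- \
  cat /var/log/nvidia-installer.log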
Inside the Windows VMs, I am getting a code 12 error.
g
You can reboot the node to force the driver to be reloaded, and the error should go away in the driver pod.
I cannot comment on the error in the Windows VM.
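A quick sanity check after the reboot (a sketch, run directly on the node) to see which NVIDIA kernel modules are loaded before the installer runs again:
Copy code
# list loaded NVIDIA modules; nvidia_vgpu_vfio showing up here is what tripped the installer
lsmod | grep nvidia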
g
Thank you for your help so far, I really appreciate it! I restarted like you mentioned and manually edited the nvidia-driver-runtime ds with kubectl:
Copy code
kubectl edit ds nvidia-driver-runtime -n harvester-system
I did this since the UI doesn't seem to change the image name.
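A non-interactive equivalent is sketched below (it assumes the container name nvidia-driver-ctr and the image path shown in the events above):
Copy code
# point the DaemonSet's driver container at the internal registry image
kubectl -n harvester-system set image ds/nvidia-driver-runtime \
  nvidia-driver-ctr=artifactory.domain.net/folder-1-dev/rancher/harvester-nvidia-driver-toolkit:v1.3-20240613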
That worked, but I am still getting a code 12 in the Windows VM. I saw this in the Harvester VM logs:
Copy code
{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_NVIDIA_A2-4A not set for resource <http://nvidia.com/NVIDIA_A2-4A%22,%22pos%22:%22addresspool.go:51%22,%22timestamp%22:%222024-07-24T13:10:43.952455Z%22}|nvidia.com/NVIDIA_A2-4A","pos":"addresspool.go:51","timestamp":"2024-07-24T13:10:43.952455Z"}>
{"component":"virt-launcher","level":"info","msg":"host-devices created: [040781e4-a40c-4f7d-8657-5d65d2979ef4]","pos":"hostdev.go:98","timestamp":"2024-07-24T13:10:43.952525Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"wintest","namespace":"default","pos":"server.go:208","timestamp":"2024-07-24T13:10:43.953553Z","uid":"0b1180b8-4f0d-4a62-80eb-6cad8b5ecb3d"}
2024-07-24T13:10:
What does this line mean?
Copy code
{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_NVIDIA_A2-4A not set for resource <http://nvidia.com/NVIDIA_A2-4A%22,%22pos%22:%22addresspool.go:51%22,%22timestamp%22:%222024-07-24T13:10:43.952455Z%22}|nvidia.com/NVIDIA_A2-4A","pos":"addresspool.go:51","timestamp":"2024-07-24T13:10:43.952455Z"}>
g
The device has been passed through; the error is inside the Windows VM, so that would be the right place to check.
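For completeness, one way to confirm from the Kubernetes side that the vGPU resource was handed to the VM (a sketch; the virt-launcher pod name below is hypothetical, and the compute container name is the KubeVirt default):
Copy code
# find the virt-launcher pod backing the wintest VM (the suffix will differ)
kubectl -n default get pods | grep virt-launcher-wintest
# check whether the NVIDIA PCI resource environment variables were injected
kubectl -n default exec virt-launcher-wintest-abcde -c compute -- env | grep -i nvidia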