# harvester
Sorry, I'm new to Slack.
Hi, sorry about that, I am new to Slack... What you are describing sounds like a problem with Kubernetes scheduling. The last time I worked with NVIDIA GPUs and Kubernetes, there was an "NVIDIA device plugin" that takes care of letting Kubernetes know how many GPUs are available in your cluster. It could be that this plugin is not running correctly, so Kubernetes thinks there are no GPUs available and won't schedule your VM because of it.
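For example, one quick way to check both things (this assumes the upstream NVIDIA device plugin DaemonSet in kube-system; the pod names and namespace may differ in your cluster):
# Check whether the NVIDIA device plugin pods are running on your nodes
kubectl get pods -n kube-system -o wide | grep -i nvidia-device-plugin
# Check which GPU resources the node currently advertises to the scheduler
kubectl describe node <node-name> | grep -A 10 -e Capacity -e Allocatable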
I've run
kubectl describe node inog02
and this was returned:
Capacity:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  1
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  4
  ...
Allocatable:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  1
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  4
  ...
It seems the node is missing the other 3 GPUs. We restarted the host with the GPUs, and running the same command now returns:
Capacity:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  0
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  0
  ...
Allocatable:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  0
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  0
  ...
It looks like the node is now missing the GPUs entirely.
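A narrower way to double-check the same thing, showing only the resource counts the scheduler sees (jsonpath output is unformatted JSON):
# Print only the node's allocatable resources, including the nvidia.com/... entries above
kubectl get node inog02 -o jsonpath='{.status.allocatable}'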
Thanks @lively-zebra-61132 for the suggestion about the device plugin. Since that plugin is not mentioned in the documentation on PCI passthrough, I am wondering whether Harvester has its own device plugin and whether I might break something by running the one from NVIDIA. For the past few hours I have tried disabling passthrough for the GPUs and enabling it again, but I still see 0 devices under Capacity and under Allocatable. I assume Harvester is not fully aware of these devices after the reboot. Is there any way to force rediscovery of these devices? @limited-breakfast-50094 any ideas what I might be doing wrong? Thanks
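In case it helps others, here is a rough sketch of what one could check on the Harvester side (the devices.harvesterhci.io resource names below come from the pcidevices addon and may differ between versions):
# List the PCIDevice objects the pcidevices-controller has discovered
kubectl get pcidevices.devices.harvesterhci.io | grep -i 3090
# List the claims created when passthrough is enabled for a device
kubectl get pcideviceclaims.devices.harvesterhci.io
# Check the controller pods themselves
kubectl get pods -n harvester-system -o wide | grep pcidevices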
We managed to solve the problem by disabling PCI passthrough on the GPUs and then deleting the pcidevices-controller pod on the host with the GPUs:
kubectl delete pod -n harvester-system harvester-pcidevices-controller-f29sr --grace-period=0 --force
A new pod was created by Harvester, and it has now found all 4 of the GPUs. Thanks all for the help.
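For anyone hitting the same thing, a sketch of the workaround above (the pod name suffix is random, so look it up first; inog02 is just our node name):
# Find the pcidevices-controller pod running on the GPU host
kubectl get pods -n harvester-system -o wide | grep pcidevices | grep inog02
# Force-delete it so Harvester recreates it and rescans the PCI devices
kubectl delete pod -n harvester-system <pcidevices-controller-pod> --grace-period=0 --force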
Hi @full-crayon-745, my PR will fix this: it introduces our custom DevicePlugin, which solves this issue. The reboot problem and the out-of-date Allocatable counts happen because 1.1.0+ (pre-1.1.2) doesn't use DevicePlugins; we directly modify the KubeVirt config, which is awkward and unreliable in practice. The NVIDIA DevicePlugin can work, but it's not supported. The DevicePlugin in this PR will be supported and is customized for Harvester's use cases; for example, it allows more than just NVIDIA devices.
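If you want to see that direct-KubeVirt-config approach in your own cluster, a rough check (this assumes the KubeVirt CR lives in harvester-system, which may vary by version):
# Show the host devices KubeVirt is currently configured to permit for passthrough
kubectl get kubevirt -n harvester-system -o jsonpath='{.items[0].spec.configuration.permittedHostDevices}'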
Great to hear that. Thanks, Tobi!