full-crayon-745

01/17/2023, 11:06 AM
Hi guys, we have 1 host with 4 Nvidia 3090 GPUs with passthrough enabled and 1 VM with these GPUs attached that was working well. Recently, I switched off the VM and restarted the host. Once the host returned from maintenance, I started the VM, but the VM did not start and Harvester returned an error message complaining about “Insufficient nvidia.com/GA102_GEFORCE_RTX_3090”. I created a new VM and attached the GPUs to it, but that VM will not start either and I'm getting this error (combined from similar events):
server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2023-01-17T10:18:36.433107Z qemu-system-x86_64: -device vfio-pci,host=0000:81:00.0,id=ua-hostdevice-GA102_GEFORCE_RTX_30901,bus=pci.8,addr=0x0: vfio 0000:81:00.0: group 30 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
Not sure what I'm doing wrong here, since the GPUs worked once already. Has anyone here worked with Nvidia GPUs and successfully managed to attach them to a VM? Thanks
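For anyone who hits the same “group is not viable” message: a quick way to check which devices share that IOMMU group and which driver each one is bound to is something like the following (using the 0000:81:00.0 address from the error above; adjust for your own setup):
# list every device in the same IOMMU group as the GPU
ls /sys/bus/pci/devices/0000:81:00.0/iommu_group/devices
# show the driver each device in that group is currently bound to
for dev in /sys/bus/pci/devices/0000:81:00.0/iommu_group/devices/*; do
  if [ -e "$dev/driver" ]; then
    echo "$(basename "$dev"): $(basename "$(readlink "$dev/driver")")"
  else
    echo "$(basename "$dev"): no driver bound"
  fi
done
For passthrough to work, every device in that group needs to show vfio-pci.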

lively-zebra-61132

01/17/2023, 12:53 PM
Hi, sorry about that, I'm new to Slack... What you are describing sounds like a problem with Kubernetes scheduling. The last time I worked with NVIDIA GPUs and Kubernetes, there was an "NVIDIA device plugin" which takes care of letting Kubernetes know how many GPUs are available in your cluster. It could be that this plugin is not running correctly, so Kubernetes thinks there are no GPUs available and won't schedule your VM because of it.
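If it is the device plugin, the node's advertised resources should reflect it. Something along these lines (the node name is just a placeholder) prints what the scheduler currently thinks is allocatable, and whether any device-plugin pod is running at all:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
kubectl get pods -A | grep -i device-plugin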

full-crayon-745

01/17/2023, 1:41 PM
I've run
kubectl describe node inog02
and this was returned:
Capacity:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  1
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  4
  ...
Allocatable:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  1
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  4
  ...
It seems the node is missing the other 3 GPUs. We restarted the host with the GPUs, and now running the same command returns:
Capacity:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  0
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  0
  ...
Allocatable:
  ...
  nvidia.com/GA102_GEFORCE_RTX_3090:                  0
  nvidia.com/GA102_HIGH_DEFINITION_AUDIO_CONTROLLER:  0
  ...
It looks like it is completely missing the GPUs
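If I understand the pcidevices addon correctly, it also exposes its discovery results as CRDs, so something like this should show what the controller itself currently sees (assuming those resources exist on your install):
kubectl get pcidevices | grep -i nvidia
kubectl get pcideviceclaims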
Thanks @lively-zebra-61132 for the suggestion about the device plugin. Since this plugin is not mentioned in the documentation regarding PCI passthrough, I am wondering whether Harvester has its own device plugin and whether I might mess something up by running the plugin from Nvidia. For the past few hours I have tried disabling passthrough and enabling it again for the GPUs, but I still see 0 devices under Capacity and under Allocatable. I assume that Harvester is not fully aware of these devices after the reboot. Is there any way to force the discovery of these devices? @limited-breakfast-50094 Any ideas what I might be doing wrong? Thanks
We managed to solve the problem by disabling PCI passthrough on the GPUs and then deleting the pcidevices-controller pod on the host with the GPUs:
kubectl delete pod -n harvester-system harvester-pcidevices-controller-f29sr --grace-period=0 --force
A new pod was created by Harvester, and it now finds all 4 GPUs. Thanks all for the help
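In case anyone else runs into this: the pod name suffix will differ, so something along these lines should find the pcidevices-controller pod on the affected node and confirm the GPUs are advertised again afterwards:
kubectl get pods -n harvester-system -o wide | grep pcidevices-controller
kubectl describe node inog02 | grep -i nvidia.com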

limited-breakfast-50094

01/18/2023, 5:28 PM
Hi @full-crayon-745, my PR will fix this; it introduces our custom DevicePlugin that solves this issue. The rebooting issue and the out-of-date Allocatable counts happen because 1.1.0+ (pre-1.1.2) doesn't use DevicePlugins; we directly modify the KubeVirt config, which is awkward and unreliable in practice. The NVIDIA DevicePlugin can work, but it's not supported. This DevicePlugin PR will be supported and customized for Harvester's use cases; it allows more than just NVIDIA devices, for example.
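For the curious: the part that gets edited directly is the permittedHostDevices section of the KubeVirt CR. Something like this should show what is currently configured on an existing install:
kubectl get kubevirt -A -o yaml | grep -A 10 permittedHostDevices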

full-crayon-745

01/19/2023, 8:05 AM
Great to hear that. Thanks Tobi