handsome-receptionist-60256

05/24/2023, 9:36 AM
Hey all, I'm having a problem setting up a k3s node with GPU access. I tried some configs and it works with v1.21.4, but it doesn't seem to work with v1.25.8. The drivers and nvidia-container-runtime are installed properly and the cluster works, but I can't get access to GPU resources. I have modified config.toml.tmpl in /var/lib/rancher/k3s/agent/etc/containerd/, but I'm probably still missing something. I have also tried the sample config file from the k3d docs for CUDA workloads, but as I said, it only works on the older version. Any tips or ideas on what I'm missing?
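(A rough sanity-check sketch for the agent node; paths assume a default k3s install, and recent k3s versions are expected to add the nvidia runtime to the generated config.toml automatically when they detect nvidia-container-runtime:)

nvidia-smi                                   # the driver sees the GPU on the host
which nvidia-container-runtime               # the runtime binary is on the PATH
grep -A 3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml   # did k3s pick up the runtime?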

boundless-spoon-6503

05/24/2023, 3:15 PM
I have it working in v1.26.3+k3s1
Did you deploy the nvidia RuntimeClass?
Apply this and show me the logs of the pod:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    # resources:
    #   limits:
    #     nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
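(Assuming the manifest above is saved locally, for example as nbody-gpu-benchmark.yaml, with that file name being purely illustrative, it can be applied and checked roughly like this:)

kubectl apply -f nbody-gpu-benchmark.yaml
kubectl get pod nbody-gpu-benchmark -n default    # should reach Completed if the GPU is reachable
kubectl logs nbody-gpu-benchmark -n default       # prints the nbody benchmark results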

handsome-receptionist-60256

05/25/2023, 8:19 AM
Thank you for your reply 🙂 I have tried this and similar pods with the GPU as a resource, but the pod can't start because it can't find nvidia.com/gpu. The interesting thing is that when I try on a single instance with the GPU and drivers installed (nvidia-smi shows output), with nvidia-container-runtime, the NVIDIA k8s device plugin installed with Helm, gpu-discovery, and the GPU operator installed with Helm, it works, and you can also see nvidia.com/gpu: 1 if you describe the node. I was trying to build a cluster with 1 master and 1 agent using the same steps, but it can't find the GPU: when you describe the nodes, no GPUs show up as available. I'm probably missing some configuration.
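(For reference, a rough sketch of the working single-node setup described above; the Helm repo and chart names are assumptions based on NVIDIA's published charts:)

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n kube-system
kubectl describe node <agent-node> | grep -i nvidia.com/gpu    # should list nvidia.com/gpu: 1 under Capacity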

boundless-spoon-6503

05/25/2023, 11:26 AM
I commented out the line with the GPU limit on purpose. It's not necessary unless you launch a pod that requests that resource; what is mandatory is using the nvidia runtime.
Could you launch that YAML as-is and paste the output?
You can run NVIDIA CUDA workloads even if this command gives you 0:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
If you want it to report 1, just deploy the k8s-device-plugin with this command: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml, or, as you said, with the gpu-operator.
To sum up: nvidia-container-runtime and the nvidia RuntimeClass are both mandatory; the gpu-operator is not needed, and the nvidia-device-plugin is needed only if you run containers that request the nvidia.com/gpu resource rather than just setting the runtime class.
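(A quick way to confirm both mandatory pieces, plus the optional device plugin, are in place; the grep pattern is just illustrative:)

kubectl get runtimeclass nvidia                    # the RuntimeClass from the manifest above
kubectl get pods -n kube-system | grep -i nvidia   # the device-plugin pod, only if you installed it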

handsome-receptionist-60256

05/25/2023, 12:29 PM
Thanks a lot for the explanation. I managed to get it up and running after a few tries; maybe it was some misconfiguration of the nvidia-container-runtime. Now the NVIDIA GPU even shows up when you describe the node.

boundless-spoon-6503

05/25/2023, 1:34 PM
Thanks, that's good news 🙂