# k3s
b
I have it working in v1.26.3+k3s1
Did you deploy the nvidia runtime class?
Apply this and show me the logs of the pod:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    # resources:
    #   limits:
    #     nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```
h
thank you for your reply 🙂 I have tried this and similar pods with the GPU as a resource, but the pod can't start because it can't find nvidia.com/gpu. The interesting thing is that on a single instance with a GPU and drivers installed (nvidia-smi shows output), with nvidia-container-runtime, the NVIDIA k8s device plugin installed with Helm, GPU feature discovery, and the GPU Operator installed with Helm, it works. You can also see nvidia.com/gpu: 1 if you describe the node. I was trying to build a cluster with 1 master and 1 agent following the same steps, but it can't find any GPU: when you describe the nodes there are no GPUs available. I'm probably missing some configuration.
b
I commented out the line with the GPU limit on purpose. It's not necessary unless you launch a pod that requests that resource; what is mandatory is to use the nvidia runtime.
Could you launch that yaml as it is and paste the output?
You can run nvidia cuda workloads even if this command gives you 0:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
If you want it to show 1, just deploy the k8s-device-plugin with this command: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml or, as you said, with the gpu-operator.
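For reference, once the device plugin is running, a pod can request the GPU explicitly through resources instead of relying only on the runtime class. A minimal sketch (the pod name here is just an example; the image is the same sample one as above):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-limit   # hypothetical name, just for illustration
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1   # the resource the device plugin advertises on the node
```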
In summary, nvidia-container-runtime and the runtime class are both mandatory; the GPU Operator is not needed, and the nvidia-device-plugin is needed only if you run containers that request the nvidia.com/gpu resource instead of relying on the runtime class alone.
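If the nvidia runtime doesn't seem to be picked up, one quick thing to check (assuming a default k3s install; paths may differ) is the containerd config that k3s generates. If nvidia-container-runtime was installed before the agent started, an nvidia runtime entry should appear there:
```sh
# containerd config generated by k3s (default location on a standard install)
grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# and on the host itself the driver should be visible
nvidia-smi
```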
h
thanks a lot for the explanation. I managed to get it up and running after a few tries; maybe it was some misconfiguration of the nvidia container runtime. Now the nvidia gpu shows up when you describe the node.
b
Thanks, that's good news 🙂