handsome-receptionist-60256

05/24/2023, 9:36 AM
Hey all, I'm having a problem setting up a k3s node with GPU access. I tried some configs and it works with v1.21.4, but it doesn't seem to work with v1.25.8. The drivers and nvidia-container-runtime are installed properly and the cluster works, but I can't get access to GPU resources. I have modified config.toml.tmpl in /var/lib/rancher/k3s/agent/etc/containerd/, but I'm probably still missing something. I have also tried the sample config file from the k3d docs for CUDA workloads, but as I said, it only works on the older version. Any tips or ideas on what I'm missing?
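(A rough sanity-check sketch for the agent node; paths assume a default k3s install, and recent k3s versions are expected to add the nvidia runtime to the generated config.toml automatically when they detect nvidia-container-runtime:)

nvidia-smi                                   # the driver sees the GPU on the host
which nvidia-container-runtime               # the runtime binary is on the PATH
grep -A 3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml   # did k3s pick up the runtime?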

boundless-spoon-6503

05/24/2023, 3:15 PM
I have it working in v1.26.3+k3s1
Did you deploy the nvidia RuntimeClass?
Apply this and show me the logs of the pod:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    # resources:
    #   limits:
    #     nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
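(Assuming the manifest above is saved locally, for example as nbody-gpu-benchmark.yaml, with that file name being purely illustrative, it can be applied and checked roughly like this:)

kubectl apply -f nbody-gpu-benchmark.yaml
kubectl get pod nbody-gpu-benchmark -n default    # should reach Completed if the GPU is reachable
kubectl logs nbody-gpu-benchmark -n default       # prints the nbody benchmark results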

handsome-receptionist-60256

05/25/2023, 8:19 AM
Thank you for your reply 🙂 I have tried this and similar pods with the GPU as a resource, but the pod can't start because it can't find nvidia.com/gpu. The interesting thing is that when I try on a single instance with the GPU and drivers installed (nvidia-smi shows output), with nvidia-container-runtime, the NVIDIA k8s device plugin installed with Helm, gpu-discovery, and the GPU operator installed with Helm, it works, and you can also see nvidia.com/gpu: 1 if you describe the node. I was trying to build a cluster with 1 master and 1 agent using the same steps, but it can't find the GPU: when you describe the nodes, no GPUs show up as available. I'm probably missing some configuration.
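(For reference, a rough sketch of the working single-node setup described above; the Helm repo and chart names are assumptions based on NVIDIA's published charts:)

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n kube-system
kubectl describe node <agent-node> | grep -i nvidia.com/gpu    # should list nvidia.com/gpu: 1 under Capacity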

boundless-spoon-6503

05/25/2023, 11:26 AM
I commented out the line with the GPU limit on purpose. It's not necessary unless you launch a pod that requests that resource; what is mandatory is using the nvidia runtime.
Could you launch that YAML as-is and paste the output?
You can run NVIDIA CUDA workloads even if this command gives you 0:
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
If you want it to report 1, just deploy the k8s-device-plugin with this command: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml, or, as you said, with the gpu-operator.
To sum up: nvidia-container-runtime and the nvidia RuntimeClass are both mandatory; the gpu-operator is not needed, and the nvidia-device-plugin is needed only if you run containers that request the nvidia.com/gpu resource rather than just setting the runtime class.
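(A quick way to confirm both mandatory pieces, plus the optional device plugin, are in place; the grep pattern is just illustrative:)

kubectl get runtimeclass nvidia                    # the RuntimeClass from the manifest above
kubectl get pods -n kube-system | grep -i nvidia   # the device-plugin pod, only if you installed it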

handsome-receptionist-60256

05/25/2023, 12:29 PM
Thanks a lot for the explanation. I managed to get it up and running after a few tries; maybe it was some misconfiguration of the nvidia-container-runtime. Now the NVIDIA GPU even shows up when you describe the node.

boundless-spoon-6503

05/25/2023, 1:34 PM
Thanks, that's good news 🙂