# k3s
a
b
yes, it's possible...
the node tags are included with https://github.com/NVIDIA/k8s-device-plugin
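once the plugin is running, a quick way to check is something like this (rough sketch; `<node-name>` is just a placeholder):
```bash
# The node should show nvidia.com labels and an allocatable nvidia.com/gpu resource
kubectl describe node <node-name> | grep -i nvidia
```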
make sure the GPU drivers are installed in the OS and the GPU is working.
check with nvidia-smi
make sure to install the nvidia RuntimeClass: https://docs.k3s.io/advanced
NVIDIA Container Runtime Support: K3s will automatically detect and configure the NVIDIA container runtime if it is present when K3s starts.
1. Install the nvidia-container package repository on the node by following the instructions at: https://nvidia.github.io/libnvidia-container/
2. Install the nvidia container runtime packages. For example:
apt install -y nvidia-container-runtime cuda-drivers-fabricmanager-515 nvidia-headless-515-server
3. Install K3s, or restart it if already installed:
curl -ksL get.k3s.io | sh -
4. Confirm that the nvidia container runtime has been found by k3s:
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
check with sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml; you should see something like:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
with this you should already be able to run pods by specifying the runtime class:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    # resources:
    #   limits:
    #     nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
if you want the nodes tagged, install the device plugin or the GPU operator
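a rough sketch of the device plugin install with Helm (repo and chart names as in the k8s-device-plugin README; the kube-system namespace is just my choice, check the current chart version):
```bash
# Sketch: deploy the NVIDIA device plugin via its Helm chart
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin -n kube-system
```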
for the gpu operator, edit the values for:
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
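for example with Helm, something like this (chart and repo from NVIDIA; the release name and namespace are just examples, and the toolkit.env values are the ones above):
```bash
# Sketch: install the GPU operator on k3s, pointing the toolkit at k3s' containerd
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock'
```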
I hope this helps you
a
Yes, the nvidia container runtime is working. With docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi I got my GPU detected.
I did follow the k3s "advanced options" documentation.
But gpu-feature-discovery doesn't set the labels.
this is the only nvidia label
and this is the log of the gpu-feature-discovery pod. Looks like it did not find the GPU.
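for reference, this is roughly how I pulled them (pod name and namespace are placeholders):
```bash
# List node labels and find the discovery/device-plugin pods, then grab the logs
kubectl get nodes --show-labels | tr ',' '\n' | grep -i nvidia
kubectl get pods -A | grep -i -e gpu-feature-discovery -e device-plugin
kubectl logs -n <namespace> <gpu-feature-discovery-pod>
```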
b
this is the container that tags the node, not the feature discovery...
Let me see the logs
a
Some details about my procedure: I am following the "NVIDIA Container Runtime Support" procedure on k3s. I did steps 1 and 2 (install the container runtime packages); nvidia-smi is detecting my GPU. After this I did step 3, an install of k3s.
This is the config.toml of containerd:
This is the log of the nvidia-device-plugin-daemonset POD.
Looks like it found the GPU; in a previous image it did not find it. In the message it asks about the nvidia-container-toolkit, but it is installed. If it were not installed, containerd would not have been updated, right? And as I said, the command:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
runs fine.
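to double-check the toolkit binaries I ran roughly this (just a sanity check; paths may differ):
```bash
# Verify the NVIDIA container toolkit binaries are installed on the node
which nvidia-container-runtime nvidia-container-toolkit nvidia-ctk
nvidia-ctk --version 2>/dev/null || echo "nvidia-ctk not found"
```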
b
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
Could you apply this manifest and check the logs of the pod? Did it work?
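something like this (the filename is just an example, assuming you saved the manifest above as nbody-gpu-benchmark.yaml):
```bash
kubectl apply -f nbody-gpu-benchmark.yaml
kubectl get pod nbody-gpu-benchmark -w   # wait for it to reach Completed
kubectl logs nbody-gpu-benchmark         # should show the nbody benchmark output
```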
if it works, GPU on K3s is working in your cluster, and we have to check why nvidia-device-plugin is not tagging your nodes (meanwhile you can work by specifying runtimeClassName as in the example pod, assigning the complete GPU to the pods requiring it...)... I think the reason is that your path is not correct (based on the logs you attached).
a
"meanwhile you can work by specifying runtimeClassName as in the example pod, assigning the complete GPU to the pods requiring it." In this case, can I have multiple pods using the GPU?
b
yes, you could, but if one consumes the full GPU, none of them will work
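once the device plugin is advertising nvidia.com/gpu, the cleaner option is to request the GPU explicitly so the scheduler enforces exclusive assignment; a rough sketch (the pod name is just an example, same sample image as above):
```bash
# Sketch: same benchmark pod, but requesting the GPU through the device plugin
# (only works once nvidia-device-plugin is running and advertising nvidia.com/gpu)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-limit
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvidiak8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```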
it worked, I mean the pod you just ran detected the GPU and used it on K3s...
please make a full recursive copy, with 777 permissions (if it's a non-prod environment), from /usr/bin/nvidia-container-runtime to /usr/bin/nvidia-ctk (the proper way would be to change the path in nvidia-device-plugin instead)...
relaunch nvidia-device-plugin... then if you check the tags of the node, it should be ok and everything should already be working
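roughly like this (non-prod only; the daemonset name and the kube-system namespace are assumptions, adjust to wherever the plugin runs):
```bash
# Workaround sketch (NOT for production): make /usr/bin/nvidia-ctk exist by
# copying nvidia-container-runtime there, then restart the device plugin
sudo cp -a /usr/bin/nvidia-container-runtime /usr/bin/nvidia-ctk
sudo chmod -R 777 /usr/bin/nvidia-ctk
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
```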
Did it help you @ancient-tomato-94095? Btw, if it's not too much curiosity... could you explain what your use case is with k3s and GPU?
Direct message me if preferred.
a
"I think the reason is that your path is not correct (based on the logs you attached)." I will investigate the issue further.