numerous-angle-77908
06/20/2025, 2:12 PM

mammoth-evening-6464
06/23/2025, 1:32 PM

numerous-angle-77908
06/23/2025, 3:29 PM

### rke2-gpu:
#cloud-config
package_update: true
packages:
  - qemu-guest-agent
  - iptables
  - build-essential
runcmd:
  - - systemctl
    - enable
    - '--now'
    - qemu-guest-agent.service
  - wget http://<YourIP>:8080/vgpu/NVIDIA-Linux-x86_64-550.90.07-grid.run -O NVIDIA-Linux-x86_64-550.90.07-grid.run
  - chmod +x NVIDIA-Linux-x86_64-550.90.07-grid.run
  - apt update
  - apt install -y iptables build-essential
  - ./NVIDIA-Linux-x86_64-550.90.07-grid.run --silent
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=containerd
  - cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
  - sed -i 's/FeatureType=0/FeatureType=2/' /etc/nvidia/gridd.conf
  - echo "GpuVirtualization=1" >> /etc/nvidia/gridd.conf
  - echo "SchedulingPolicy=1" >> /etc/nvidia/gridd.conf
  - wget http://<YourIP>:8080/vgpu/client_configuration_token_03-07-2025-14-51-05.tok -O /etc/nvidia/ClientConfigToken/client_configuration_token_03-07-2025-14-51-05.tok
  - service nvidia-gridd restart
  - service nvidia-gridd status
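For reference (my addition, not from the original messages): a minimal sanity check to run on the node once cloud-init finishes, assuming the GRID driver and nvidia-gridd installed as above:

# Driver loads and reports the GPU
nvidia-smi

# vGPU licensing state as seen by the GRID driver
nvidia-smi -q | grep -i -A 2 license

# Licensing daemon should have picked up the client configuration token
systemctl status nvidia-gridd --no-pager
journalctl -u nvidia-gridd --no-pager | tail -n 20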
Then this was the GPU Operator Helm install I applied to the cluster afterwards:
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector."nvidia\.com/gpu"=true \
  --set toolkit.enabled=false \
  --set driver.enabled=false \
  --set licensing.enabled=false \
  --set runtime.enabled=true \
  --set runtime.defaultRuntime=nvidia \
  --set dcgmExporter.enabled=true \
  --set gpuFeatureDiscovery.enabled=true \
  --set gpuSharing.enabled=true \
  --set gpuSharing.slices=2
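(Again my addition, not from the thread): after the Helm install I would check that the operator pods come up and that the node advertises the nvidia.com/gpu resource. <gpu-node> below is a placeholder for the actual node name:

# Operator, device plugin, DCGM exporter, etc. should reach Running/Completed
kubectl get pods -n gpu-operator

# The node should advertise nvidia.com/gpu as an allocatable resource
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'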
Then this was my test deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-smi-test
  namespace: gpu-operator
  labels:
    app: nvidia-smi-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-smi-test
  template:
    metadata:
      labels:
        app: nvidia-smi-test
      annotations:
        nvidia.com/gpu.deploy.container.runtime: "nvidia" # Optional: helps in selecting the runtime
    spec:
      runtimeClassName: nvidia
      containers:
        - name: nvidia-smi
          image: nvidia/cuda:12.4.0-runtime-ubuntu20.04
          command: ["sleep", "infinity"] # Keeps the container running
          resources:
            limits:
              nvidia.com/gpu: 1 # Request 1 GPU
      restartPolicy: Always
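One way to confirm the test deployment actually sees the GPU (my addition, not from the thread) is to exec nvidia-smi inside the pod:

# Wait for the pod to be ready, then run nvidia-smi in it
kubectl -n gpu-operator rollout status deploy/nvidia-smi-test
kubectl -n gpu-operator exec deploy/nvidia-smi-test -- nvidia-smi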
mammoth-evening-6464
06/27/2025, 6:18 AM