# harvester
a
This message was deleted.
🎉 3
👍 1
🔥 1
m
Congratulations! Mind sharing some insights? We are currently trying to get a single Harvester node running with 2 GPUs, with the goal of being able to slice the GPUs at the Kubernetes layer. Do you have a blog post maybe? 🙂
n
What's the GPU? With the one I have, vGPU was the only option. Something that might help you: I created this cloud-init script to prepare the cloud image for the GPU. You would need to replace the locations that grab the driver and license.
### rke2-gpu:
#cloud-config
package_update: true
packages:
  - qemu-guest-agent
  - iptables
  - build-essential
runcmd:
  # Enable the guest agent so Harvester can see the VM's IP and state
  - - systemctl
    - enable
    - '--now'
    - qemu-guest-agent.service
  # Fetch and install the NVIDIA GRID (vGPU) guest driver; replace the URL with wherever you host it
  - wget http://<YourIP>:8080/vgpu/NVIDIA-Linux-x86_64-550.90.07-grid.run -O NVIDIA-Linux-x86_64-550.90.07-grid.run
  - chmod +x NVIDIA-Linux-x86_64-550.90.07-grid.run
  - apt update
  - apt install -y iptables build-essential
  - ./NVIDIA-Linux-x86_64-550.90.07-grid.run --silent
  # Install the NVIDIA container toolkit and wire it into containerd
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=containerd
  # Configure GRID licensing and drop in the client configuration token; replace the token URL with your own
  - cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
  - sed -i 's/FeatureType=0/FeatureType=2/' /etc/nvidia/gridd.conf
  - echo "GpuVirtualization=1" >> /etc/nvidia/gridd.conf
  - echo "SchedulingPolicy=1" >> /etc/nvidia/gridd.conf
  - wget http://<YourIP>:8080/vgpu/client_configuration_token_03-07-2025-14-51-05.tok -O /etc/nvidia/ClientConfigToken/client_configuration_token_03-07-2025-14-51-05.tok
  - service nvidia-gridd restart
  - service nvidia-gridd status
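A quick sanity check inside the VM once cloud-init finishes (a sketch, assuming the grid driver and nvidia-gridd came up cleanly):
nvidia-smi
nvidia-smi -q | grep -i -A 2 license
systemctl status nvidia-gridd
The second command should report the vGPU license status once the client configuration token has been picked up.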
Then this was my GPU operator install, run on the cluster afterwards:
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector."nvidia\.com/gpu"=true \
  --set toolkit.enabled=false \
  --set driver.enabled=false \
  --set licensing.enabled=false \
  --set runtime.enabled=true \
  --set runtime.defaultRuntime=nvidia \
  --set dcgmExporter.enabled=true \
  --set gpuFeatureDiscovery.enabled=true \
  --set gpuSharing.enabled=true \
  --set gpuSharing.slices=2
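Depending on the gpu-operator chart version, the gpuSharing.* values may not be recognized; newer releases configure time-slicing through a ConfigMap that the device plugin reads. A rough sketch of that route (the ConfigMap name and profile key here are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # placeholder name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
and then point the operator at it:
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
With replicas: 2, each physical GPU is advertised as two allocatable nvidia.com/gpu resources.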
Then this was my test deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-smi-test
  namespace: gpu-operator
  labels:
    app: nvidia-smi-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-smi-test
  template:
    metadata:
      labels:
        app: nvidia-smi-test
      annotations:
        nvidia.com/gpu.deploy.container.runtime: "nvidia"  # Optional: helps in selecting the runtime
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:12.4.0-runtime-ubuntu20.04
        command: ["sleep", "infinity"]  # Keeps the container running
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
      restartPolicy: Always
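Once the pod is running, a quick way to confirm it actually sees a GPU slice is to exec into it and run nvidia-smi:
kubectl -n gpu-operator exec deploy/nvidia-smi-test -- nvidia-smi
If the runtime and device plugin are wired up correctly, it prints the GPU visible inside the container.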
m
Thank you. We are still in the process of deploying the infra, not yet at the stage of actually trying to do anything with the GPUs. They are H100s, and we successfully passed one through to a VM directly on Harvester, but of course we would like to be able to use them as resources we can allocate/split as desired between workloads in a cluster later.