# harvester
n
πŸŽ‰ I'd like to announce, after many battles and long nights over the past months (on and off), that I finally have an 8-node Harvester setup with vGPU nodes being propagated to clusters via Rancher's Cluster Management. πŸŽ‰ Great support from the community, especially @great-bear-19718. I still have some final refinements coming in v1.5.1 for the vGPU support, since I have an A5000 ADA that acts just a bit differently than other cards for vGPU, and in v1.6.0 I hope to have the libvirt issues resolved, as I have some NUMA issues on 2 nodes that have a Threadripper 2990WX 32-core CPU. But for now the small fixes have allowed things to work as intended.
m
Congratulations! Mind sharing some insights? We are currently trying to get a single Harvester host with 2 GPUs running, with the goal of being able to slice a GPU at the Kubernetes layer. Do you have a blog post maybe? πŸ™‚
n
What's the GPU? With the one I have, vGPU was the only option. Something that might help you: I created this cloud-init script to prepare the cloud image for the GPU. You would need to replace the locations that grab the driver and the license token.
```yaml
### rke2-gpu:
#cloud-config
package_update: true
packages:
  - qemu-guest-agent
  - iptables
  - build-essential
runcmd:
  - - systemctl
    - enable
    - '--now'
    - qemu-guest-agent.service
  - wget http://<YourIP>:8080/vgpu/NVIDIA-Linux-x86_64-550.90.07-grid.run -O NVIDIA-Linux-x86_64-550.90.07-grid.run
  - chmod +x NVIDIA-Linux-x86_64-550.90.07-grid.run
  - apt update
  - apt install -y iptables build-essential
  - ./NVIDIA-Linux-x86_64-550.90.07-grid.run --silent
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=containerd
  - cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
  - sed -i 's/FeatureType=0/FeatureType=2/' /etc/nvidia/gridd.conf
  - echo "GpuVirtualization=1" >> /etc/nvidia/gridd.conf
  - echo "SchedulingPolicy=1" >> /etc/nvidia/gridd.conf
  - wget http://<YourIP>:8080/vgpu/client_configuration_token_03-07-2025-14-51-05.tok -O /etc/nvidia/ClientConfigToken/client_configuration_token_03-07-2025-14-51-05.tok
  - service nvidia-gridd restart
  - service nvidia-gridd status
```
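Once a VM boots from that image, a quick way to sanity-check it (a minimal sketch; the exact output wording depends on your vGPU software version) is to confirm the GRID driver loaded and the license was actually pulled via the client configuration token:
```bash
# Run inside the guest after cloud-init has finished.
nvidia-smi                        # the vGPU device and driver version should be listed
nvidia-smi -q | grep -i license   # license status should eventually show as licensed
systemctl status nvidia-gridd     # the licensing daemon should be active (running)
```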
Then this was the GPU Operator install I ran on the cluster afterwards.
```bash
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector."nvidia\.com/gpu"=true \
  --set toolkit.enabled=false \
  --set driver.enabled=false \
  --set licensing.enabled=false \
  --set runtime.enabled=true \
  --set runtime.defaultRuntime=nvidia \
  --set dcgmExporter.enabled=true \
  --set gpuFeatureDiscovery.enabled=true \
  --set gpuSharing.enabled=true \
  --set gpuSharing.slices=2
```
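One caveat: depending on the chart version, the gpuSharing values may not be what actually enables slicing. In the GPU Operator docs I've followed, time-slicing is configured through a device plugin ConfigMap instead; here's a rough sketch of that route (the ConfigMap name time-slicing-config, the key any, and replicas: 2 are placeholders to adapt):
```bash
# Create the time-slicing config in the operator's namespace.
cat <<'EOF' | kubectl apply -n gpu-operator -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
EOF
# Point the operator's device plugin at it, so each physical GPU is
# advertised as 2 schedulable nvidia.com/gpu resources.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
```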
Then this was my test deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-smi-test
  namespace: gpu-operator
  labels:
    app: nvidia-smi-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-smi-test
  template:
    metadata:
      labels:
        app: nvidia-smi-test
      annotations:
        nvidia.com/gpu.deploy.container.runtime: "nvidia"  # Optional: helps in selecting the runtime
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:12.4.0-runtime-ubuntu20.04
        command: ["sleep", "infinity"]  # Keeps the container running
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
      restartPolicy: Always
```
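Once it's rolled out, you can verify from both sides (assuming the names above; <gpu-node> is whatever your GPU node is called):
```bash
# Confirm the pod sees a GPU through the nvidia runtime and device plugin.
kubectl -n gpu-operator rollout status deployment/nvidia-smi-test
kubectl -n gpu-operator exec deploy/nvidia-smi-test -- nvidia-smi
# Node-side view: how many nvidia.com/gpu resources are advertised.
kubectl describe node <gpu-node> | grep -i -A6 'nvidia.com/gpu'
```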
m
Thank you. We are still in the process of deploying the infra, not yet at the stage of actually trying to do anything with the GPUs. They are H100s, and we successfully passed one through to a VM directly on Harvester, but of course we would like to be able to use them as resources we can allocate/split as desired between workloads in a cluster later.