# harvester
n
πŸŽ‰ I'd like to announce, after many battles and long nights over the past months (on and off), that I finally have an 8-node Harvester setup with vGPU nodes being propagated to clusters via Rancher's Cluster Management. πŸŽ‰ Great support from the community, especially @great-bear-19718. I still have some final refinements coming in v1.5.1 for the vGPU support, since I have an A5000 ADA that acts just a bit differently than other cards for vGPU, and in v1.6.0 I hope to have the libvirt issues resolved, as I have some NUMA issues on 2 nodes that have a Threadripper 2990WX 32-core CPU. But for now the small fixes have allowed things to work as intended.
m
Congratulations! Mind sharing some insights? We are currently trying to get a single Harvester host with 2 GPUs running, with the goal of being able to slice a GPU at the Kubernetes layer. Do you have a blog post maybe? πŸ™‚
n
What's the GPU? With the one I have, vGPU was the only option. Something that might help you: I created this cloud-init script to prepare the cloud image for the GPU. You would need to replace the locations that grab the driver and the license token.
```yaml
### rke2-gpu:
#cloud-config
package_update: true
packages:
  - qemu-guest-agent
  - iptables
  - build-essential
runcmd:
  - - systemctl
    - enable
    - '--now'
    - qemu-guest-agent.service
  - wget http://<YourIP>:8080/vgpu/NVIDIA-Linux-x86_64-550.90.07-grid.run -O NVIDIA-Linux-x86_64-550.90.07-grid.run
  - chmod +x NVIDIA-Linux-x86_64-550.90.07-grid.run
  - apt update
  - apt install -y iptables build-essential
  - ./NVIDIA-Linux-x86_64-550.90.07-grid.run --silent
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=containerd
  - cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
  - sed -i 's/FeatureType=0/FeatureType=2/' /etc/nvidia/gridd.conf
  - echo "GpuVirtualization=1" >> /etc/nvidia/gridd.conf
  - echo "SchedulingPolicy=1" >> /etc/nvidia/gridd.conf
  - wget http://<YourIP>:8080/vgpu/client_configuration_token_03-07-2025-14-51-05.tok -O /etc/nvidia/ClientConfigToken/client_configuration_token_03-07-2025-14-51-05.tok
  - service nvidia-gridd restart
  - service nvidia-gridd status
```
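Once a VM boots from that image, a quick way to sanity-check it (a minimal sketch; the exact output wording depends on your vGPU software version) is to confirm the GRID driver loaded and the license was actually pulled via the client configuration token:
```bash
# Run inside the guest after cloud-init has finished.
nvidia-smi                        # the vGPU device and driver version should be listed
nvidia-smi -q | grep -i license   # license status should eventually show as licensed
systemctl status nvidia-gridd     # the licensing daemon should be active (running)
```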
Then this was the GPU Operator install I ran on the cluster afterwards.
```bash
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector."nvidia\.com/gpu"=true \
  --set toolkit.enabled=false \
  --set driver.enabled=false \
  --set licensing.enabled=false \
  --set runtime.enabled=true \
  --set runtime.defaultRuntime=nvidia \
  --set dcgmExporter.enabled=true \
  --set gpuFeatureDiscovery.enabled=true \
  --set gpuSharing.enabled=true \
  --set gpuSharing.slices=2
```
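One caveat: depending on the chart version, the gpuSharing values may not be what actually enables slicing. In the GPU Operator docs I've followed, time-slicing is configured through a device plugin ConfigMap instead; here's a rough sketch of that route (the ConfigMap name time-slicing-config, the key any, and replicas: 2 are placeholders to adapt):
```bash
# Create the time-slicing config in the operator's namespace.
cat <<'EOF' | kubectl apply -n gpu-operator -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
EOF
# Point the operator's device plugin at it, so each physical GPU is
# advertised as 2 schedulable nvidia.com/gpu resources.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
```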
Then this was my test deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-smi-test
  namespace: gpu-operator
  labels:
    app: nvidia-smi-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nvidia-smi-test
  template:
    metadata:
      labels:
        app: nvidia-smi-test
      annotations:
        nvidia.com/gpu.deploy.container.runtime: "nvidia"  # Optional: helps in selecting the runtime
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:12.4.0-runtime-ubuntu20.04
        command: ["sleep", "infinity"]  # Keeps the container running
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
      restartPolicy: Always
```
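Once it's rolled out, you can verify from both sides (assuming the names above; <gpu-node> is whatever your GPU node is called):
```bash
# Confirm the pod sees a GPU through the nvidia runtime and device plugin.
kubectl -n gpu-operator rollout status deployment/nvidia-smi-test
kubectl -n gpu-operator exec deploy/nvidia-smi-test -- nvidia-smi
# Node-side view: how many nvidia.com/gpu resources are advertised.
kubectl describe node <gpu-node> | grep -i -A6 'nvidia.com/gpu'
```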
m
Thank you. We are still in the process of deploying the infra, not yet at the stage of actually trying to do anything with the GPUs. They are H100s, and we successfully passed one through to a VM directly on Harvester, but of course we would like to be able to use them as resources we can allocate/split as desired between workloads in a cluster later.