# k3s
a
c
yes, that's how it works
you specify the runtimeClassName in the pod spec. That’s covered in the docs.
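For reference, a minimal sketch of what that looks like, assuming a RuntimeClass named `nvidia` exists (as in the k3s docs); the pod name and image tag below are only illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test                 # illustrative name
spec:
  runtimeClassName: nvidia              # run the container under the nvidia containerd runtime
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image/tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1             # needs the device plugin registered on the node
```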
i
Okay, I'll dig again later. I was getting an error when the DaemonSet for the NVIDIA plugin was attempting to start, something about a cgroup mount not being found. Thanks for the link, I'll read it.
I'm unsure what the issue is.
I'm on archlinux, but I believe I have all the packages installed.
As mentioned in the screenshot, I checked this directory thinking I might see an nvidia file, but I do not:
/var/lib/kubelet/device-plugins
Should I expect to see an nvidia socket there also?
Containerd seems to be working when used manually.
So the detection worked, but I used the toml.tmpl file from this blog: https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
c
don’t do that. just use the stock k3s containerd config
just follow the k3s docs, and use the different runtime classes
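For anyone following along, the RuntimeClass described in the k3s docs is just this; the handler name `nvidia` is the runtime k3s registers in its generated containerd config when it detects the NVIDIA Container Toolkit:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # referenced by runtimeClassName in pod specs
handler: nvidia     # containerd runtime name that k3s auto-configures
```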
i
Okay, good advice: the rest of the pods that are not NVIDIA come up now. I'm having an issue with the device plugin DaemonSet though:
```
❯ k -n kube-system logs nvidia-device-plugin-sw9j5
I0602 16:37:22.243956       1 main.go:154] Starting FS watcher.
I0602 16:37:22.244038       1 main.go:161] Starting OS watcher.
I0602 16:37:22.244732       1 main.go:176] Starting Plugins.
I0602 16:37:22.244747       1 main.go:234] Loading configuration.
I0602 16:37:22.244857       1 main.go:242] Updating config with default resource matching patterns.
I0602 16:37:22.245032       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "<http://cdi.k8s.io/|cdi.k8s.io/>",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "<http://nvidia.com/gpu|nvidia.com/gpu>"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0602 16:37:22.245041       1 main.go:256] Retreiving plugins.
W0602 16:37:22.245351       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0602 16:37:22.245407       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0602 16:37:22.245436       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0602 16:37:22.245444       1 factory.go:115] Incompatible platform detected
E0602 16:37:22.245449       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0602 16:37:22.245454       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0602 16:37:22.245460       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0602 16:37:22.245466       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0602 16:37:22.256648       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
```
The `libnvidia-ml.so.1` part is interesting, but I think that's referring to the container; the host has this library.
That library is required to run the workloads, correct? If so, it seems odd that the container running the plugin doesn't have the NVML discovery lib.
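One likely explanation, as a sketch rather than a confirmed fix: `libnvidia-ml.so.1` is injected from the host by the NVIDIA Container Toolkit, and that only happens when the container runs under the nvidia runtime. With the stock k3s containerd config the default runtime is still runc, so the device plugin DaemonSet itself also needs `runtimeClassName: nvidia` in its pod template, something like:

```yaml
# sketch: only the relevant part of the nvidia-device-plugin DaemonSet
spec:
  template:
    spec:
      runtimeClassName: nvidia   # let the toolkit mount the host's libnvidia-ml.so.1 into the plugin container
```

If you install the plugin via its Helm chart, I believe there is a `runtimeClassName` value that sets this same field.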
I get `Insufficient nvidia.com/gpu` when scheduling the NVIDIA GPU benchmark.
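That event should simply mean no node is advertising allocatable `nvidia.com/gpu` yet, which is expected while the plugin pod above is failing. Once the plugin registers, the scheduler can satisfy a per-container request like this fragment (sketch only):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1   # matched against the node's allocatable nvidia.com/gpu
```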