# k3s
a
c
yes, that's how it works
you specify the runtimeClassName in the pod spec. That’s covered in the docs.
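For reference, a minimal sketch of what that looks like, assuming a RuntimeClass named `nvidia` exists (as in the k3s docs); the pod name and image tag below are only illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test                 # illustrative name
spec:
  runtimeClassName: nvidia              # run the container under the nvidia containerd runtime
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image/tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1             # needs the device plugin registered on the node
```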
i
Okay, I'll dig again later. I was getting an error when the DaemonSet for the NVIDIA plugin was attempting to start, something about a cgroup mount not being found. Thanks for the link, I'll read it.
I'm unsure what the issue is.
I'm on archlinux, but I believe I have all the packages installed.
As mentioned in the screenshot, I checked this directory thinking I might see an nvidia file, but I do not:
/var/lib/kubelet/device-plugins
Should I expect to see an nvidia socket there also?
Containerd seems to be working when used manually.
So the detection worked, but I used the toml.tmpl file from this blog: https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
c
don’t do that. just use the stock k3s containerd config
just follow the k3s docs, and use the different runtime classes
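For anyone following along, the RuntimeClass described in the k3s docs is just this; the handler name `nvidia` is the runtime k3s registers in its generated containerd config when it detects the NVIDIA Container Toolkit:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # referenced by runtimeClassName in pod specs
handler: nvidia     # containerd runtime name that k3s auto-configures
```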
i
Okay, good advice: the rest of the pods that are not NVIDIA come up now. I'm having an issue with the device plugin DaemonSet though:
```
❯ k -n kube-system logs nvidia-device-plugin-sw9j5
I0602 16:37:22.243956       1 main.go:154] Starting FS watcher.
I0602 16:37:22.244038       1 main.go:161] Starting OS watcher.
I0602 16:37:22.244732       1 main.go:176] Starting Plugins.
I0602 16:37:22.244747       1 main.go:234] Loading configuration.
I0602 16:37:22.244857       1 main.go:242] Updating config with default resource matching patterns.
I0602 16:37:22.245032       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "<http://cdi.k8s.io/|cdi.k8s.io/>",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "<http://nvidia.com/gpu|nvidia.com/gpu>"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0602 16:37:22.245041       1 main.go:256] Retreiving plugins.
W0602 16:37:22.245351       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0602 16:37:22.245407       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0602 16:37:22.245436       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0602 16:37:22.245444       1 factory.go:115] Incompatible platform detected
E0602 16:37:22.245449       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0602 16:37:22.245454       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0602 16:37:22.245460       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0602 16:37:22.245466       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0602 16:37:22.256648       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
```
The `libnvidia-ml.so.1` part is interesting, but I think that's referring to the container; the host has this library.
That library is required to run the workloads, correct? If so, it seems odd that the container running the plugin doesn't have the NVML discovery lib.
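One likely explanation, as a sketch rather than a confirmed fix: `libnvidia-ml.so.1` is injected from the host by the NVIDIA Container Toolkit, and that only happens when the container runs under the nvidia runtime. With the stock k3s containerd config the default runtime is still runc, so the device plugin DaemonSet itself also needs `runtimeClassName: nvidia` in its pod template, something like:

```yaml
# sketch: only the relevant part of the nvidia-device-plugin DaemonSet
spec:
  template:
    spec:
      runtimeClassName: nvidia   # let the toolkit mount the host's libnvidia-ml.so.1 into the plugin container
```

If you install the plugin via its Helm chart, I believe there is a `runtimeClassName` value that sets this same field.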
I get `Insufficient nvidia.com/gpu` when scheduling the NVIDIA GPU benchmark.
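That event should simply mean no node is advertising allocatable `nvidia.com/gpu` yet, which is expected while the plugin pod above is failing. Once the plugin registers, the scheduler can satisfy a per-container request like this fragment (sketch only):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1   # matched against the node's allocatable nvidia.com/gpu
```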