future-address-23425

09/14/2022, 6:09 PM
I was trying to enable the vfio-pci driver through the Harvester configuration, using this:
os:
  modules:
  - vfio-pci
with no luck. I made it work with:
os:
  write_files:
  - content: |
      vfio-pci
    path: /etc/modules-load.d/vfio-pci.conf
Any thoughts, am I missing something? Shouldn't the first one work (i.e., do the same thing)? I believe vfio-pci should be enabled by default anyway.

limited-breakfast-50094

09/14/2022, 7:01 PM
I enable it through my controller; I'll send a link
Applying these two manifests should pull in the controller: https://github.com/harvester/pcidevices/tree/master/manifests
I haven't helm-ified it yet

future-address-23425

09/14/2022, 7:10 PM
We'll evaluate this for sure, thanks. I have some questions already, for example how does this advertise the resources to kv/kubevirt, and how does it trigger a controller update when the driver of a PCI device changes (to/from vfio-pci)?
We're trying this at the moment; how do I deploy the front-end components, too? The controller looks fine so far.
Why is this so restrictive? Some GPUs are announced as a 3D controller and are thus excluded from your list. Also, since the u-root project doesn't use a PCI DB like https://pci-ids.ucw.cz/v2.2/pci.ids.gz, I would suggest filtering devices by their "PCI.Class" and not their "PCI.ClassName", which is not accurate (for example, https://pci-ids.ucw.cz/read/PD/12/00, which mostly represents FPGA devices, is considered an unknown class ID by the u-root project).

limited-breakfast-50094

09/15/2022, 9:52 PM
Good call, I will change the filter to be less restrictive. I'll find out what hardware our customers intend to use and make sure it isn't filtered out

future-address-23425

09/15/2022, 9:57 PM
If I may suggest (at least) the following 2 categories, classes whose ID starts with:
• 03 - Display controllers (GPUs)
• 12 - Processing accelerators (FPGAs)
Of course, you can also include audio controllers and whatever else your use cases target.
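To be concrete, filtering on the class code prefix rather than the ClassName could look something like this rough sketch (the allow-list and class codes below are only examples, not what the controller actually uses):
package main

import (
    "fmt"
    "strings"
)

// allowedClassPrefixes holds the PCI base-class codes (hex) to keep:
// 03 = Display controllers (GPUs), 12 = Processing accelerators (FPGAs).
var allowedClassPrefixes = []string{"03", "12"}

// passthroughCandidate reports whether a device with the given 6-digit hex
// class code (e.g. "030200" for a 3D controller) should be offered for passthrough.
func passthroughCandidate(classCode string) bool {
    for _, prefix := range allowedClassPrefixes {
        if strings.HasPrefix(classCode, prefix) {
            return true
        }
    }
    return false
}

func main() {
    for _, class := range []string{"030000", "030200", "120000", "020000"} {
        fmt.Println(class, passthroughCandidate(class))
    }
}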

limited-breakfast-50094

09/15/2022, 9:59 PM
12 in dec or in hex?
I'm looking at the A100 PDF right now and it lists its device class as 0x03 "Display Controller"

future-address-23425

09/15/2022, 10:00 PM
hex
yes, this is the latest pci.ids DB; u-root unfortunately embeds a much older version, which is unaware of many devices, vendors, and classes.

limited-breakfast-50094

09/15/2022, 10:03 PM
Yeah, and when I use u-root to get the Class, it gives some weird results (that Class is the raw class code as a decimal integer, 196608 == 0x030000):
PCI Device: GP106 [GeForce GTX 1060 3GB]
        Class: 196608
        ClassName: DisplayVGA
        Address: 0000:04:00.0
        VendorId: 4318
        DeviceId: 7170
        ExtraInfo: []
I could probably scrape it out of lspci, which is how I've been detecting the driver
So we are in crunch time; I will remove the filter and just let all the PCI devices show up. Our UI has some nice filtering, so it shouldn't be too overwhelming

future-address-23425

09/15/2022, 10:09 PM
Inside your alpine container, since you install the pciutils package, you will find the pci.ids DB under /usr/share/hwdata/. There are libraries like jaypipes/pcidb that can read it.
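For example, a minimal lookup with jaypipes/pcidb could look like this sketch (the vendor and class IDs are just examples):
package main

import (
    "fmt"

    "github.com/jaypipes/pcidb"
)

func main() {
    // Load the pci.ids database; pcidb looks in the standard locations,
    // e.g. /usr/share/hwdata/pci.ids.
    db, err := pcidb.New()
    if err != nil {
        panic(err)
    }

    // Vendor name from its hex ID (0x10de is NVIDIA).
    if vendor, ok := db.Vendors["10de"]; ok {
        fmt.Println("vendor:", vendor.Name)
    }

    // Class name from its hex ID ("03" = Display controller, "12" = Processing accelerators).
    if class, ok := db.Classes["03"]; ok {
        fmt.Println("class:", class.Name)
    }
}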

limited-breakfast-50094

09/15/2022, 10:09 PM
oh rad!

future-address-23425

09/15/2022, 10:13 PM
I would suggest moving away from executing "lspci" and using such libraries instead. That would make it possible to build distroless containers containing just a static binary.
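For instance, the class code and the currently bound driver can be read straight from sysfs, which is roughly the information scraped from lspci; a rough sketch:
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    // Every PCI function shows up under /sys/bus/pci/devices/<address>/.
    devices, _ := filepath.Glob("/sys/bus/pci/devices/*")
    for _, dev := range devices {
        // "class" holds the 24-bit class code, e.g. 0x030000 for a VGA controller.
        classBytes, err := os.ReadFile(filepath.Join(dev, "class"))
        if err != nil {
            continue
        }
        class := strings.TrimSpace(string(classBytes))

        // "driver" is a symlink to the currently bound driver, if any.
        driver := "(none)"
        if target, err := os.Readlink(filepath.Join(dev, "driver")); err == nil {
            driver = filepath.Base(target)
        }

        fmt.Printf("%s class=%s driver=%s\n", filepath.Base(dev), class, driver)
    }
}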

limited-breakfast-50094

09/15/2022, 10:14 PM
That is a good suggestion; at the moment we are overdue for our 1.1.0 release, but I'll make an issue to track that

future-address-23425

09/15/2022, 10:16 PM
Also have you noticed that inside containers lspci never outputs the list of "Kernel modules"? It must be a libkmod bug or something. But it means that your kernelModules field will never be populated with values. I suppose it's just informational though, right?

limited-breakfast-50094

09/15/2022, 10:16 PM
I did notice that. It hasn't been an issue; right now it's all about the happy path, and I haven't needed the list of kernel modules for that yet
✔️ 1

future-address-23425

09/15/2022, 10:21 PM
I guess it won't ever be needed.

limited-breakfast-50094

09/15/2022, 10:22 PM
Yeah, so far I treat it like a stack of size one. When I make a PDC, I push vfio-pci on, and when I delete the PDC, I pop it back off and rebind the driver that was in use before.
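Roughly, the bind step is the standard sysfs driver_override dance; a sketch of it (placeholder address and minimal error handling, not the actual controller code):
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// bindToVfio rebinds one PCI function to vfio-pci via sysfs driver_override.
// addr is a full PCI address such as "0000:04:00.0" (a placeholder here).
func bindToVfio(addr string) error {
    devDir := filepath.Join("/sys/bus/pci/devices", addr)

    // Remember the currently bound driver so it can be "popped" back later.
    prev := ""
    if target, err := os.Readlink(filepath.Join(devDir, "driver")); err == nil {
        prev = filepath.Base(target)
        // Unbind from the current driver.
        if err := os.WriteFile(filepath.Join(devDir, "driver", "unbind"), []byte(addr), 0200); err != nil {
            return err
        }
    }

    // Force the next probe to pick vfio-pci, then ask the PCI bus to reprobe.
    if err := os.WriteFile(filepath.Join(devDir, "driver_override"), []byte("vfio-pci"), 0644); err != nil {
        return err
    }
    if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(addr), 0200); err != nil {
        return err
    }

    fmt.Printf("rebound %s: %s -> vfio-pci\n", addr, prev)
    return nil
}

func main() {
    if err := bindToVfio("0000:04:00.0"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}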
👌🏼 1

future-address-23425

09/15/2022, 10:23 PM
Is there a dev harvester/rancher container image I can use to evaluate the UI changes, too?

limited-breakfast-50094

09/15/2022, 10:24 PM
not yet in a container but I can show you how:
there are docker instructions in the readme

future-address-23425

09/15/2022, 10:24 PM
Would it make sense to implement that functionality as part of a KubeVirt device plugin, instead of adding the notion of claims (which, from what I have seen so far, look node-bound)?
I mean, with this approach doesn't the user need to select a specific device on a specific node?

limited-breakfast-50094

09/15/2022, 10:26 PM
Yes, it is per node right now
To implement migrations between nodes, I'm going to have to modify the claim in a transaction
then the controller will reconcile it by unbinding the old node's device and then binding the new node's device to vfio-pci
I haven't designed that yet, and I'm open to the DevicePlugin approach

future-address-23425

09/15/2022, 10:29 PM
If you could add this driver override process to the lifecycle of a device plugin for KubeVirt, users would simply have to request a quantity of each resource and let k8s schedule everything!

limited-breakfast-50094

09/15/2022, 10:30 PM
Want to pair on this design after this release?
I'll look into that after we get this shipped

future-address-23425

09/15/2022, 10:31 PM
I can further elaborate on this proposal if you wish!

limited-breakfast-50094

09/15/2022, 10:32 PM
Totally, let's write up a HEP (Harvester Enhancement Proposal)
👌🏼 1
Right now the UI lets the user easily claim all of their devices at once, so once those claims are made, kubevirt does do all the VM scheduling
I use the vendorId:deviceId pair, which is associated with a resourceName, and then KubeVirt looks at the Node.Status.Allocatable for that resourceName
so if I had a datacenter with 100 nodes, each with an NVIDIA A100 in it, I could click once, claim all 100, then attach the device to the VM, and KubeVirt already does the automatic scheduling. The thing is, if for any reason the user only claims 80 of the devices, then KubeVirt will ignore the other 20
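For reference, checking what each node advertises for a resourceName from Go looks roughly like this with client-go (the resourceName below is just an example, not the exact one the controller registers):
package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // In-cluster config; use clientcmd for out-of-cluster access.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Example resource name only.
    resourceName := corev1.ResourceName("nvidia.com/GA100_A100")

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, node := range nodes.Items {
        if qty, ok := node.Status.Allocatable[resourceName]; ok {
            fmt.Printf("%s: %s allocatable=%s\n", node.Name, resourceName, qty.String())
        }
    }
}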

future-address-23425

09/15/2022, 10:38 PM
I guess I need to try the UI! 😁

limited-breakfast-50094

09/15/2022, 10:40 PM
the way I've been testing it is to yarn install that dashboard repo I sent you, then API=$HARVESTER_MGMT_IP yarn dev, and it will spin up a server on port 8005

future-address-23425

09/15/2022, 10:46 PM
Thanks, I had already found that branch from a PR, and I was wondering if you had a pipeline to build it and embed it in the rancher-harvester image. But it's ok, I'll try it locally.
Also, how does this device selection look in the UI when you spawn a k8s cluster on Harvester? Is this feature going to be included in 1.1.0?
Is there a plan to promote Harvester VM templates to machine pool "flavors"? Is there a different approach on the table?

limited-breakfast-50094

09/15/2022, 11:07 PM
Yes on 1.1.0; there's no special treatment for guest k8s clusters. The way it would work is that PCI passthrough can be achieved on the host, then the guest cluster can use a DevicePlugin or something to use the GPU, but it depends on the guest cluster configuration and what the goals are
Update on the UI:
RANCHER_ENV=harvester API=192.168.1.147 yarn dev
That RANCHER_ENV=harvester has to be set now

future-address-23425

09/15/2022, 11:15 PM
Let me rephrase. I was wondering how users will select the PCI devices of their k8s machine pools through the Rancher "Create cluster" UI. I guess similarly to how it will be done in the "Create Virtual Machine" UI in Harvester. Another option would have been to allow them to select a predefined (in Harvester) VM template for their k8s machine pools.

limited-breakfast-50094

09/16/2022, 4:21 PM
That would be a good idea
I'm using jaypipes/pcidb, so far it's great
👌🏼 1

future-address-23425

09/16/2022, 9:20 PM
How is the device claim username enforced? KubeVirt doesn't know about it and can still schedule VMIs on other people's devices.
How is KubeVirt's resourceName embedded in all this?

limited-breakfast-50094

09/16/2022, 9:22 PM
KubeVirt's resourceName needs to be a URL, so I am fixing that, but I am not using DevicePlugins so it's purely to get KubeVirt to work
About the userName, I am going to use RBAC for enforcement with claims

future-address-23425

09/16/2022, 9:24 PM
Your PCIDeviceClaim should probably have a required parameter resourceName, and you could use it to serve your own device plugin. I can help you with that. I could also help with the jaypipes/pcidb migration, as well as migrating modprobe/lsmod to native Go code.
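For the lsmod side, native Go is basically just parsing /proc/modules (the same source lsmod reads); a quick sketch, with an illustrative function name:
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// loadedModules parses /proc/modules and returns the names of all
// currently loaded kernel modules.
func loadedModules() ([]string, error) {
    f, err := os.Open("/proc/modules")
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var names []string
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Each line starts with the module name, e.g.
        // "vfio_pci 61440 0 - Live 0x0000000000000000"
        fields := strings.Fields(scanner.Text())
        if len(fields) > 0 {
            names = append(names, fields[0])
        }
    }
    return names, scanner.Err()
}

func main() {
    mods, err := loadedModules()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, m := range mods {
        fmt.Println(m)
    }
}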

limited-breakfast-50094

09/16/2022, 9:26 PM
These are good things to do in a later release; I already have pcidb pulled in, but I don't have the time to re-architect this as a device plugin

future-address-23425

09/16/2022, 9:27 PM
RBAC won't make any sense if you let KubeVirt serve the device plugins. Anyone will be able to edit the VMI YAMLs and request host devices directly from KubeVirt.
Cool, I can check if I can quickly put together a prototype of your daemonset acting as a KubeVirt device plugin driven by your claim controller.
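To sketch what I mean, a bare-bones plugin against the kubelet device plugin v1beta1 API could look roughly like this (resource name, socket name, device IDs, and the /dev/vfio path are all placeholders; the real thing would be driven by your claim controller):
package main

import (
    "context"
    "net"
    "os"
    "path/filepath"

    "google.golang.org/grpc"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// Placeholder values; the real plugin would derive these from the
// PCIDeviceClaim objects reconciled by the controller.
const (
    resourceName = "example.com/nvidia-a100"
    socketName   = "example-pcidevices.sock"
)

// pciDevicePlugin advertises a fixed set of device IDs and maps each
// allocation to a VFIO group device node.
type pciDevicePlugin struct {
    deviceIDs []string // e.g. PCI addresses already bound to vfio-pci
}

func (p *pciDevicePlugin) GetDevicePluginOptions(ctx context.Context, _ *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
    return &pluginapi.DevicePluginOptions{}, nil
}

func (p *pciDevicePlugin) GetPreferredAllocation(ctx context.Context, _ *pluginapi.PreferredAllocationRequest) (*pluginapi.PreferredAllocationResponse, error) {
    return &pluginapi.PreferredAllocationResponse{}, nil
}

func (p *pciDevicePlugin) PreStartContainer(ctx context.Context, _ *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
    return &pluginapi.PreStartContainerResponse{}, nil
}

func (p *pciDevicePlugin) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
    devs := []*pluginapi.Device{}
    for _, id := range p.deviceIDs {
        devs = append(devs, &pluginapi.Device{ID: id, Health: pluginapi.Healthy})
    }
    if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
        return err
    }
    select {} // a real plugin would resend when claims or device health change
}

func (p *pciDevicePlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    resp := &pluginapi.AllocateResponse{}
    for _, creq := range req.ContainerRequests {
        cresp := &pluginapi.ContainerAllocateResponse{}
        for range creq.DevicesIDs {
            // A real implementation would look up the IOMMU group of the
            // claimed PCI address; "/dev/vfio/42" is a placeholder.
            cresp.Devices = append(cresp.Devices, &pluginapi.DeviceSpec{
                HostPath:      "/dev/vfio/42",
                ContainerPath: "/dev/vfio/42",
                Permissions:   "rw",
            })
        }
        resp.ContainerResponses = append(resp.ContainerResponses, cresp)
    }
    return resp, nil
}

func main() {
    plugin := &pciDevicePlugin{deviceIDs: []string{"0000:04:00.0"}}

    // Serve the plugin on a unix socket in the kubelet plugin directory.
    socketPath := filepath.Join(pluginapi.DevicePluginPath, socketName)
    os.Remove(socketPath)
    lis, err := net.Listen("unix", socketPath)
    if err != nil {
        panic(err)
    }
    server := grpc.NewServer()
    pluginapi.RegisterDevicePluginServer(server, plugin)
    go server.Serve(lis)

    // Register the resourceName with the kubelet.
    conn, err := grpc.Dial("unix://"+pluginapi.KubeletSocket, grpc.WithInsecure())
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    if _, err := pluginapi.NewRegistrationClient(conn).Register(context.Background(), &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,
        Endpoint:     socketName,
        ResourceName: resourceName,
    }); err != nil {
        panic(err)
    }
    select {} // keep serving
}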

limited-breakfast-50094

09/16/2022, 9:32 PM
Awesome!