# harvester
b
Has anyone been able to add more than 1 GPU via the PCI passthrough? We've been trying to add two cards to a VM and nvidia-smi only ever recognizes and loads 1 of them.
dmesg from AlmaLinux:
```
[Mon Aug 25 21:07:13 2025] Loaded X.509 cert 'Nvidia GPU OOT signing 001: 55e1cef88193e60419f0b0ec379c49f77545acf0'
[Mon Aug 25 21:07:13 2025] Loaded X.509 cert 'AlmaLinux NVIDIA Module Signing: 4a47544e27f990e063664d7d0dd9c2158954567a'
[Mon Aug 25 21:07:17 2025] nvidia: loading out-of-tree module taints kernel.
[Mon Aug 25 21:07:17 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[Mon Aug 25 21:07:17 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 21:07:17 2025] nvidia: probe of 0000:0a:00.0 failed with error -1
[Mon Aug 25 21:07:17 2025] NVRM: The NVIDIA probe routine failed for 1 device(s).
[Mon Aug 25 21:07:17 2025] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  580.65.06  Release Build  (mockbuild@)  Tue Aug  5 16:02:49 EDT 2025
[Mon Aug 25 21:07:17 2025] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  580.65.06  Release Build  (mockbuild@)  Tue Aug  5 16:02:38 EDT 2025
[Mon Aug 25 21:07:17 2025] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[Mon Aug 25 21:07:19 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:0b:00.0 on minor 1
[Mon Aug 25 21:07:19 2025] nvidia 0000:0b:00.0: [drm] No compatible format found
[Mon Aug 25 21:07:19 2025] nvidia 0000:0b:00.0: [drm] Cannot find any crtc or sizes
```
from Ubuntu:
```
# dmesg -T | egrep -i 'nvidia|nvrm|xid'
[Mon Aug 25 20:50:31 2025] nvidia: loading out-of-tree module taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia: module license 'NVIDIA' taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia: module license taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR2 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR5 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  575.57.08  Sat May 24 07:21:16 UTC 2025
[Mon Aug 25 20:50:31 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  575.57.08  Sat May 24 06:52:56 UTC 2025
[Mon Aug 25 20:50:31 2025] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[Mon Aug 25 20:50:31 2025] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 0
[Mon Aug 25 20:50:31 2025] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[Mon Aug 25 20:50:31 2025] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 1
[Mon Aug 25 20:50:32 2025] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[Mon Aug 25 20:51:18 2025] caller os_map_kernel_space+0x120/0x130 [nvidia] mapping multiple BARs
[Mon Aug 25 20:51:18 2025] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x24:0x72:1589)
[Mon Aug 25 20:51:18 2025] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
```
But `nvidia-smi -L` will only ever list one at a time. If you only add 1 of the 2, it will always work and show up. But with two we always see the "PCI I/O region is invalid" messages or "Cannot find any crtc or sizes".
Both cards have
Interrupt: pin A routed to IRQ 23
is there a way to unset that or change it?
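A quick way to check whether that shared pin actually matters (a sketch, assuming lspci is available in the guest): the "pin A routed to IRQ 23" line is only the legacy INTx routing, which is shareable by design, and the NVIDIA driver normally switches the device over to MSI once it binds.
```
# From inside the guest: show both the legacy INTx routing and the MSI state.
# If the MSI capability reports "Enable+", the card is not using the shared
# IRQ 23 line at all and the duplicate pin is likely a red herring.
sudo lspci -d 10de: -vvv | grep -E 'Interrupt:|MSI'
```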
t
I have been able to get 1 card working. I have not tested two. Did you add the audio device for the cards as well?
b
Nope, they were getting used for compute workloads.
t
I heard that when adding gpus you need to add the audio device from the cards as well.
maybe worth a shot?
b
Does it show as a different PCIE device?
t
yes

https://youtu.be/RgW_uB6dOJ0?t=274
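For reference: on cards that have HDMI audio, it shows up as a separate PCI function right next to the GPU (typically .1), so it is easy to check from the Harvester host (a sketch, assuming shell access and lspci on the node):
```
# On the host: list every NVIDIA function with numeric IDs. A GPU with HDMI
# audio appears as two functions, e.g. 0a:00.0 (VGA/3D) and 0a:00.1 (Audio).
# Compute-only cards such as the A30 expose just the .0 function.
lspci -nn -d 10de:
```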

b
I'm not sure the cards we have have that. I'm filtering by the Vendor ID and only the 3D Controllers populate
t
ah ok. I don’t think the Teslas have HDMI audio. I had an A2000 which did.
Are all those cards on the same host? you may be trying to add cards from 2 different hosts.
b
100% on the same host
Well, the ones I'm trying to add
The VM boots and lspci shows both devices.
t
OH.. alma and the driver don’t see both?
b
On both ubuntu and alma, the host sees two devices with lspci, but the driver fails to get both working.
```
0a:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
        Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
        Physical Slot: 0-10
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 3: Memory at 39080000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nvidia_drm, nvidia

0b:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
        Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
        Physical Slot: 0-11
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 38000000000 (64-bit, prefetchable) [size=32G]
        Region 3: Memory at 38800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nvidia_drm, nvidia
```
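Worth pointing out in that output: 0a:00.0 only has Region 0 and Region 3, i.e. its big 32G BAR (Region 1) never got an address, which lines up with the "BAR1 is 0M @ 0x0" error, while 0b:00.0 did get its 32G window. Two checks that make this visible (a sketch, assuming root in the guest):
```
# Confirm which BARs were actually assigned to each GPU; the failing card is
# missing its large 64-bit prefetchable region.
sudo lspci -d 10de: -vv | grep -E 'controller|Region'

# Look for the kernel failing to place that BAR at boot time; this is what
# later surfaces as NVRM's "BAR1 is 0M @ 0x0".
sudo dmesg | grep -iE 'no space|failed to assign|BAR 1'
```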
t
does lspci from the vm see both?
b
yes
That text block above is from both devices on Ubuntu
t
sounds like a kernel/driver issue and not harvester.
in my testing I used rocky 9 from the nvidia repo : https://github.com/clemenko/harvester_ollama
b
I think the Interrupt pin is set at the hypervisor level.
Like the host level has them all routed to different IRQs.
```
gpu3:~ # lspci -d 10de: -vvv |grep IRQ
        Interrupt: pin A routed to IRQ 1209
        Interrupt: pin A routed to IRQ 1210
        Interrupt: pin A routed to IRQ 1217
        Interrupt: pin A routed to IRQ 1218
```
```
ubuntu@gpu3test:~$ lspci -d 10de: -vvv | grep IRQ
        Interrupt: pin A routed to IRQ 23
        Interrupt: pin A routed to IRQ 23
```
```
[root@alma9gpu3test ~]# lspci -d 10DE: -vvv  |grep IRQ
        Interrupt: pin A routed to IRQ 23
        Interrupt: pin A routed to IRQ 23
```
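Another data point worth comparing (a sketch, assuming root on both sides): /proc/interrupts shows what is actually wired up, since the lspci line only reflects the legacy INTx pin and not the MSI vectors the driver or vfio-pci end up using.
```
# Inside the guest: the interrupt vectors the nvidia driver registered.
grep -i nvidia /proc/interrupts

# On the Harvester host: the vectors vfio-pci set up for the passed-through
# devices.
grep -i vfio /proc/interrupts
```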
t
which kernel or ubuntu are you using on the vm?
b
```
5.14.0-570.35.1.el9_6.x86_64
```
That's the Alma kernel; the Ubuntu one is 24.04.
t
that is an OLD kernel.
b
```
ubuntu@gpu3test:~$ uname -a
Linux gpu3test 6.8.0-78-generic #78-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 12 11:34:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@gpu3test:~$ uname -r
6.8.0-78-generic
```
t
lol
b
Alma 9 ships what they ship
t
I have found that for nvidia ubuntu is the best.
t
oh wow
b
Yeah that's what we started with
We only tried alma because it was throwing a fit with the second card.
We've also tried this on Tesla100 cards and it's the same.
t
interesting.
b
I really do think it's related to that IRQ getting the same mapping.
I just don't see it in the configs anywhere, so this feels more like a bug
t
what version of harvester?
b
1.4.3
t
OH I know they worked on pci passthrough with 1.5.x
upgrade?
b
¯\_(ツ)_/¯
What's maddening is that just 1 card in the same VM works fine. (and with 2 - one of them still seems to work fine)
t
sounds like a Harv bug..
can you test with 1.5.1?
b
Not easily
t
just buy another server and 2 more gpus…… lol
b
By the time it makes it through customs and purchasing, 1.7.x will be here.
p
there's gotta be somewhere you could rent a 2x gpu dedi hourly or something and install 1.6.0-rc & a guest with a modern kernel just to test 🫣
r
I have tested 8 RTX5000 with passthrough. Ubuntu and harvester 1.5.x
b
@rhythmic-article-81903 On the same VM? 8 cards?
r
yes
b
Did they suffer from the IRQ thing? Or was that not a problem in that version?
r
I did not have that problem (many other problems though but all were solved)
b
Well, I upgraded to 1.5.1, and even with adding 4 cards to a single VM, `nvidia-smi` only ever detects 1 card. They all have the same interrupt IRQ number. I'm out of ideas and am starting to think about just opening up a bug issue.
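One thing that might be worth capturing for the bug report (a sketch, assuming root in the guest): how much 64-bit MMIO window the guest firmware handed out, since four A30s need roughly 128G of prefetchable space for their BAR1s alone.
```
# Inside the guest: show the PCI host bridge apertures. If the 64-bit window
# is smaller than the sum of the GPUs' large BARs (32G per A30), only the
# first card's BAR1 can be placed and the rest fail as seen in dmesg.
sudo grep -i 'pci bus' /proc/iomem
```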
r
b
It's similar, but it only happens for additional cards.
I don't think it's the same as the second issue at all.
I was a day early and $1 short.
@great-bear-19718 ❤️ Thanks for clarification. 🙂
> the root cause in case of harvester was the efi/bios firmware available in SLES SP6/SP7 repos.
> This has been fixed and Harvester v1.6.0 is the first to contain the images with the fix.
> Harvester v1.5.2 will be the next in line to get the fix.
> The root cause stems from a patch which limited efi/bios firmwares from using the cpu physical bits to identify total addressable memory and reserving 1/8th for MMIO space. If you use `cpu-model: host-passthrough` then the MMIO reservation on most modern systems would be sufficient to easily handle multiple GPUs.
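For anyone hitting this later, one way to apply that to an existing VM (a sketch, assuming a Harvester VM named gpu3test in the default namespace; Harvester VMs are KubeVirt VirtualMachines, so the per-VM CPU model lives under spec.template.spec.domain.cpu.model and takes effect on the next restart):
```
# Patch the KubeVirt VirtualMachine behind the Harvester VM, then stop/start
# the VM so the new CPU model is picked up.
kubectl patch vm gpu3test -n default --type merge \
  -p '{"spec":{"template":{"spec":{"domain":{"cpu":{"model":"host-passthrough"}}}}}}'
```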