# harvester
b
Has anyone been able to add more than 1 GPU via the PCI passthrough? We've been trying to add two cards to a VM and nvidia-smi only ever recognizes and loads 1 of them.
dmesg from AlmaLinux:
```
[Mon Aug 25 21:07:13 2025] Loaded X.509 cert 'Nvidia GPU OOT signing 001: 55e1cef88193e60419f0b0ec379c49f77545acf0'
[Mon Aug 25 21:07:13 2025] Loaded X.509 cert 'AlmaLinux NVIDIA Module Signing: 4a47544e27f990e063664d7d0dd9c2158954567a'
[Mon Aug 25 21:07:17 2025] nvidia: loading out-of-tree module taints kernel.
[Mon Aug 25 21:07:17 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[Mon Aug 25 21:07:17 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 21:07:17 2025] nvidia: probe of 0000:0a:00.0 failed with error -1
[Mon Aug 25 21:07:17 2025] NVRM: The NVIDIA probe routine failed for 1 device(s).
[Mon Aug 25 21:07:17 2025] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  580.65.06  Release Build  (mockbuild@)  Tue Aug  5 16:02:49 EDT 2025
[Mon Aug 25 21:07:17 2025] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  580.65.06  Release Build  (mockbuild@)  Tue Aug  5 16:02:38 EDT 2025
[Mon Aug 25 21:07:17 2025] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[Mon Aug 25 21:07:19 2025] [drm] Initialized nvidia-drm 0.0.0 for 0000:0b:00.0 on minor 1
[Mon Aug 25 21:07:19 2025] nvidia 0000:0b:00.0: [drm] No compatible format found
[Mon Aug 25 21:07:19 2025] nvidia 0000:0b:00.0: [drm] Cannot find any crtc or sizes
```
from Ubuntu:
```
# dmesg -T | egrep -i 'nvidia|nvrm|xid'
[Mon Aug 25 20:50:31 2025] nvidia: loading out-of-tree module taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia: module license 'NVIDIA' taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia: module license taints kernel.
[Mon Aug 25 20:50:31 2025] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR2 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                           NVRM: BAR5 is 0M @ 0x0 (PCI:0000:0a:00.0)
[Mon Aug 25 20:50:31 2025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  575.57.08  Sat May 24 07:21:16 UTC 2025
[Mon Aug 25 20:50:31 2025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  575.57.08  Sat May 24 06:52:56 UTC 2025
[Mon Aug 25 20:50:31 2025] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[Mon Aug 25 20:50:31 2025] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 0
[Mon Aug 25 20:50:31 2025] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[Mon Aug 25 20:50:31 2025] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 1
[Mon Aug 25 20:50:32 2025] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[Mon Aug 25 20:51:18 2025] caller os_map_kernel_space+0x120/0x130 [nvidia] mapping multiple BARs
[Mon Aug 25 20:51:18 2025] NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x24:0x72:1589)
[Mon Aug 25 20:51:18 2025] NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
```
But `nvidia-smi -L` will only ever list one at a time. If you only add 1 of the 2, it will always work and show up. But with two we always see the "PCI I/O region is invalid" messages or "Cannot find any crtc or sizes".
Both cards have
Interrupt: pin A routed to IRQ 23
is there a way to unset that or change it?
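A quick way to check whether that shared pin actually matters (a sketch, assuming lspci is available in the guest): the "pin A routed to IRQ 23" line is only the legacy INTx routing, which is shareable by design, and the NVIDIA driver normally switches the device over to MSI once it binds.
```
# From inside the guest: show both the legacy INTx routing and the MSI state.
# If the MSI capability reports "Enable+", the card is not using the shared
# IRQ 23 line at all and the duplicate pin is likely a red herring.
sudo lspci -d 10de: -vvv | grep -E 'Interrupt:|MSI'
```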
t
I have been able to get 1 card working. I have not tested two. Did you add the audio device for the cards as well?
b
Nope, they were getting used for compute workloads.
t
I heard that when adding gpus you need to add the audio device from the cards as well.
maybe worth a shot?
b
Does it show as a different PCIE device?
t
yes

https://youtu.be/RgW_uB6dOJ0?t=274
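For reference: on cards that have HDMI audio, it shows up as a separate PCI function right next to the GPU (typically .1), so it is easy to check from the Harvester host (a sketch, assuming shell access and lspci on the node):
```
# On the host: list every NVIDIA function with numeric IDs. A GPU with HDMI
# audio appears as two functions, e.g. 0a:00.0 (VGA/3D) and 0a:00.1 (Audio).
# Compute-only cards such as the A30 expose just the .0 function.
lspci -nn -d 10de:
```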

b
I'm not sure the cards we have have that. I'm filtering by the Vendor ID and only the 3D Controllers populate
t
ah ok. I don’t think the Teslas have HDMI audio. I had an A2000 which did.
Are all those cards on the same host? you may be trying to add cards from 2 different hosts.
b
100% on the same host
Well, the ones I'm trying to add
The VM boots and lspci shows both devices.
t
OH.. alma and the driver don’t see both?
b
On both ubuntu and alma, the host sees two devices with lspci, but the driver fails to get both working.
```
0a:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
        Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
        Physical Slot: 0-10
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 3: Memory at 39080000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nvidia_drm, nvidia

0b:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
        Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
        Physical Slot: 0-11
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 38000000000 (64-bit, prefetchable) [size=32G]
        Region 3: Memory at 38800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nvidia_drm, nvidia
```
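Worth pointing out in that output: 0a:00.0 only has Region 0 and Region 3, i.e. its big 32G BAR (Region 1) never got an address, which lines up with the "BAR1 is 0M @ 0x0" error, while 0b:00.0 did get its 32G window. Two checks that make this visible (a sketch, assuming root in the guest):
```
# Confirm which BARs were actually assigned to each GPU; the failing card is
# missing its large 64-bit prefetchable region.
sudo lspci -d 10de: -vv | grep -E 'controller|Region'

# Look for the kernel failing to place that BAR at boot time; this is what
# later surfaces as NVRM's "BAR1 is 0M @ 0x0".
sudo dmesg | grep -iE 'no space|failed to assign|BAR 1'
```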
t
does lspci from the vm see both?
b
yes
That text block above is from both devices on Ubuntu
t
sounds like a kernel/driver issue and not harvester.
in my testing I used rocky 9 from the nvidia repo : https://github.com/clemenko/harvester_ollama
b
I think the Interrupt pin is set at the hypervisor level.
Like the host level has them all routed to different IRQs.
```
gpu3:~ # lspci -d 10de: -vvv |grep IRQ
        Interrupt: pin A routed to IRQ 1209
        Interrupt: pin A routed to IRQ 1210
        Interrupt: pin A routed to IRQ 1217
        Interrupt: pin A routed to IRQ 1218
```
```
ubuntu@gpu3test:~$ lspci -d 10de: -vvv | grep IRQ
        Interrupt: pin A routed to IRQ 23
        Interrupt: pin A routed to IRQ 23
```
```
[root@alma9gpu3test ~]# lspci -d 10DE: -vvv  |grep IRQ
        Interrupt: pin A routed to IRQ 23
        Interrupt: pin A routed to IRQ 23
```
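Another data point worth comparing (a sketch, assuming root on both sides): /proc/interrupts shows what is actually wired up, since the lspci line only reflects the legacy INTx pin and not the MSI vectors the driver or vfio-pci end up using.
```
# Inside the guest: the interrupt vectors the nvidia driver registered.
grep -i nvidia /proc/interrupts

# On the Harvester host: the vectors vfio-pci set up for the passed-through
# devices.
grep -i vfio /proc/interrupts
```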
t
which kernel or ubuntu are you using on the vm?
b
```
5.14.0-570.35.1.el9_6.x86_64
```
That's the Alma kernel; the Ubuntu one is 24.04.
t
that is an OLD kernel.
b
```
ubuntu@gpu3test:~$ uname -a
Linux gpu3test 6.8.0-78-generic #78-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 12 11:34:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@gpu3test:~$ uname -r
6.8.0-78-generic
```
t
lol
b
Alma 9 ships what they ship
t
I have found that for nvidia ubuntu is the best.
t
oh wow
b
Yeah that's what we started with
We only tried alma because it was throwing a fit with the second card.
We've also tried this on Tesla100 cards and it's the same.
t
interesting.
b
I really do think it's related to that IRQ getting the same mapping.
I just don't see it in the configs anywhere, so this feels more like a bug
t
what version of harvester?
b
1.4.3
t
OH I know they worked on pci passthrough with 1.5.x
upgrade?
b
¯\_(ツ)_/¯
What's maddening is that just 1 card in the same VM works fine. (and with 2 - one of them still seems to work fine)
t
sounds like a Harv bug..
can you test with 1.5.1?
b
Not easily
t
just buy another server and 2 more gpus…… lol
b
By the time it makes it through customs and purchasing, 1.7.x will be here.
p
there's gotta be somewhere you could rent a 2x gpu dedi hourly or something and install 1.6.0-rc & a guest with a modern kernel just to test 🫣
r
I have tested 8 RTX5000 with passthrough. Ubuntu and harvester 1.5.x
b
@rhythmic-article-81903 On the same VM? 8 cards?
r
yes
b
Did they suffer from the IRQ thing? Or was that not a problem in that version?
r
I did not have that problem (many other problems though but all were solved)
b
Well, I upgraded to 1.5.1, and even with adding 4 cards to a single VM, `nvidia-smi` only ever detects 1 card. They all have the same interrupt IRQ number. I'm out of ideas and am starting to think about just opening up a bug issue.
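One thing that might be worth capturing for the bug report (a sketch, assuming root in the guest): how much 64-bit MMIO window the guest firmware handed out, since four A30s need roughly 128G of prefetchable space for their BAR1s alone.
```
# Inside the guest: show the PCI host bridge apertures. If the 64-bit window
# is smaller than the sum of the GPUs' large BARs (32G per A30), only the
# first card's BAR1 can be placed and the rest fail as seen in dmesg.
sudo grep -i 'pci bus' /proc/iomem
```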
r
b
It's similar, but it only happens for additional cards.
I don't think it's the same as the second issue at all.
I was a day early and $1 short.
@great-bear-19718 ❤️ Thanks for clarification. 🙂
> the root cause in case of harvester was the efi/bios firmware available in SLES SP6/SP7 repos.
> This has been fixed and Harvester v1.6.0 is the first to contain the images with the fix.
> Harvester v1.5.2 will be the next in line to get the fix.
> The root cause stems from a patch which limited efi/bios firmwares from using the cpu physical bits to identify total addressable memory and reserving 1/8th for MMIO space. If you use `cpu-model: host-passthrough` then the MMIO reservation on most modern systems would be sufficient to easily handle multiple GPUs.
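For anyone hitting this later, one way to apply that to an existing VM (a sketch, assuming a Harvester VM named gpu3test in the default namespace; Harvester VMs are KubeVirt VirtualMachines, so the per-VM CPU model lives under spec.template.spec.domain.cpu.model and takes effect on the next restart):
```
# Patch the KubeVirt VirtualMachine behind the Harvester VM, then stop/start
# the VM so the new CPU model is picked up.
kubectl patch vm gpu3test -n default --type merge \
  -p '{"spec":{"template":{"spec":{"domain":{"cpu":{"model":"host-passthrough"}}}}}}'
```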