# harvester
Hi @rhythmic-painter-76998, if you enabled the pcidevices addon, there will be a
DaemonSet running on the cluster. You can check the Pod’s log by executing
kubectl -n harvester-system logs harvester-pcidevices-controller-xxxxx
to see if there’s any error message. And if it’s possible, please grab a support bundle file and post it here. We can help look into it. Thanks
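The controller log can be noisy, so here is a minimal sketch for pulling out only the devices that failed to sync. It assumes the logrus-style lines that appear in this thread (`level=error msg="error syncing '<name>': ..."`); the function name `failed_pcidevices` is made up for illustration and is not part of Harvester.

```shell
# Read controller log lines on stdin and print each device name that
# failed to sync, once. Assumes the message format seen in this thread.
failed_pcidevices() {
  grep 'level=error' \
    | sed -n "s/.*error syncing '\([^']*\)'.*/\1/p" \
    | sort -u
}

# Usage (pod name suffix left as in the message above):
#   kubectl -n harvester-system logs harvester-pcidevices-controller-xxxxx | failed_pcidevices
```

An empty result only means no line matched this pattern, not that the controller is healthy.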
i got this message
time="2023-04-27T02:51:31Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T02:51:31Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T02:51:31Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T02:51:31Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T02:51:31Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
btw where or how can I grab bundle file?
To generate a support bundle, please refer to this
bundle file is uploaded along with the issue
Thank you! I quickly scanned for similar issues, and it turns out there’s one, but that seems to be resolved already … cc @great-bear-19718 @limited-breakfast-50094
but my pcicontroller is running image rancher/harvester-pcidevices:v0.2.4, so the code mentioned there should be included.
that was a dev image i created for QA to validate.. let me check the support bundle
was this an upgrade?
no fresh installation
are you able to reboot this node?
I rebooted once, but the status is the same
but I can do reboot if you need
let me check the code.. and see first
gimme a few mins
can you please share output of
ls -l /sys/bus/pci/drivers/vfio-pci/
from the node
sudo ls -l /sys/bus/pci/drivers/vfio-pci/
total 0
--w------- 1 root root 4096 Apr 27 02:00 bind
lrwxrwxrwx 1 root root    0 Apr 27 04:31 module -> ../../../../module/vfio_pci
--w------- 1 root root 4096 Apr 27 02:00 new_id
--w------- 1 root root 4096 Apr 27 04:31 remove_id
--w------- 1 root root 4096 Apr 27 02:00 uevent
--w------- 1 root root 4096 Apr 27 04:31 unbind
can you please try this..
echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
it seems that I am not able to overwrite existing file
do you get an error?
needs to be done as root
harvester01:~ # whoami
harvester01:~ # echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id
-bash: echo: write error: File exists
ok.. lets skip it then
and try the 2nd one
harvester01:~ # echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
-bash: echo: write error: Invalid argument
can you please try this..
echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
it doesnt seem to be bound.. but i am curious why it is failing
harvester01:~ # echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
-bash: echo: write error: No such device
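The "No such device" reply from unbind fits the guess that the device is not attached to vfio-pci right now. A small sketch for checking which driver, if any, currently owns a device, assuming the usual sysfs layout where a bound device has a `driver` symlink in its directory. `pci_bound_driver` and the `SYSFS_ROOT` override are hypothetical names used only so the function can be tried against a fake tree.

```shell
# Print the name of the driver a PCI device is bound to, or "unbound".
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_bound_driver() {
  local addr="$1" root="${SYSFS_ROOT:-/sys}"
  local link="$root/bus/pci/devices/$addr/driver"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo "unbound"
  fi
}

# Usage: pci_bound_driver 0000:01:00.0
```

After a successful claim this would be expected to print `vfio-pci`.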
lspci | grep -i nvidia
harvester01:~ # lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
are you able to enable pcideviceclaim on
NVIDIA Corporation GA102 High Definition Audio Controller
as well
they should be in same iommu group.. but i see iommu group is not loaded for some reason in device status
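Since the VGA and audio functions sit in the same slot, it can help to list every function of that slot before claiming devices. This is only a sketch, assuming the standard `/sys/bus/pci/devices` layout; `pci_slot_functions` and `SYSFS_ROOT` are illustrative names, not Harvester or kernel settings.

```shell
# List all PCI functions that share a slot with the given address,
# e.g. 0000:01:00.0 -> 0000:01:00.0 and 0000:01:00.1.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_slot_functions() {
  local addr="$1" root="${SYSFS_ROOT:-/sys}" slot dev
  slot="${addr%.*}"   # strip the function number: 0000:01:00.0 -> 0000:01:00
  for dev in "$root"/bus/pci/devices/"$slot".*; do
    if [ -e "$dev" ] || [ -L "$dev" ]; then
      basename "$dev"
    fi
  done
}

# Usage: pci_slot_functions 0000:01:00.0
```

Functions of one slot usually end up in one IOMMU group, but that depends on the platform, so treat the output as a list of candidates to check rather than a guarantee.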
definitely, but I tried to enable them before, still no luck.
time="2023-04-27T04:55:42Z" level=info msg="Reconciling PCI Devices list"
time="2023-04-27T04:55:54Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T04:55:54Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T04:55:54Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T04:55:54Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T04:55:54Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T04:55:56Z" level=info msg="Adding harvester01-000001001 to KubeVirt list of permitted devices"
time="2023-04-27T04:55:56Z" level=info msg="Enabling passthrough for PDC: harvester01-000001001"
time="2023-04-27T04:55:56Z" level=info msg="Binding device harvester01-000001001 [10de 1aef] to vfio-pci"
time="2023-04-27T04:55:56Z" level=info msg="Binding device 0000:01:00.1 vfio-pci"
time="2023-04-27T04:55:56Z" level=error msg="error syncing 'harvester01-000001001': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
lets try something..
disable the devices..
delete the pcidevices.. and reboot node
may i also see this...
ls -alrt /sys/bus/pci/devices/0000:01:00.0
harvester01:~ #  ls -alrt /sys/bus/pci/devices/0000:01:00.0
lrwxrwxrwx 1 root root 0 Apr 27 01:58 /sys/bus/pci/devices/0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
how is the gpu plugged into the machine?
sorry, i meant
ls -alrt /sys/bus/pci/devices/0000:01:00.0/
harvester01:~ # ls -alrt /sys/bus/pci/devices/0000:01:00.0/
total 0
-r--r--r--  1 root root      4096 Apr 27 01:58 waiting_for_supplier
-r--r--r--  1 root root      4096 Apr 27 01:58 vendor
-rw-r--r--  1 root root      4096 Apr 27 01:58 uevent
-r--r--r--  1 root root      4096 Apr 27 01:58 subsystem_device
lrwxrwxrwx  1 root root         0 Apr 27 01:58 subsystem -> ../../../../bus/pci
-rw-------  1 root root    524288 Apr 27 01:58 rom
-r--r--r--  1 root root      4096 Apr 27 01:58 revision
-rw-------  1 root root       128 Apr 27 01:58 resource5
-rw-------  1 root root  33554432 Apr 27 01:58 resource3_wc
-rw-------  1 root root  33554432 Apr 27 01:58 resource3
-rw-------  1 root root 268435456 Apr 27 01:58 resource1_wc
-rw-------  1 root root 268435456 Apr 27 01:58 resource1
-rw-------  1 root root  16777216 Apr 27 01:58 resource0
-r--r--r--  1 root root      4096 Apr 27 01:58 resource
-rw-r--r--  1 root root      4096 Apr 27 01:58 reset_method
--w-------  1 root root      4096 Apr 27 01:58 reset
--w-------  1 root root      4096 Apr 27 01:58 rescan
--w--w----  1 root root      4096 Apr 27 01:58 remove
-r--r--r--  1 root root      4096 Apr 27 01:58 power_state
drwxr-xr-x  2 root root         0 Apr 27 01:58 power
-rw-r--r--  1 root root      4096 Apr 27 01:58 numa_node
-rw-r--r--  1 root root      4096 Apr 27 01:58 msi_bus
-r--r--r--  1 root root      4096 Apr 27 01:58 modalias
-r--r--r--  1 root root      4096 Apr 27 01:58 max_link_width
-r--r--r--  1 root root      4096 Apr 27 01:58 max_link_speed
-r--r--r--  1 root root      4096 Apr 27 01:58 local_cpus
-r--r--r--  1 root root      4096 Apr 27 01:58 local_cpulist
drwxr-xr-x  2 root root         0 Apr 27 01:58 link
-r--r--r--  1 root root      4096 Apr 27 01:58 irq
lrwxrwxrwx  1 root root         0 Apr 27 01:58 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/device:01
-rw-r--r--  1 root root      4096 Apr 27 01:58 enable
-rw-r--r--  1 root root      4096 Apr 27 01:58 driver_override
-r--r--r--  1 root root      4096 Apr 27 01:58 dma_mask_bits
-r--r--r--  1 root root      4096 Apr 27 01:58 device
-rw-r--r--  1 root root      4096 Apr 27 01:58 d3cold_allowed
-r--r--r--  1 root root      4096 Apr 27 01:58 current_link_width
-r--r--r--  1 root root      4096 Apr 27 01:58 current_link_speed
lrwxrwxrwx  1 root root         0 Apr 27 01:58 consumer:pci:0000:01:00.1 -> ../../../virtual/devlink/pci:0000:01:00.0--pci:0000:01:00.1
-r--r--r--  1 root root      4096 Apr 27 01:58 consistent_dma_mask_bits
-rw-r--r--  1 root root      4096 Apr 27 01:58 config
-r--r--r--  1 root root      4096 Apr 27 01:58 class
-rw-r--r--  1 root root      4096 Apr 27 01:58 broken_parity_status
-r--r--r--  1 root root      4096 Apr 27 01:58 boot_vga
-r--r--r--  1 root root      4096 Apr 27 01:58 ari_enabled
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_nonfatal
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_fatal
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_correctable
drwxr-xr-x 12 root root         0 Apr 27 01:58 ..
drwxr-xr-x  4 root root         0 Apr 27 01:58 .
-r--r--r--  1 root root      4096 Apr 27 05:00 subsystem_vendor
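The listing above has no `iommu_group` entry. On a host where the IOMMU is active, that entry normally shows up as a symlink into `/sys/kernel/iommu_groups/<n>`. A minimal sketch for checking a single device, assuming that layout; `pci_iommu_group` and `SYSFS_ROOT` are hypothetical names for illustration.

```shell
# Print the IOMMU group number of a PCI device, or "none" when the
# kernel has not placed the device in any group.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_iommu_group() {
  local addr="$1" root="${SYSFS_ROOT:-/sys}"
  local link="$root/bus/pci/devices/$addr/iommu_group"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo "none"
  fi
}

# Usage: pci_iommu_group 0000:01:00.0
```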
you mean how gpu physically plugged into the machine?
i need to be away for a short while.. i will check some of our gpu nodes to see what is going on there..
i dont have the same gpu type either
so i need to check up on that too
while you are off, here is what i am going to do: 1. disable pcicontroller, 2. reboot, 3. re-enable pcicontroller. sounds good?
Sounds good
I need to check if there is something specific for this device
still no luck
/sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T05:22:25Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T05:22:25Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T05:22:25Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T05:22:25Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T05:22:25Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T05:22:27Z" level=info msg="Adding harvester01-000001001 to KubeVirt list of permitted devices"
time="2023-04-27T05:22:27Z" level=info msg="Enabling passthrough for PDC: harvester01-000001001"
time="2023-04-27T05:22:27Z" level=info msg="Binding device harvester01-000001001 [10de 1aef] to vfio-pci"
time="2023-04-27T05:22:27Z" level=info msg="Binding device 0000:01:00.1 vfio-pci"
time="2023-04-27T05:22:27Z" level=error msg="error syncing 'harve
Can you please check in bios if there is a setting for virtualisation or iommu
I saw no iommu grouping for the pcidevices
What is the machine?
harvester01:/home/rancher # dmesg | grep IOMMU
[    0.023360] DMAR: IOMMU enabled
is this enough ?
Should be fine then
If you check pcidevice for gpu do you see iommu grouping there?
That message could be misleading
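One way to cross-check the dmesg line is to count the groups the kernel actually created. This is a sketch under the assumption that the host uses the standard `/sys/kernel/iommu_groups` directory, with one numbered subdirectory per group; `iommu_group_count` and `SYSFS_ROOT` are made-up names.

```shell
# Count the IOMMU groups exposed by the kernel.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
iommu_group_count() {
  local root="${SYSFS_ROOT:-/sys}" n=0 d
  for d in "$root"/kernel/iommu_groups/*/; do
    if [ -d "$d" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Usage: iommu_group_count
```

A count of 0 would suggest that no groups exist even though the kernel option was accepted, which would match the empty iommu group seen in the device status.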
“If you check pcidevice for gpu do you see iommu grouping there?” how to check this?
kubectl get pcidevices <deviceName> -o yaml
Status should have a field for iommu group
Which I don’t see for your devices from the support bundle
apiVersion: <|>
kind: PCIDevice
metadata:
  annotations:
    <|>: ""
  creationTimestamp: "2023-04-26T02:07:48Z"
  generation: 1
  labels:
    nodename: harvester01
  name: harvester01-000001000
  resourceVersion: "22811"
  uid: febacf0f-8ddd-4ad9-b44e-ae4c06114643
spec: {}
status:
  address: "0000:01:00.0"
  classId: "0300"
  description: 'VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]'
  deviceId: "2204"
  iommuGroup: ""
  nodeName: harvester01
  resourceName: <|>
  vendorId: 10de
is empty
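To spot this across all devices at once, something like the following could work. It assumes the group is exposed as `.status.iommuGroup` on each PCIDevice object, as in the YAML above; `missing_iommu_group` is a hypothetical helper name.

```shell
# Read "name<TAB>iommuGroup" lines on stdin and print the names whose
# group column is empty.
missing_iommu_group() {
  awk -F '\t' '$2 == "" { print $1 }'
}

# Usage, assuming kubectl access to the cluster:
#   kubectl get pcidevices \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.iommuGroup}{"\n"}{end}' \
#     | missing_iommu_group
```

If every device on a node shows up in the output, the problem is more likely host-wide than specific to one card.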
yeah.. likely why the binding is failing
are you able to please check the bios settings
what is the machine type / processor type?
processor is intel i7-13700k motherboard is z690 aorus elite ax
you may need to check if
is enabled in the bios
the hardware i have.. i see the settings...
I enabled virtualization on bios
i cant say if there is an additional setting for
in the bios
so i disabled VT-d on a node.. and my pcidevice before disabling looked like this..
apiVersion: <|>
kind: PCIDevice
metadata:
  annotations:
    <|>: ixgbe
  creationTimestamp: "2023-04-14T03:23:29Z"
  generation: 1
  labels:
    nodename: harvester-ldgh9
  name: harvester-ldgh9-000004000
  resourceVersion: "38802"
  uid: 502ab495-414d-45fa-b5b8-0e86fa92e899
spec: {}
status:
  address: "0000:04:00.0"
  classId: "0200"
  description: 'Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
    Network Connection'
  deviceId: 10fb
  iommuGroup: "36"
  kernelDriverInUse: ixgbe
  nodeName: harvester-ldgh9
  resourceName: <|>
  vendorId: "8086"
after disabling.. i deleted the pcidevice object.. and it was rescanned and recreated as
apiVersion: <|>
kind: PCIDevice
metadata:
  annotations:
    <|>: ixgbe
  creationTimestamp: "2023-04-27T06:31:03Z"
  generation: 1
  labels:
    nodename: harvester-ldgh9
  name: harvester-ldgh9-000004000
  resourceVersion: "21133767"
  uid: 989bb8bf-f6e8-424d-be70-f110e251c3b8
spec: {}
status:
  address: "0000:04:00.0"
  classId: "0200"
  description: 'Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
    Network Connection'
  deviceId: 10fb
  iommuGroup: ""
  kernelDriverInUse: ixgbe
  nodeName: harvester-ldgh9
  resourceName: <|>
  vendorId: "8086"
leaving for a meeting, will come back later
and device cant be enabled..
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Adding harvester-ldgh9-000004000 to KubeVirt list of permitted devices"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Enabling passthrough for PDC: harvester-ldgh9-000004000"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Binding device harvester-ldgh9-000004000 [8086 10fb] to vfio-pci"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Binding device 0000:04:00.0 vfio-pci"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=error msg="error syncing 'harvester-ldgh9-000004000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
so basically we need to find VT-d setting and make sure it is enabled.. and that will fix it
let me try from bios setting
turns out there is another setting called vt-d along with virtualization.
thanks @great-bear-19718