# harvester
r
Hi @rhythmic-painter-76998, if you enabled the pcidevices addon, there will be a
harvester-pcidevices-controller
DaemonSet running on the cluster. You can check the Pod’s log by executing
```
kubectl -n harvester-system logs harvester-pcidevices-controller-xxxxx
```
to see if there’s any error message. And if it’s possible, please grab a support bundle file and post it here. We can help look into it. Thanks
r
i got this message
```
time="2023-04-27T02:51:31Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T02:51:31Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T02:51:31Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T02:51:31Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T02:51:31Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
```
btw where or how can I grab bundle file?
r
To generate a support bundle, please refer to this
r
bundle file is uploaded along with the issue
r
Thank you! I quickly scanned for similar issues, and it turns out there’s one: https://github.com/harvester/harvester/issues/3630. But that seems to be resolved already … cc @great-bear-19718 @limited-breakfast-50094
r
but my pcidevices controller is running image rancher/harvester-pcidevices:v0.2.4, so the fix mentioned there (i.e. the image
```
gmehta3/pcidevices:dev
Digest: sha256:e86f89562a997dd54677eee9e945a1cadefa8cfa103267edf01d380ede39c9a5
```
) should already be included.
g
that was a dev image i created for QA to validate.. let me check the support bundle
was this an upgrade?
r
no fresh installation
g
are you able to reboot this node?
r
I rebooted once, but the status is the same
but I can do reboot if you need
g
let me check the code.. and see first
gimme a few mins
can you please share the output of
```
ls -l /sys/bus/pci/drivers/vfio-pci/
```
from the node
r
```
sudo ls -l /sys/bus/pci/drivers/vfio-pci/
total 0
--w------- 1 root root 4096 Apr 27 02:00 bind
lrwxrwxrwx 1 root root    0 Apr 27 04:31 module -> ../../../../module/vfio_pci
--w------- 1 root root 4096 Apr 27 02:00 new_id
--w------- 1 root root 4096 Apr 27 04:31 remove_id
--w------- 1 root root 4096 Apr 27 02:00 uevent
--w------- 1 root root 4096 Apr 27 04:31 unbind
```
g
can you please try this..
```
echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
```
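For reference, a hedged sketch of the generic sysfs mechanism behind these two commands (general kernel behavior, not the controller’s actual code): `new_id` answers “File exists” when the vendor/device pair is already registered, and a failing `bind` usually surfaces the driver’s probe error. The `driver_override` route below skips the ID-table step entirely; the device address is the one from this thread, and the guard makes the script a no-op on machines without that device:

```shell
# Sketch: bind a PCI device to vfio-pci via driver_override.
# Requires root and a working IOMMU; paths are standard sysfs.
bind_vfio() {
  dev=$1
  sysdev=/sys/bus/pci/devices/$dev
  if [ ! -d "$sysdev" ]; then
    echo "no such device: $dev"
    return 0
  fi
  # release the device from its current driver, if any
  if [ -e "$sysdev/driver" ]; then
    echo "$dev" > "$sysdev/driver/unbind" 2>/dev/null || true
  fi
  # pin the device to vfio-pci regardless of ID tables,
  # then ask the PCI core to re-probe it
  echo vfio-pci > "$sysdev/driver_override" 2>/dev/null || true
  echo "$dev" > /sys/bus/pci/drivers_probe 2>/dev/null || true
}

bind_vfio 0000:01:00.0
```

Run as root on the node; after a successful re-probe, `/sys/bus/pci/devices/0000:01:00.0/driver` should point at vfio-pci.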
r
it seems that I am not able to write to the existing file
g
do you get an error?
needs to be done as root
r
```
harvester01:~ # whoami
root
harvester01:~ # echo "10de 2204" > /sys/bus/pci/drivers/vfio-pci/new_id
-bash: echo: write error: File exists
```
g
ok.. lets skip it then
and try the 2nd one
r
```
harvester01:~ # echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
-bash: echo: write error: Invalid argument
```
g
can you please try this..
```
echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
```
it doesn't seem to be bound.. but I am curious why it is failing
r
```
harvester01:~ # echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
-bash: echo: write error: No such device
```
g
```
lspci | grep -i nvidia
```
?
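As an aside, if lspci were unavailable on the node, the same devices could be found from sysfs alone (a sketch; 0x10de is NVIDIA’s PCI vendor ID):

```shell
# List PCI addresses whose vendor ID file matches the given value.
list_vendor_devices() {
  for d in /sys/bus/pci/devices/*; do
    [ -e "$d/vendor" ] || continue
    # vendor files hold values like "0x10de"
    [ "$(cat "$d/vendor")" = "$1" ] && basename "$d"
  done
  return 0
}

list_vendor_devices 0x10de
```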
r
```
harvester01:~ # lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
```
g
are you able to enable the pcideviceclaim on
NVIDIA Corporation GA102 High Definition Audio Controller
as well?
they should be in the same IOMMU group.. but I see the IOMMU group is not loaded for some reason in the device status
r
definitely, but I tried to enable them before, still no luck.
```
time="2023-04-27T04:55:42Z" level=info msg="Reconciling PCI Devices list"
time="2023-04-27T04:55:54Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T04:55:54Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T04:55:54Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T04:55:54Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T04:55:54Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T04:55:56Z" level=info msg="Adding harvester01-000001001 to KubeVirt list of permitted devices"
time="2023-04-27T04:55:56Z" level=info msg="Enabling passthrough for PDC: harvester01-000001001"
time="2023-04-27T04:55:56Z" level=info msg="Binding device harvester01-000001001 [10de 1aef] to vfio-pci"
time="2023-04-27T04:55:56Z" level=info msg="Binding device 0000:01:00.1 vfio-pci"
time="2023-04-27T04:55:56Z" level=error msg="error syncing 'harvester01-000001001': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
```
g
let's try something..
disable the devices..
delete the pcidevices.. and reboot the node
may i also see this...
```
ls -alrt /sys/bus/pci/devices/0000:01:00.0
```
r
```
harvester01:~ #  ls -alrt /sys/bus/pci/devices/0000:01:00.0
lrwxrwxrwx 1 root root 0 Apr 27 01:58 /sys/bus/pci/devices/0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
```
g
how is the gpu plugged into the machine?
sorry, I actually meant
```
ls -alrt /sys/bus/pci/devices/0000:01:00.0/
```
r
```
harvester01:~ # ls -alrt /sys/bus/pci/devices/0000:01:00.0/
total 0
-r--r--r--  1 root root      4096 Apr 27 01:58 waiting_for_supplier
-r--r--r--  1 root root      4096 Apr 27 01:58 vendor
-rw-r--r--  1 root root      4096 Apr 27 01:58 uevent
-r--r--r--  1 root root      4096 Apr 27 01:58 subsystem_device
lrwxrwxrwx  1 root root         0 Apr 27 01:58 subsystem -> ../../../../bus/pci
-rw-------  1 root root    524288 Apr 27 01:58 rom
-r--r--r--  1 root root      4096 Apr 27 01:58 revision
-rw-------  1 root root       128 Apr 27 01:58 resource5
-rw-------  1 root root  33554432 Apr 27 01:58 resource3_wc
-rw-------  1 root root  33554432 Apr 27 01:58 resource3
-rw-------  1 root root 268435456 Apr 27 01:58 resource1_wc
-rw-------  1 root root 268435456 Apr 27 01:58 resource1
-rw-------  1 root root  16777216 Apr 27 01:58 resource0
-r--r--r--  1 root root      4096 Apr 27 01:58 resource
-rw-r--r--  1 root root      4096 Apr 27 01:58 reset_method
--w-------  1 root root      4096 Apr 27 01:58 reset
--w-------  1 root root      4096 Apr 27 01:58 rescan
--w--w----  1 root root      4096 Apr 27 01:58 remove
-r--r--r--  1 root root      4096 Apr 27 01:58 power_state
drwxr-xr-x  2 root root         0 Apr 27 01:58 power
-rw-r--r--  1 root root      4096 Apr 27 01:58 numa_node
-rw-r--r--  1 root root      4096 Apr 27 01:58 msi_bus
-r--r--r--  1 root root      4096 Apr 27 01:58 modalias
-r--r--r--  1 root root      4096 Apr 27 01:58 max_link_width
-r--r--r--  1 root root      4096 Apr 27 01:58 max_link_speed
-r--r--r--  1 root root      4096 Apr 27 01:58 local_cpus
-r--r--r--  1 root root      4096 Apr 27 01:58 local_cpulist
drwxr-xr-x  2 root root         0 Apr 27 01:58 link
-r--r--r--  1 root root      4096 Apr 27 01:58 irq
lrwxrwxrwx  1 root root         0 Apr 27 01:58 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/device:01
-rw-r--r--  1 root root      4096 Apr 27 01:58 enable
-rw-r--r--  1 root root      4096 Apr 27 01:58 driver_override
-r--r--r--  1 root root      4096 Apr 27 01:58 dma_mask_bits
-r--r--r--  1 root root      4096 Apr 27 01:58 device
-rw-r--r--  1 root root      4096 Apr 27 01:58 d3cold_allowed
-r--r--r--  1 root root      4096 Apr 27 01:58 current_link_width
-r--r--r--  1 root root      4096 Apr 27 01:58 current_link_speed
lrwxrwxrwx  1 root root         0 Apr 27 01:58 consumer:pci:0000:01:00.1 -> ../../../virtual/devlink/pci:0000:01:00.0--pci:0000:01:00.1
-r--r--r--  1 root root      4096 Apr 27 01:58 consistent_dma_mask_bits
-rw-r--r--  1 root root      4096 Apr 27 01:58 config
-r--r--r--  1 root root      4096 Apr 27 01:58 class
-rw-r--r--  1 root root      4096 Apr 27 01:58 broken_parity_status
-r--r--r--  1 root root      4096 Apr 27 01:58 boot_vga
-r--r--r--  1 root root      4096 Apr 27 01:58 ari_enabled
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_nonfatal
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_fatal
-r--r--r--  1 root root      4096 Apr 27 01:58 aer_dev_correctable
drwxr-xr-x 12 root root         0 Apr 27 01:58 ..
drwxr-xr-x  4 root root         0 Apr 27 01:58 .
-r--r--r--  1 root root      4096 Apr 27 05:00 subsystem_vendor
```
you mean how the gpu is physically plugged into the machine?
g
yeah
I need to be away for a short while.. I will check some of our gpu nodes to see what is going on there..
I don't have the same gpu type either
so I need to check up on that too
r
ok
while you are off, here is what I am going to do: 1. disable the pcidevices controller, 2. reboot, 3. re-enable the pcidevices controller. sounds good?
g
Sounds good
I need to check if there is something specific for this devices
r
still no luck
```
/sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T05:22:25Z" level=info msg="Adding harvester01-000001000 to KubeVirt list of permitted devices"
time="2023-04-27T05:22:25Z" level=info msg="Enabling passthrough for PDC: harvester01-000001000"
time="2023-04-27T05:22:25Z" level=info msg="Binding device harvester01-000001000 [10de 2204] to vfio-pci"
time="2023-04-27T05:22:25Z" level=info msg="Binding device 0000:01:00.0 vfio-pci"
time="2023-04-27T05:22:25Z" level=error msg="error syncing 'harvester01-000001000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
time="2023-04-27T05:22:27Z" level=info msg="Adding harvester01-000001001 to KubeVirt list of permitted devices"
time="2023-04-27T05:22:27Z" level=info msg="Enabling passthrough for PDC: harvester01-000001001"
time="2023-04-27T05:22:27Z" level=info msg="Binding device harvester01-000001001 [10de 1aef] to vfio-pci"
time="2023-04-27T05:22:27Z" level=info msg="Binding device 0000:01:00.1 vfio-pci"
time="2023-04-27T05:22:27Z" level=error msg="error syncing 'harve
```
g
Can you please check in the BIOS if there is a setting for virtualisation or IOMMU
I saw no IOMMU grouping for the pcidevices
What is the machine?
r
```
harvester01:/home/rancher # dmesg | grep IOMMU
[    0.023360] DMAR: IOMMU enabled
```
is this enough?
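A caveat worth adding here (general kernel behavior, not something from the Harvester docs): "DMAR: IOMMU enabled" only confirms the ACPI DMAR table was found at boot. Whether translation is actually active is better judged by whether /sys/kernel/iommu_groups is populated, e.g.:

```shell
# Report whether the kernel actually built IOMMU groups on this machine.
check_iommu() {
  if [ -d /sys/kernel/iommu_groups ] && \
     [ -n "$(ls -A /sys/kernel/iommu_groups 2>/dev/null)" ]; then
    echo active
  else
    echo inactive   # check BIOS VT-d/AMD-Vi and the kernel command line
  fi
}

check_iommu
```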
g
Should be fine then
If you check pcidevice for gpu do you see iommu grouping there?
That message could be misleading
r
“If you check pcidevice for gpu do you see iommu grouping there?” how to check this?
g
```
kubectl get pcidevices deviceName -o yaml
```
Status should have a field for iommu group
Which I don’t see for your devices from the support bundle
r
```
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDevice
metadata:
  annotations:
    harvesterhci.io/pcideviceDriver: ""
  creationTimestamp: "2023-04-26T02:07:48Z"
  generation: 1
  labels:
    nodename: harvester01
  name: harvester01-000001000
  resourceVersion: "22811"
  uid: febacf0f-8ddd-4ad9-b44e-ae4c06114643
spec: {}
status:
  address: "0000:01:00.0"
  classId: "0300"
  description: 'VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]'
  deviceId: "2204"
  iommuGroup: ""
  nodeName: harvester01
  resourceName: nvidia.com/GA102_GEFORCE_RTX_3090
  vendorId: 10de
```
`iommuGroup` is empty
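Independently of the PCIDevice CRD, the group can be read straight from sysfs (a sketch; a missing iommu_group link is the likely reason vfio-pci refuses the bind):

```shell
# Print the IOMMU group number for a PCI address, or "none" if the kernel
# did not assign one (which is what an empty iommuGroup in the CRD reflects).
iommu_group_of() {
  link=/sys/bus/pci/devices/$1/iommu_group
  if [ -e "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo none
  fi
}

iommu_group_of 0000:01:00.0
```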
g
yeah.. likely why the binding is failing
are you able to please check the bios settings
what is the machine type / processor type?
r
the processor is an Intel i7-13700K, the motherboard a Z690 AORUS Elite AX
g
you may need to check if
VT-d
is enabled in the BIOS
on the hardware I have.. I see the setting...
r
I enabled virtualization on bios
g
I can't say if there is an additional setting for
VT-d
in the BIOS
so I disabled VT-d on a node.. my pcidevice before the disable looked like this..
```
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDevice
metadata:
  annotations:
    harvesterhci.io/pcideviceDriver: ixgbe
  creationTimestamp: "2023-04-14T03:23:29Z"
  generation: 1
  labels:
    nodename: harvester-ldgh9
  name: harvester-ldgh9-000004000
  resourceVersion: "38802"
  uid: 502ab495-414d-45fa-b5b8-0e86fa92e899
spec: {}
status:
  address: "0000:04:00.0"
  classId: "0200"
  description: 'Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
    Network Connection'
  deviceId: 10fb
  iommuGroup: "36"
  kernelDriverInUse: ixgbe
  nodeName: harvester-ldgh9
  resourceName: intel.com/82599ES_10GIGABIT_SFI_SFP_NETWORK_CONNECTION
  vendorId: "8086"
```
post disable.. I deleted the pcidevice CRD.. and it was rescanned and recreated as
```
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDevice
metadata:
  annotations:
    harvesterhci.io/pcideviceDriver: ixgbe
  creationTimestamp: "2023-04-27T06:31:03Z"
  generation: 1
  labels:
    nodename: harvester-ldgh9
  name: harvester-ldgh9-000004000
  resourceVersion: "21133767"
  uid: 989bb8bf-f6e8-424d-be70-f110e251c3b8
spec: {}
status:
  address: "0000:04:00.0"
  classId: "0200"
  description: 'Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
    Network Connection'
  deviceId: 10fb
  iommuGroup: ""
  kernelDriverInUse: ixgbe
  nodeName: harvester-ldgh9
  resourceName: intel.com/82599ES_10GIGABIT_SFI_SFP_NETWORK_CONNECTION
  vendorId: "8086"
```
r
ic
leaving for a meeting, will come back later
g
and the device can't be enabled..
```
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Adding harvester-ldgh9-000004000 to KubeVirt list of permitted devices"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Enabling passthrough for PDC: harvester-ldgh9-000004000"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Binding device harvester-ldgh9-000004000 [8086 10fb] to vfio-pci"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=info msg="Binding device 0000:04:00.0 vfio-pci"
harvester-pcidevices-controller-2tpxr agent time="2023-04-27T06:35:20Z" level=error msg="error syncing 'harvester-ldgh9-000004000': handler PCIDeviceClaimReconcile: error writing to bind file: write /sys/bus/pci/drivers/vfio-pci/bind: invalid argument, requeuing"
```
so basically we need to find the VT-d setting and make sure it is enabled.. and that will fix it
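Once the BIOS setting is flipped, the result can be double-checked from the node before retrying the claim (a sketch; after a successful PCIDeviceClaim the bound driver should be vfio-pci):

```shell
# Print the kernel driver currently bound to a PCI address, or "none".
driver_of() {
  d=/sys/bus/pci/devices/$1/driver
  if [ -e "$d" ]; then
    basename "$(readlink "$d")"
  else
    echo none
  fi
}

driver_of 0000:01:00.0   # expect vfio-pci once the claim succeeds
```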
r
let me try from bios setting
bingo
turns out there is another setting called VT-d alongside virtualization.
thanks @great-bear-19718
g
Great..