# harvester
g
what is the version of pcidevices?
m
It's version 0.2.4. Since I posted this I have found some more information about this error, and also a bug with the pci-addon that I reported in the pcidevices repo https://github.com/harvester/pcidevices/issues/57
Also, on the Harvester nodes all the drives are bound to the vfio-pci driver
g
are you able to manually update the ds to image tag v0.2.5
there are already some fixes in it for a few known issues
if your env is not airgapped then the node should be able to pull these images
m
I should be able to, but where do I get the new version? The latest release in the repo is 0.2.3, and on the Rancher dashboard for the Harvester cluster under https://{{HARVESTER_IP}}/dashboard/c/local/explorer there is also no update available.
g
the image is already in docker hub
you just need to update the pcidevices daemonset image to v0.2.5
the one packaged in 1.1.2 is v0.2.5
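something along these lines should do it.. i'm quoting the ds and image names from memory, so double check them with the get command first:
```bash
# check the actual DaemonSet name and image repository on your cluster first
kubectl -n harvester-system get ds
# then bump every container in the pcidevices DaemonSet to the v0.2.5 tag
# (the DaemonSet and image names below are assumptions; adjust them to what the command above shows)
kubectl -n harvester-system set image daemonset/harvester-pcidevices-controller \
  '*=rancher/harvester-pcidevices:v0.2.5'
```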
m
ah ok so no new helm chart, i see
g
or you could try 1.2.0-rc5
i assume you are only testing harvester in a vm.. so maybe installing rc5 would be the easiest way
m
no i am actually working with 5 bare metal nodes
g
ok
m
so just increase the image tag for the pcidevices, right?
g
there are some rbac changes needed to leverage the additional CRDs introduced
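roughly, the controller's clusterrole also needs access to the new CRDs in the devices group.. the snippet below is only an illustration (the role name, resources and verbs here are assumptions, the real rules ship with the 0.2.5 chart):
```yaml
# illustrative sketch only: name, resources and verbs are assumptions,
# take the real RBAC manifests from the v0.2.5 chart
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: harvester-pcidevices-controller
rules:
  - apiGroups: ["devices.harvesterhci.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
```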
m
k thank you very much. I will give this a try and let you know πŸ™‚
g
πŸ‘
m
Do you know if the changes address the bug I reported with the wrong "KERNEL DRIVER TO UNBIND" setting in the deviceclaims? Otherwise I would manually unbind those drives between tests, but I am not sure if that helps or hurts
g
there is a change to fix the reconcile logic which should address this..
❀️ 1
easiest way would be to remove pcideviceclaims.. reboot node.. this will rebind devices to original driver and then try again
if you want you can delete all pcidevices objects on the node..
the controller will reconcile and regenerate the crds
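roughly something like this (the nodename label is the one shown on the pcidevices objects later in this thread; replace the value with your node):
```bash
# remove all claims, then reboot the node so devices rebind to their original drivers
kubectl delete pcideviceclaims.devices.harvesterhci.io --all
# optionally also drop the pcidevices objects for that node;
# the controller will reconcile and regenerate them
kubectl delete pcidevices.devices.harvesterhci.io -l nodename=<your-node-name>
```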
m
yeah that was my plan. first remove all pci-passthrough vms, then disable all passthrough, then remove the controller, reboot the nodes, reinstall the new pcicontroller chart and cross fingers πŸ™‚
Ok, I was able to install the chart by modifying the addon definition and using `https://charts.harvesterhci.io` as the repo.
```yaml
apiVersion: harvesterhci.io/v1beta1
kind: Addon
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"harvesterhci.io/v1beta1","kind":"Addon","metadata":{"annotations":{},"labels":{"addon.harvesterhci.io/experimental":"true"},"name":"pcidevices-controller","namespace":"harvester-system"},"spec":{"chart":"harvester-pcidevices-controller","enabled":false,"repo":"http://harvester-cluster-repo.cattle-system.svc/charts","valuesContent":"image:\n  tag: v0.2.4\nfullnameOverride: harvester-pcidevices-controller\n","version":"0.2.4"}}
  creationTimestamp: '2023-06-05T16:28:01Z'
  generation: 22
  labels:
    addon.harvesterhci.io/experimental: 'true'
  managedFields:
    - apiVersion: harvesterhci.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:labels:
            .: {}
            f:addon.harvesterhci.io/experimental: {}
        f:spec:
          .: {}
          f:chart: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: '2023-06-05T16:28:01Z'
    - apiVersion: harvesterhci.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:enabled: {}
          f:repo: {}
          f:valuesContent: {}
          f:version: {}
      manager: harvester
      operation: Update
      time: '2023-08-29T09:52:50Z'
    - apiVersion: harvesterhci.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:status: {}
      manager: harvester
      operation: Update
      subresource: status
      time: '2023-08-29T09:52:50Z'
  name: pcidevices-controller
  namespace: harvester-system
  resourceVersion: '162404738'
  uid: cc77baa5-5293-425e-9f37-9a79d7095c36
spec:
  chart: harvester-pcidevices-controller
  enabled: true
  repo: https://charts.harvesterhci.io
  valuesContent: |
    image:
      tag: v0.2.5
    fullnameOverride: harvester-pcidevices-controller
  version: 0.2.5
status:
  status: AddonDeploySuccessful
```
However, the problem with the "KERNEL DRIVER TO UNBIND" flag for the pcideviceclaims is still present, as well as the problem with using the pci-passthrough devices when starting up my VMs. Any more ideas? Whether the driver to unbind is just a display bug or whether it is connected with not being able to start VMs with existing pcideviceclaims, I do not know. But as it stands I have to design a new method of setting up an HA minio cluster, and their best practices strongly discourage using SAN drives because of performance losses.
But I want to stress that in my first couple of weeks I did NOT have any problems with starting the VMs at all. That only started to happen after restarting the VM sets regularly. I just wish it would work again, because my minio cluster was almost ready to go :/
g
any chance i could please get a support-bundle?
the driver to unbind is just used to try and re-bind the device when the pcideviceclaim is removed
m
Ok, I sent you the support bundle and updated the github issue at https://github.com/harvester/pcidevices/issues/57 Thank you for your help so far :)
g
```
(⎈|default:N/A)➜  ~ k get pcideviceclaim haa-devops-harvester01-host03-000082000 -o yaml
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDeviceClaim
metadata:
  annotations:
    sim.harvesterhci.io/creationTimestamp: "2023-08-29T10:12:35Z"
  creationTimestamp: "2023-08-29T10:12:35Z"
  finalizers:
  - wrangler.cattle.io/PCIDeviceClaimOnRemove
  generation: 1
  name: haa-devops-harvester01-host03-000082000
  ownerReferences:
  - apiVersion: devices.harvesterhci.io/v1beta1
    kind: PCIDevice
    name: haa-devops-harvester01-host03-000082000
    uid: 327b5a68-34a0-4b88-a89c-bb205fd32e16
  resourceVersion: "589"
  uid: 040a15b0-854a-4960-bd4b-68022bbb9b4d
spec:
  address: 0000:82:00.0
  nodeName: haa-devops-harvester01-host03
  userName: admin
status:
  kernelDriverToUnbind: vfio-pci
  passthroughEnabled: true
```
is this the device in question?
m
Yes, it seems so, because when I do not attach this specific drive the VM will come up. But when I delete the VM and recreate it with this drive attached, I get the error again
g
yeah i see the issue
i was away for a few days.. when you have a chance can you please get me the output of
kubectl get pcideviceclaim haa-devops-harvester01-host03-000082000 -o yaml
when the claim is deleted it tries to rebind the device to its original driver.. which is stored as an annotation on the pcidevice object..
```yaml
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDevice
metadata:
  annotations:
    harvesterhci.io/pcideviceDriver: nvme
    sim.harvesterhci.io/creationTimestamp: "2023-07-24T13:40:00Z"
  creationTimestamp: "2023-07-24T13:40:00Z"
  generation: 1
  labels:
    nodename: haa-devops-harvester01-host03
  name: haa-devops-harvester01-host03-000082000
  resourceVersion: "1395"
  uid: 4f9dab23-7cd8-4e3d-8932-987d1eb42786
spec: {}
status:
  address: 0000:82:00.0
  classId: "0108"
  description: 'Non-Volatile memory controller: Western Digital Ultrastar DC SN640
    NVMe SSD'
  deviceId: "2400"
  iommuGroup: "149"
  kernelDriverInUse: vfio-pci
  nodeName: haa-devops-harvester01-host03
  resourceName: western.com/ULTRASTAR_DC_SN640_NVME_SSD
  vendorId: 1b96
```
```yaml
addresses:
  - address: 172.16.0.103
    type: InternalIP
  - address: haa-devops-harvester01-host03
    type: Hostname
allocatable:
  cpu: "128"
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: "3567118004294"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 1056627872Ki
  pods: "110"
  western.com/ULTRASTAR_DC_SN640_NVME_SSD: "3"
capacity:
  cpu: "128"
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 3666856504Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 1056627872Ki
  pods: "110"
  western.com/ULTRASTAR_DC_SN640_NVME_SSD: "4"
```
i can see the node status reports 4 of these nvme SSDs in capacity, of which only 3 are allocatable..
your VM is not in the default namespace so i cannot see the pod launcher logs
any chance of getting another support bundle with the extra namespace included?
m
Hi, of course I will try to get you the logs. How can I extend the support bundle?
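(One likely way to include the extra namespace, assuming the support-bundle-namespaces setting is available in this Harvester release, is roughly:)
```bash
# assumption: the Harvester setting "support-bundle-namespaces" exists in this release
# and takes a comma-separated list of extra namespaces to collect
kubectl edit settings.harvesterhci.io support-bundle-namespaces
# set .value to the namespace the VMs run in, e.g.
#   value: "<your-vm-namespace>"
# then generate the support bundle again from the UI
```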
```yaml
apiVersion: devices.harvesterhci.io/v1beta1
kind: PCIDeviceClaim
metadata:
  creationTimestamp: "2023-08-29T10:12:35Z"
  finalizers:
  - wrangler.cattle.io/PCIDeviceClaimOnRemove
  generation: 1
  name: haa-devops-harvester01-host03-000082000
  ownerReferences:
  - apiVersion: devices.harvesterhci.io/v1beta1
    kind: PCIDevice
    name: haa-devops-harvester01-host03-000082000
    uid: 327b5a68-34a0-4b88-a89c-bb205fd32e16
  resourceVersion: "162429444"
  uid: dc3fe2e6-09a3-4196-be70-10c83094bfd9
spec:
  address: 0000:82:00.0
  nodeName: haa-devops-harvester01-host03
  userName: admin
status:
  kernelDriverToUnbind: vfio-pci
  passthroughEnabled: true
```
FYI, I have already tried in the past to change the `harvesterhci.io/pcideviceDriver: nvme` annotation, but that did not have any immediate consequences on the PCIDeviceClaim.
BTW, I might have found the problem while trying to set up minio VMs without any PCI passthrough at all and still running into the same errors. I think it might be a problem with the VM template used, which somehow still tries to use the pci-devices it was created with, because after I stopped using the VM template and created each VM manually, only using the cloud-config templates, the problem went away. I will throw away all the VMs again and try setting them up once more, and if that works I might have a viable workaround, at least for us. Could that have any bearing on the problem, what do you think? In any case I will still try to help get to the bottom of the other buggy behavior with the KERNEL DRIVER TO UNBIND.
g
thanks for the new support bundle.. looks like the VM is looking for 4 disks..
```
Limits:
  cpu:                                     8
  devices.kubevirt.io/kvm:                 1
  devices.kubevirt.io/tun:                 1
  devices.kubevirt.io/vhost-net:           1
  memory:                                  34921130Ki
  western.com/ULTRASTAR_DC_SN640_NVME_SSD: 4
Requests:
  cpu:                                     500m
  devices.kubevirt.io/kvm:                 1
  devices.kubevirt.io/tun:                 1
  devices.kubevirt.io/vhost-net:           1
  ephemeral-storage:                       50M
  memory:                                  23735978Ki
  western.com/ULTRASTAR_DC_SN640_NVME_SSD: 4
```
while the node only has 3 available
```
(⎈|default:devops)➜  ~ k get node haa-devops-harvester01-host03 -o yaml | yq .status.allocatable
cpu: "128"
devices.kubevirt.io/kvm: 1k
devices.kubevirt.io/tun: 1k
devices.kubevirt.io/vhost-net: 1k
ephemeral-storage: "3567118004294"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 1056627872Ki
pods: "110"
western.com/ULTRASTAR_DC_SN640_NVME_SSD: "3"
```
m
ahhh ok, how is the status.allocatable determined? I mean, when I look at /sys/bus/pci/drivers/vfio-pci all the drives are bound there?
g
it does look at the vfio-binding to query..
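if you want to compare the two views yourself, roughly:
```bash
# what the kernel says is bound to vfio-pci on the node
ls /sys/bus/pci/drivers/vfio-pci/
# what the device plugin has advertised to the kubelet (node name from this thread)
kubectl get node haa-devops-harvester01-host03 -o yaml | yq .status.allocatable
```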
you could edit the pcidevices-controller ds.. and add an environment variable
DEBUG_LOGGING=true
πŸ™Œ 1
this should generate more information from the device plugin
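for example (the DaemonSet name is the one assumed earlier in this thread, adjust if yours differs):
```bash
# add the env var to the controller DaemonSet; kubectl rolls the pods automatically
kubectl -n harvester-system set env daemonset/harvester-pcidevices-controller DEBUG_LOGGING=true
```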
device plugin only checks devices are bound to vfio if i remember correctly
unless something else has this device already allocatable
m
that seems to be the problem: the drive is bound on the OS layer to the vfio-pci driver while at the same time not being present as allocatable on the node at the harvester/kubevirt level
I will try to activate debug logging and try to reproduce the error. Yesterday I was able to get all 5 VMs running with 4 NVMe drives each, but that was after I deactivated the pci-devices addon and then set up all 5 VMs without using the template. So you don't think the VM template had anything to do with this problem, I gather?
g
hopefully the logging can show something useful
m
Now I cannot reproduce the error 😞
I have deleted and created the VMs with passthrough a couple of times now but I cannot reproduce the problem. I am not sure how to proceed. This is the behavior that I want from harvester, and it is how it worked the first few weeks, but I am quite certain that someday in the future I will be hit by this error again, probably during a disaster recovery scenario Β―\_(ツ)_/Β―
g
i will keep the issue open
i know there is an issue in older versions of the pcidevices controller which we fixed in 0.2.5
i wonder if the pod was not restarted, because the issue we fixed addresses the reconciliation of available devices in the node status
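if it comes up again, an explicit restart of the controller pods would rule that out (DaemonSet name assumed, as above):
```bash
# force the pcidevices-controller pods to be recreated
kubectl -n harvester-system rollout restart daemonset/harvester-pcidevices-controller
```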
m
I am sure that the pcidevices pods were restarted, but maybe some other components like the virt-handler were not.
FYI, we are reinstalling the bare-metal nodes with harvester 1.2 in the hope that the bug is gone for good. If it returns I will update the issue
πŸ‘ 1
Just to finish up on this: after a couple of days and some more testing the issue thankfully hasn't come up again. We are going with PCI passthrough drives for our MinIO cluster after all, and if the issue turns up again in a disaster recovery scenario we will have to deal with it then. I just want to say thank you for your help πŸ™‚
πŸ‘ 1