# general
a
• Rancher v2.8.2 (5-node VMs on VMware, running for close to a year)
• Harvester v1.3.1 (two different clusters, less than a week old, on dedicated physical hardware)
• Rancher is front-ended by NGINX, which provides health checks and SSL offload, and also hosts the public TLS certificate.

Rancher can deploy clusters to downstream OpenStack and VMware environments with no issues. However, when attempting to deploy Kubernetes clusters to recently imported Harvester clusters, no VMs are created, and the associated `fleet-default` pods error out with:
Downloading driver from https://<public_rancher_URL>/assets/docker-machine-driver-harvester
Doing /etc/rancher/ssl
ls: cannot access 'docker-machine-driver-*': No such file or directory
downloaded file  failed sha256 checksum
download of driver from https://<public_rancher_URL>/assets/docker-machine-driver-harvester failed
I have verified that the `docker-machine-driver-harvester` file is present, both by visiting the /assets API URL and by running `ls` against the Rancher pods. I can also `curl` the file from other VMs in Harvester, so I do not believe there is a network path issue between Harvester and Rancher.

What I suspect is occurring, but am unsure how to prove or troubleshoot, is that Fleet spins up a machine pod in `fleet-default` that attempts to pull down the machine driver using the public URL of the Rancher cluster. This request goes through the NGINX load balancer (which is fine) and reaches a Rancher pod. However, I suspect that the Rancher pod then tries to return its response directly to the Fleet machine pod rather than through the NGINX load balancer. That would create an asymmetric path and cause the download of the Harvester node driver to fail.

Again, this is just a suspicion, but I am trying to find out:
1. How to prove that the asymmetric path is the root cause of my issue (Fleet machine pod unable to download the Harvester machine driver from the Rancher pod).
2. How to resolve the issue, i.e. is there a way to make Fleet recognize that the Rancher pod hosting the Harvester driver is in the same Rancher management cluster, and use a ClusterIP (or similar) instead of reaching out to the external NGINX load balancer?

Hoping someone has run into this before and can provide some pointers. Thanks!
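One way to separate a path problem from a certificate problem is to repeat the download twice from the same network location as the machine pod: once with normal certificate verification and once without it (the equivalent of `curl -k`). The following is a minimal sketch, not what the Rancher machine job actually runs; the function names are made up for illustration, and it assumes the trust store where it runs resembles the one in the machine pod. Note also that the log prints `downloaded file  failed sha256 checksum` with an empty file name, which suggests nothing was saved at all, so the checksum failure is probably a consequence of the failed download rather than a separate fault.

```python
import hashlib
import ssl
import sys
import urllib.request


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def fetch(url: str, verify: bool, timeout: float = 10.0) -> bytes:
    """Download url; verify=False skips certificate checks, like `curl -k`."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return resp.read()


def classify(verified_ok: bool, unverified_ok: bool) -> str:
    """Turn the two probe results into a rough diagnosis (illustrative only)."""
    if verified_ok:
        return "download works: path and certificate trust look fine from here"
    if unverified_ok:
        return "certificate trust problem: the path works once verification is off"
    return "connectivity problem: the download fails even without verification"


def probe(url: str) -> str:
    """Try the download with and without verification and summarize."""
    ok = {}
    for verify in (True, False):
        try:
            data = fetch(url, verify)
            ok[verify] = True
            print(f"verify={verify}: {len(data)} bytes, sha256={sha256_hex(data)}")
        except OSError as exc:  # URLError, SSLError and timeouts are all OSError
            ok[verify] = False
            print(f"verify={verify}: failed: {exc}")
    return classify(ok[True], ok[False])


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the driver URL from the error message as the only argument.
    print(probe(sys.argv[1]))
```

If the verified attempt fails while the unverified one succeeds, an asymmetric return path is unlikely, since that would be expected to break both attempts. The printed digest can also be compared by hand against the checksum Rancher has recorded for the node driver.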
g
@acoustic-addition-45641 I'm running into something similar when trying to create an RKE2 cluster using a custom node driver and running Rancher locally in Docker Desktop:
failureMessage: |-
  Failure detected from referenced resource rke-machine.cattle.io/v1, Kind=TritonMachine with name "chad-test-1-pool1-8ec441f2-6bdp9": Downloading driver from http://localhost/assets/docker-machine-driver-triton
  Doing /etc/rancher/ssl
  ls: cannot access 'docker-machine-driver-*': No such file or directory
  downloaded file  failed sha256 checksum
  download of driver from http://localhost/assets/docker-machine-driver-triton failed
failureReason: CreateError
Seems like it's related to the SSL cert (https://github.com/rancher/machine/blob/9183b3ff738e16ece4391a2e6bcc8ef88889e8ae/package/download_driver.sh#L15). Did you ever figure this out?
a
Unfortunately, I have not. I plan to spin up a test Rancher manager instance that handles HTTPS in the cluster rather than using an external load balancer. This aligns more closely with the supported architecture. I just need to make time to do it.
g
Sounds good. I tested on a separate Rancher instance that has a valid SSL cert and did not encounter the error, so for me it is specific to the self-signed cert Rancher generates when running locally. I tried to find a way to make `download_driver.sh` use `-k`, but I don't know how to patch the `rancher/machine` image with containerd. I may just spoof the domain locally with `/etc/hosts` for development and use the SSL cert from our other Rancher instance.
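Before editing `/etc/hosts`, it may help to check that the borrowed certificate would validate for the spoofed name. The sketch below connects to an IP address but verifies the certificate against a chosen hostname, which is roughly what a hosts-file entry would cause. The helper names are hypothetical, the IP and hostname in the usage note are placeholders, and the numeric codes are the OpenSSL `X509_V_ERR_*` values that Python exposes as `verify_code`.

```python
import socket
import ssl
from typing import Optional

# A few common OpenSSL verification error codes (X509_V_ERR_*).
VERIFY_ERRORS = {
    10: "certificate has expired",
    18: "self-signed certificate",
    19: "self-signed certificate in chain",
    20: "unable to get local issuer certificate",
    62: "hostname mismatch",
}


def explain_verify_error(code: int) -> str:
    """Map an OpenSSL verify code to a short description."""
    return VERIFY_ERRORS.get(code, f"verification error {code}")


def check_cert(ip: str, hostname: str, port: int = 443,
               cafile: Optional[str] = None, timeout: float = 5.0) -> str:
    """Connect to ip but validate the served certificate for hostname.

    cafile optionally points at a CA bundle to trust instead of the
    system defaults. Connection errors (refused, timeout) propagate.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            # server_hostname sets SNI and the name the certificate must match.
            with ctx.wrap_socket(sock, server_hostname=hostname):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        return explain_verify_error(exc.verify_code)
```

For example, `check_cert("127.0.0.1", "rancher.example.com")` would be expected to return `"ok"` only if the local Rancher serves a certificate that is valid for that placeholder name and chains to a trusted CA; a result of `"self-signed certificate"` would match the behavior described above.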