This message was deleted Rancher Users #harvester

# harvester

adamant-kite-43734

02/15/2025, 5:49 PM

This message was deleted.

gray-room-77418

02/15/2025, 5:50 PM

(I can SSH into the management address of the host, and from there I can ping and resolve DNS for the other nodes, plus the cluster VIP)

gray-room-77418

02/15/2025, 5:50 PM

Harvester 1.4.1, in case that helps

gray-room-77418

02/16/2025, 8:10 PM

I think I may have found the reason:

level=info msg="failed to bootstrap system, will retry: generating plan: Get \"https://10.10.125.110:443/system-agent-install.sh\": tls: failed to verify certificate: x509: cannot validate certificate for 10.10.125.110 because it doesn't contain any IP SANs"
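
To illustrate why that error appears when the join URL uses the VIP's IP address: a client checking an IP literal only looks at IP-type SAN entries, so a cert that carries only DNS names cannot match. Below is a minimal Python sketch of that rule. The function name `host_matches_sans` is hypothetical, and the input shape is assumed to be the `subjectAltName` tuple that Python's `ssl.SSLSocket.getpeercert()` returns; it is not rancherd's actual code path.

```python
import ipaddress


def host_matches_sans(host, subject_alt_name):
    """Return True if `host` is covered by the given SAN entries.

    `subject_alt_name` is a sequence of (type, value) pairs, e.g.
    (("DNS", "example.com"), ("IP Address", "192.0.2.1")).
    """
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, treat it as a DNS name

    for kind, value in subject_alt_name:
        if ip is not None:
            # IP literals are only compared against "IP Address" entries.
            if kind == "IP Address" and ipaddress.ip_address(value) == ip:
                return True
        elif kind == "DNS" and value.lower() == host.lower():
            # Exact match only; wildcard handling is left out of this sketch.
            return True
    return False


# A cert with a single DNS SAN, like the one shown later in the thread.
dns_only = (("DNS", "harvester.v-it.pro"),)
print(host_matches_sans("10.10.125.110", dns_only))       # False
print(host_matches_sans("harvester.v-it.pro", dns_only))  # True
```

This is why pointing the joining node at the FQDN instead of the IP gets past this particular check.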

gray-room-77418

02/16/2025, 8:12 PM

Will redeploy tomorrow and use the FQDN on the cert for the VIP instead.

gray-room-77418

02/17/2025, 9:35 AM

Looping back on this one: reinstalled the host and pointed to the cluster VIP FQDN, but still seeing issues joining the cluster.

gray-room-77418

02/17/2025, 9:35 AM

rancherd.service - Rancher Bootstrap
     Loaded: loaded (/lib/systemd/system/rancherd.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rancherd.service.d
             └─override.conf
     Active: activating (start) since Mon 2025-02-17 09:31:38 UTC; 2min 15s ago
       Docs: https://github.com/rancher/rancherd
   Main PID: 2306 (rancherd)
      Tasks: 25
     CGroup: /system.slice/rancherd.service
             └─ 2306 /usr/bin/rancherd bootstrap

Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20250217-093340-applied.plan/_0"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="Running command: update-ca-certificates []"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20250217-093340-applied.plan/_1"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="Running command: /usr/bin/env [sh /var/lib/rancher/rancherd/install.sh]"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  CA strict verification is set to true"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  Using default agent configuration directory /etc/rancher/agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  Using default agent var directory /var/lib/rancher/agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stderr]: [WARN]  /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stderr]: [FATAL]  Aborting system-agent installation due to requested strict CA verification with no CA checksum provided"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="failed to bootstrap system, will retry: running plan: error executing instruction 1: exit status 1"
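
The FATAL line above says strict CA verification was requested but no CA checksum was supplied. As a rough sketch only: in Rancher's agent registration flow, the CA checksum is, as far as I know, the SHA-256 digest of the CA certificate file exactly as served. Whether the Harvester 1.4.1 join flow derives it the same way is an assumption and is not confirmed anywhere in this thread. The helper name `ca_checksum` and the placeholder bytes are hypothetical.

```python
import hashlib


def ca_checksum(pem_bytes):
    """Return the SHA-256 hex digest of a CA bundle, byte for byte."""
    return hashlib.sha256(pem_bytes).hexdigest()


# Placeholder bytes standing in for a saved CA file; not a real certificate.
example = (
    b"-----BEGIN CERTIFICATE-----\n"
    b"PLACEHOLDER\n"
    b"-----END CERTIFICATE-----\n"
)
digest = ca_checksum(example)
print(len(digest))  # 64 hex characters
```

The digest is sensitive to every byte, so a missing trailing newline or a reordered bundle gives a different value.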

gray-room-77418

02/17/2025, 9:36 AM

When originally deploying the cluster, I provided the VIP IP, and all was fine. After deployment, I added a Let's Encrypt cert to the cluster, so this is likely related to that.

gray-room-77418

02/17/2025, 5:54 PM

Looks like this is the cause https://github.com/harvester/harvester/issues/2199

gray-room-77418

02/17/2025, 8:59 PM

OK, so I thought it was this, but my certs are already structured the way the workaround in that bug suggests (i.e. my public cert is the VIP cert + the Let's Encrypt intermediate, and the root is... the root). From the node that is trying to join the cluster, I see the exact same errors in journalctl, yet curl suggests that the certs are in fact fine.

# curl --verbose https://harvester.v-it.pro
*   Trying x.x.x.x:443...
* Connected to harvester.v-it.pro (x.x.x.x) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=harvester.v-it.pro
*  start date: Jan 21 20:20:48 2025 GMT
*  expire date: Apr 21 20:20:47 2025 GMT
*  subjectAltName: host "harvester.v-it.pro" matched cert's "harvester.v-it.pro"
*  issuer: C=US; O=Let's Encrypt; CN=E5
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /]
* h2h3 [:scheme: https]
* h2h3 [:authority: harvester.v-it.pro]
* h2h3 [user-agent: curl/8.0.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x55e8157443e0)
> GET / HTTP/2
> Host: harvester.v-it.pro
> user-agent: curl/8.0.1
> accept: */*
>
< HTTP/2 302
< date: Mon, 17 Feb 2025 20:53:37 GMT
< content-type: text/html; charset=utf-8
< content-length: 34
< cache-control: no-cache, no-store, must-revalidate
< location: /dashboard/
< x-api-cattle-auth: false
< x-content-type-options: nosniff
< strict-transport-security: max-age=31536000; includeSubDomains
<
<a href="/dashboard/">Found</a>.

* Connection #0 to host harvester.v-it.pro left intact

Any ideas on this one?
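
One way to double-check the chain layout described above (VIP cert plus intermediate in the public cert, root kept separately) is to count the PEM blocks in each file. Here is a minimal Python sketch; the function name `split_pem_bundle` and the placeholder bundle are hypothetical, and it only splits on the PEM markers without parsing or validating the certificates.

```python
import re

# Matches each certificate block, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text):
    """Return the certificate blocks found in a PEM bundle, in file order."""
    return PEM_BLOCK.findall(text)


# Placeholder bundle: leaf first, then intermediate (not real certificates).
bundle = "\n".join([
    "-----BEGIN CERTIFICATE-----\nLEAF-PLACEHOLDER\n-----END CERTIFICATE-----",
    "-----BEGIN CERTIFICATE-----\nINTERMEDIATE-PLACEHOLDER\n-----END CERTIFICATE-----",
])
certs = split_pem_bundle(bundle)
print(len(certs))  # 2
```

A count of two in the public cert and one in the CA file would match the layout described in the message above. It says nothing about whether the joining node is given a CA checksum.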
