# harvester
(I can SSH into the management address of the host, and from there I can ping and resolve DNS for the other nodes, plus the cluster VIP)
Harvester 1.4.1, in case that helps
I think I may have found the reason:
level=info msg="failed to bootstrap system, will retry: generating plan: Get \"https://10.10.125.110:443/system-agent-install.sh\": tls: failed to verify certificate: x509: cannot validate certificate for 10.10.125.110 because it doesn't contain any IP SANs
I'll redeploy tomorrow and use the FQDN that is on the cert for the VIP instead.
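For reference, the x509 error above comes down to whether the address being dialed appears in the certificate's subjectAltName list. Below is a minimal sketch of that check, assuming a certificate dict in the shape returned by Python's `ssl.SSLSocket.getpeercert()`. The helper name `san_covers` and the sample dict are hypothetical, and wildcard names are deliberately not handled.

```python
import ipaddress


def san_covers(cert: dict, target: str) -> bool:
    """Return True if `target` (an IP or a DNS name) is listed in the cert's SANs.

    `cert` is assumed to look like the dict from ssl.SSLSocket.getpeercert().
    Wildcard entries such as *.example.com are NOT matched in this sketch.
    """
    sans = cert.get("subjectAltName", ())
    try:
        ip = ipaddress.ip_address(target)
    except ValueError:
        ip = None  # not an IP literal, so treat it as a DNS name

    for kind, value in sans:
        if ip is not None:
            # IP targets only match "IP Address" entries, never DNS entries
            if kind == "IP Address" and ipaddress.ip_address(value.strip()) == ip:
                return True
        elif kind == "DNS" and value.lower() == target.lower():
            return True
    return False


# Hypothetical cert that only carries a DNS SAN, like the one described here
sample_cert = {"subjectAltName": (("DNS", "harvester.v-it.pro"),)}
print(san_covers(sample_cert, "10.10.125.110"))
print(san_covers(sample_cert, "harvester.v-it.pro"))
```

With a cert like the sample, dialing the VIP by IP fails the check while dialing the FQDN passes, which matches the error message.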
Looping back on this one: reinstalled the host and pointed to the cluster VIP FQDN, but still seeing issues joining the cluster.
rancherd.service - Rancher Bootstrap
     Loaded: loaded (/lib/systemd/system/rancherd.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rancherd.service.d
             └─override.conf
     Active: activating (start) since Mon 2025-02-17 09:31:38 UTC; 2min 15s ago
       Docs: https://github.com/rancher/rancherd
   Main PID: 2306 (rancherd)
      Tasks: 25
     CGroup: /system.slice/rancherd.service
             └─ 2306 /usr/bin/rancherd bootstrap

Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20250217-093340-applied.plan/_0"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="Running command: update-ca-certificates []"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20250217-093340-applied.plan/_1"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="Running command: /usr/bin/env [sh /var/lib/rancher/rancherd/install.sh]"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  CA strict verification is set to true"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  Using default agent configuration directory /etc/rancher/agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stdout]: [INFO]  Using default agent var directory /var/lib/rancher/agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stderr]: [WARN]  /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="[stderr]: [FATAL]  Aborting system-agent installation due to requested strict CA verification with no CA checksum provided"
Feb 17 09:33:40 harvester1 rancherd[2306]: time="2025-02-17T09:33:40Z" level=info msg="failed to bootstrap system, will retry: running plan: error executing instruction 1: exit status 1"
When originally deploying the cluster, I provided the VIP IP, and all was fine. After deployment, I added a Let's Encrypt cert to the cluster, so this is likely related to that.
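The FATAL line says strict CA verification was requested but no CA checksum was supplied. As a rough illustration only: to my knowledge, the checksum the Rancher system-agent installer compares against is the SHA-256 hex digest of the CA certificate PEM as served by the cluster. The sketch below computes such a digest. The function name `ca_checksum` and the file name in the usage comment are hypothetical, and how rancherd on Harvester 1.4.1 is meant to supply the value is not covered here.

```python
import hashlib


def ca_checksum(pem_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the given PEM bytes.

    Assumption: the installer compares this digest against the CA PEM
    exactly as served, so any whitespace difference changes the result.
    """
    return hashlib.sha256(pem_bytes).hexdigest()


# Usage (hypothetical file name):
#   with open("cacerts.pem", "rb") as f:
#       print(ca_checksum(f.read()))
```

Comparing this digest on the joining node with the one computed on an existing node would at least show whether both see the same CA bundle.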
OK, so I thought it was this, but my certs are already structured the way the workaround detailed in this bug suggests (i.e., my public cert is the VIP cert + the Let's Encrypt intermediate, and then the root is... the root). On the node that is trying to join the cluster, I see the exact same errors in journalctl, but curl suggests that the certs are in fact fine.
# curl --verbose https://harvester.v-it.pro
*   Trying x.x.x.x:443...
* Connected to harvester.v-it.pro (x.x.x.x) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=harvester.v-it.pro
*  start date: Jan 21 20:20:48 2025 GMT
*  expire date: Apr 21 20:20:47 2025 GMT
*  subjectAltName: host "harvester.v-it.pro" matched cert's "harvester.v-it.pro"
*  issuer: C=US; O=Let's Encrypt; CN=E5
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /]
* h2h3 [:scheme: https]
* h2h3 [:authority: harvester.v-it.pro]
* h2h3 [user-agent: curl/8.0.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x55e8157443e0)
> GET / HTTP/2
> Host: harvester.v-it.pro
> user-agent: curl/8.0.1
> accept: */*
>
< HTTP/2 302
< date: Mon, 17 Feb 2025 20:53:37 GMT
< content-type: text/html; charset=utf-8
< content-length: 34
< cache-control: no-cache, no-store, must-revalidate
< location: /dashboard/
< x-api-cattle-auth: false
< x-content-type-options: nosniff
< strict-transport-security: max-age=31536000; includeSubDomains
<
<a href="/dashboard/">Found</a>.

* Connection #0 to host harvester.v-it.pro left intact
Any ideas on this one?