# harvester
a
...and all node and VIP addresses were statically assigned, so this is not a DHCP-related issue (forgot to mention that). The failed node can run curl -k https://<harvester-VIP> and get the "dashboard" response, so the network path is not an issue (all nodes and the VIP are on the same Layer-2 VLAN).
r
Do all three nodes have the same issue?
a
Not at the moment. Only the node that was rebooted is currently in this state.
r
Could you check whether rke2-server is active?
systemctl status rke2-service.service
a
harvester-test-02:~ # systemctl status rke2-service.service
Unit rke2-service.service could not be found.
r
Sorry, my bad. It should be rke2-server.service
a
No worries. I piped the command to less. Here is the output:
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/etc/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rke2-server.service.d
             └─override.conf
     Active: activating (start) since Thu 2024-07-18 15:32:02 UTC; 13min ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 11659 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 11661 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 11662 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 11663 ExecStartPre=/usr/sbin/harv-update-rke2-server-url server (code=exited, status=0/SUCCESS)
   Main PID: 11665 (rke2)
      Tasks: 22
     CGroup: /system.slice/rke2-server.service
             └─ 11665 "/opt/rke2/bin/rke2 server"

Jul 18 15:44:57 harvester-test-02 rke2[11665]: time="2024-07-18T15:44:57Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: failed to write cert /var/lib/rancher/rke2/agent/client-kube-proxy.crt: open /var/lib/rancher/rke2/agent/client-kube-proxy.crt: is a directory"
Jul 18 15:45:03 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:03Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
Jul 18 15:45:03 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:03Z" level=info msg="Waiting for etcd server to become available"
Jul 18 15:45:03 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:03Z" level=info msg="Waiting for API server to become available"
Jul 18 15:45:04 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:04Z" level=info msg="certificate CN=harvester-test-02 signed by CN=rke2-server-ca@1719517289: notBefore=2024-06-27 19:41:29 +0000 UTC notAfter=2025-07-18 15:45:04 +0000 UTC"
Jul 18 15:45:05 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:05Z" level=info msg="certificate CN=system:node:harvester-test-02,O=system:nodes signed by CN=rke2-client-ca@1719517289: notBefore=2024-06-27 19:41:29 +0000 UTC notAfter=2025-07-18 15:45:05 +0000 UTC"
Jul 18 15:45:05 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:05Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: failed to write cert /var/lib/rancher/rke2/agent/client-kube-proxy.crt: open /var/lib/rancher/rke2/agent/client-kube-proxy.crt: is a directory"
Jul 18 15:45:11 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:11Z" level=info msg="certificate CN=harvester-test-02 signed by CN=rke2-server-ca@1719517289: notBefore=2024-06-27 19:41:29 +0000 UTC notAfter=2025-07-18 15:45:11 +0000 UTC"
Jul 18 15:45:12 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:12Z" level=info msg="certificate CN=system:node:harvester-test-02,O=system:nodes signed by CN=rke2-client-ca@1719517289: notBefore=2024-06-27 19:41:29 +0000 UTC notAfter=2025-07-18 15:45:12 +0000 UTC"
Jul 18 15:45:12 harvester-test-02 rke2[11665]: time="2024-07-18T15:45:12Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: failed to write cert /var/lib/rancher/rke2/agent/client-kube-proxy.crt: open /var/lib/rancher/rke2/agent/client-kube-proxy.crt: is a directory"
r
Could you check if the file is actually there?
ls -l /var/lib/rancher/rke2/agent/
a
Appears to be.
harvester-test-02:~ # ls -l /var/lib/rancher/rke2/agent/
total 64
-rw-------  1 root root  570 Jul 18 15:58 client-ca.crt
drwxr-xr-x  2 root root 4096 Jul 10 17:20 client-kube-proxy.crt
drwxr-xr-x  2 root root 4096 Jul 10 17:20 client-kube-proxy.key
-rw-------  1 root root 1193 Jul 18 15:58 client-kubelet.crt
-rw-------  1 root root  227 Jul 18 15:58 client-kubelet.key
drwx------ 17 root root 4096 Jul 16 00:11 containerd
drwx------  3 root root 4096 Jun 27 20:26 etc
drwxr-xr-x  2 root root 4096 Jun 27 20:37 images
-rw-------  1 root root  464 Jul 18 15:58 kubelet.kubeconfig
-rw-------  1 root root  470 Jul 10 16:20 kubeproxy.kubeconfig
drwxr-xr-x  2 root root 4096 Jul 16 22:04 logs
drwx------  2 root root 4096 Jul 17 19:23 pod-manifests
-rw-------  1 root root  480 Jul 10 16:20 rke2controller.kubeconfig
-rw-------  1 root root  574 Jul 18 15:58 server-ca.crt
-rw-------  1 root root 1226 Jul 18 15:58 serving-kubelet.crt
-rw-------  1 root root  227 Jul 18 15:58 serving-kubelet.key
I wonder how the client-kube-proxy files got changed into directories. My other two (working) nodes show this, for comparison:
harvester-test-03:~ # ls /var/lib/rancher/rke2/agent/ -l
total 72
-rw-------  1 root root  570 Jul 10 17:18 client-ca.crt
-rw-------  1 root root 1149 Jul 10 17:18 client-kube-proxy.crt
-rw-------  1 root root  227 Jul 10 17:18 client-kube-proxy.key
-rw-------  1 root root 1197 Jul 10 17:18 client-kubelet.crt
-rw-------  1 root root  227 Jul 10 17:18 client-kubelet.key
-rw-------  1 root root 1157 Jul 10 17:18 client-rke2-controller.crt
-rw-------  1 root root  227 Jul 10 17:18 client-rke2-controller.key
drwx------ 17 root root 4096 Jul 10 17:18 containerd
drwx------  3 root root 4096 Jun 27 20:36 etc
drwxr-xr-x  2 root root 4096 Jun 27 20:39 images
-rw-------  1 root root  464 Jul 10 17:18 kubelet.kubeconfig
-rw-------  1 root root  470 Jul 10 17:18 kubeproxy.kubeconfig
drwxr-xr-x  2 root root 4096 Jun 27 20:36 logs
drwx------  2 root root 4096 Jul 10 17:18 pod-manifests
-rw-------  1 root root  480 Jul 10 17:18 rke2controller.kubeconfig
-rw-------  1 root root  574 Jul 10 17:18 server-ca.crt
-rw-------  1 root root 1222 Jul 10 17:18 serving-kubelet.crt
-rw-------  1 root root  227 Jul 10 17:18 serving-kubelet.key
r
Could you also check the rancher-system-agent status?
systemctl status rancher-system-agent.service
a
harvester-test-02:~ # systemctl status rancher-system-agent.service | more
● rancher-system-agent.service - Rancher System Agent
     Loaded: loaded (/etc/systemd/system/rancher-system-agent.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rancher-system-agent.service.d
             └─env.conf
     Active: activating (auto-restart) (Result: exit-code) since Thu 2024-07-18 17:07:49 UTC; 906ms ago
       Docs: https://www.rancher.com
    Process: 32529 ExecStart=/opt/rancher-system-agent/bin/rancher-system-agent sentinel (code=exited, status=1/FAILURE)
   Main PID: 32529 (code=exited, status=1/FAILURE)
r
Is there any log for it?
a
Do you mean /var/log/console.log (system), the RKE2 log, or the Rancher system agent log? I may need some command direction if it is the RKE2 or agent logs that I need to pull. And huge thanks for helping with this.
r
Oh, I mean the rancher-system-agent log:
journalctl -u rancher-system-agent -f
a
harvester-test-02:~ # journalctl -u rancher-system-agent -f
Jul 19 13:51:08 harvester-test-02 rancher-system-agent[306]: time="2024-07-19T13:51:08Z" level=info msg="Starting remote watch of plans"
Jul 19 13:51:11 harvester-test-02 rancher-system-agent[306]: time="2024-07-19T13:51:11Z" level=fatal msg="error while connecting to Kubernetes cluster: Get \"https://10.53.232.72/version\": dial tcp 10.53.232.72:443: connect: no route to host"
Jul 19 13:51:11 harvester-test-02 systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 13:51:11 harvester-test-02 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
Jul 19 13:51:16 harvester-test-02 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 15231.
Jul 19 13:51:16 harvester-test-02 systemd[1]: Stopped Rancher System Agent.
Jul 19 13:51:16 harvester-test-02 systemd[1]: Started Rancher System Agent.
Jul 19 13:51:16 harvester-test-02 rancher-system-agent[328]: time="2024-07-19T13:51:16Z" level=info msg="Rancher System Agent version v0.3.6 (41c07d0) is starting"
Jul 19 13:51:16 harvester-test-02 rancher-system-agent[328]: time="2024-07-19T13:51:16Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jul 19 13:51:16 harvester-test-02 rancher-system-agent[328]: time="2024-07-19T13:51:16Z" level=info msg="Starting remote watch of plans"
r
Seems to point to the same cause. Could you confirm that the two directories (which ought to be files) contain nothing under them?
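For example, something like this (just one possible check, any equivalent listing works; empty output would mean they are empty):
find /var/lib/rancher/rke2/agent/client-kube-proxy.crt /var/lib/rancher/rke2/agent/client-kube-proxy.key -mindepth 1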
a
Both appear to be directories with no files beneath them:
harvester-test-02:/var/lib/rancher/rke2/agent/client-kube-proxy.crt # ls -al
total 8
drwxr-xr-x 2 root root 4096 Jul 10 17:20 .
drwxr-xr-x 9 root root 4096 Jul 17 19:23 ..
harvester-test-02:/var/lib/rancher/rke2/agent/client-kube-proxy.crt # cd ..
harvester-test-02:/var/lib/rancher/rke2/agent # cd client-kube-proxy.key/
harvester-test-02:/var/lib/rancher/rke2/agent/client-kube-proxy.key # ls -al
total 8
drwxr-xr-x 2 root root 4096 Jul 10 17:20 .
drwxr-xr-x 9 root root 4096 Jul 17 19:23 ..
harvester-test-02:/var/lib/rancher/rke2/agent/client-kube-proxy.key #
r
I think the easy fix is to remove those two directories. However, the root cause is still unknown. If you have time, would you help us file a new GitHub issue for this? A support bundle would be great. Thank you!
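For reference, a minimal sketch of that cleanup, assuming the paths from the listing above and one conservative ordering (not commands anyone has run yet):
# stop the unit, remove the two empty directories, then start it again
systemctl stop rke2-server.service
rmdir /var/lib/rancher/rke2/agent/client-kube-proxy.crt /var/lib/rancher/rke2/agent/client-kube-proxy.key   # rmdir only succeeds if they really are empty
systemctl start rke2-server.service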
a
Thanks! Question: If I remove the two directories, will the agent rebuild/import the necessary .key and .crt files? Or do I also need to plan on copying the files from a running server? As these servers are for lab testing, I have no issues with leaving the server in the current state and filing a GitHub issue (and attaching a support bundle). Happy to help!
🙌 1
r
If I remove the two directories, will the agent rebuild/import the necessary .key and .crt files? Or do I also need to plan on copying the files from a running server?
According to the logs, rke2-server will write the cert and key to those paths on its own; removing the directories just clears the obstacle that is blocking it.
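If it helps, a hedged sketch of how one could watch that happen afterwards (assumed follow-up checks, not part of this fix):
# watch the unit log, then confirm the two paths are regular files again
# and that the regenerated client cert looks sane
journalctl -u rke2-server.service -f
ls -l /var/lib/rancher/rke2/agent/client-kube-proxy.crt /var/lib/rancher/rke2/agent/client-kube-proxy.key
openssl x509 -noout -subject -enddate -in /var/lib/rancher/rke2/agent/client-kube-proxy.crt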
👍 1
a
Issue filed: https://github.com/harvester/harvester/issues/6211. Generating the support bundle now and will attach it to the issue. I have not yet deleted the offending .crt and .key directories to see whether that recovers the node, so we can take time to poke around the node some more if we need to.
🙌 1