# vsphere
a
First I'd check the `rancher-system-agent` logs from the provisioned node.
b
Hi @agreeable-oil-87482, thanks for the prompt reply. Should the logs be located under `/var/log/rancher/` or somewhere else? Under the `/etc/rancher/agent/` directory I do not have any relevant files. Sorry for the silly question and thanks for your help.
a
Depends on the OS, but if it's using systemd: `journalctl -u rancher-system-agent`
b
Okay, the rancher-system-agent does not seem to exist, thus no logs. Is there something else I could check? This is the current state of my cluster: it is basically stuck there with no further changes. I am definitely missing something here.
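A quick check that the agent was ever installed at all, assuming the default install locations:
```
# If provisioning got far enough to install the agent, both of these exist
# (paths are the defaults used by the Rancher install script):
ls -l /usr/local/bin/rancher-system-agent
systemctl list-unit-files | grep rancher-system-agent
```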
a
Has the hostname changed for the node?
b
Indeed, the hostname was updated and I have an IP address assigned from the DHCP server. I am also able to execute cloud_config code blocks (defined in the Rancher UI under cloud_config).
I also tested the websocket connection to the Rancher instance. It works fine.
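For the connectivity check, Rancher's public /ping endpoint is handy; x.x.x.x stands in for the Rancher host:
```
# Rancher answers "pong" on /ping; -k skips verification for a self-signed cert
curl -sk https://x.x.x.x/ping
```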
a
If you look in the user-data file from cloud-init, it will reference writing and running a script. Can you try running that manually?
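A sketch of where to find it, assuming standard cloud-init paths (the exact script name may differ on your image):
```
# The user-data cloud-init applied on this boot:
sudo cat /var/lib/cloud/instance/user-data.txt
# Scripts cloud-init wrote from it land here; run the bootstrap one by hand:
ls /var/lib/cloud/instance/scripts/
sudo /var/lib/cloud/instance/scripts/runcmd
```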
b
Okay, so I believe the execution of the script does not work due to a certificate issue:
```
2023-08-03 11:50:19,894 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
[INFO] --no-roles flag passed, unsetting all other requested roles
[INFO] Using default agent configuration directory /etc/rancher/agent
[INFO] Using default agent var directory /var/lib/rancher/agent
[INFO] Determined CA is necessary to connect to Rancher
[INFO] Successfully downloaded CA certificate
[INFO] Value from <https://x.x.x.x/cacerts> is an x509 certificate
[ERROR] Configured cacerts checksum (xxx) does not match given --ca-checksum (xxx)
[ERROR] Please check if the correct certificate is configured at <https://x.x.x.x/cacerts>
2023-08-03 11:50:20,010 - subp.py[DEBUG]: Unexpected error while running command.
```
Okay, at least I have a pointer for further troubleshooting steps. The Rancher instance uses a self-signed cert at the moment.
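For context, the --ca-checksum the script compares against should simply be the SHA-256 of the PEM served at /cacerts, so it can be recomputed by hand (x.x.x.x is the placeholder from the log above):
```
# Hash the CA bundle Rancher currently serves; the result should match the
# --ca-checksum baked into the registration command:
curl --insecure -sfL https://x.x.x.x/cacerts | sha256sum | awk '{print $1}'
```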
a
Run something like `openssl s_client -showcerts -connect YourRancherURL.com:443` from a node other than the one Rancher is running on and see which cert is being presented.
b
Hey @agreeable-oil-87482, thanks a lot for the hint. For anyone having similar issues: on Ubuntu 22.04 the openssl command is `openssl s_client -showcerts -connect IP_ADDR:PORT`. I was able to re-run the user-data file from cloud-init manually, and the `rancher-system-agent` is now there. Below is the output from the journalctl command:
```
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m systemd[1]: Started Rancher System Agent.
░░ Subject: A start job for unit rancher-system-agent.service has finished successfully
░░ Defined-By: systemd
░░ Support: <http://www.ubuntu.com/support>
░░ 
░░ A start job for unit rancher-system-agent.service has finished successfully.
░░ 
░░ The job identifier is 1465.
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m rancher-system-agent[2611]: time="2023-08-03T15:32:16Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m rancher-system-agent[2611]: time="2023-08-03T15:32:16Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m rancher-system-agent[2611]: time="2023-08-03T15:32:16Z" level=info msg="Starting remote watch of plans"
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m rancher-system-agent[2611]: E0803 15:32:16.783848    2611 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Aug 03 15:32:16 rke2-self-sign-pool1-3aca6d04-4vx4m rancher-system-agent[2611]: time="2023-08-03T15:32:16Z" level=info msg="Starting /v1, Kind=Secret controller"
```
Okay, so the rancher-system-agent is running, and I do see the below log messages.
```
# systemctl status rancher-system-agent
● rancher-system-agent.service - Rancher System Agent
     Loaded: loaded (/etc/systemd/system/rancher-system-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-08-04 08:46:00 UTC; 14min ago
       Docs: <https://www.rancher.com>
   Main PID: 2632 (rancher-system-)
      Tasks: 11 (limit: 4556)
     Memory: 101.7M
        CPU: 3.425s
     CGroup: /system.slice/rancher-system-agent.service
             └─2632 /usr/local/bin/rancher-system-agent sentinel

Aug 04 09:00:11 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:11Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manag>
Aug 04 09:00:11 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:11Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Aug 04 09:00:16 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:16Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manag>
Aug 04 09:00:16 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:16Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Aug 04 09:00:16 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:16Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/ranch>
Aug 04 09:00:16 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:16Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Aug 04 09:00:21 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:21Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/ranch>
Aug 04 09:00:21 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:21Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Aug 04 09:00:21 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:21Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manag>
Aug 04 09:00:21 rke2-canal-self-signed-pool1-8cc19983-dqmbb rancher-system-agent[2632]: time="2023-08-04T09:00:21Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
```
Indeed, the kube-scheduler and kube-controller-manager directories do not exist.
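Those probe certs are only written once rke2-server generates the control-plane TLS material, so a quick check (default RKE2 paths assumed):
```
# Created by rke2-server when it generates control-plane certificates;
# absent until the server has started successfully:
ls -l /var/lib/rancher/rke2/server/tls/kube-controller-manager \
      /var/lib/rancher/rke2/server/tls/kube-scheduler
```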
a
Has the `rke2-server` service started?
b
```
systemctl status rke2-server
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: enabled)
     Active: activating (start) since Fri 2023-08-04 09:01:10 UTC; 2min 44s ago
       Docs: <https://github.com/rancher/rke2#readme>
    Process: 7520 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 7522 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 7523 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 7524 (rke2)
      Tasks: 59
     Memory: 1.7G
        CPU: 2min 46.683s
     CGroup: /system.slice/rke2-server.service
             ├─3450 /var/lib/rancher/rke2/data/v1.26.7-rke2r1-9873cf6e613f/bin/containerd-shim-runc-v2 -namespace k8s.io -id e05d0df1d14bddb17fec8c3eef22187a6302fbfb752f4316ec6f9b4c0d302856 -address /run/k3s/containerd/containerd.sock
             ├─3539 /var/lib/rancher/rke2/data/v1.26.7-rke2r1-9873cf6e613f/bin/containerd-shim-runc-v2 -namespace k8s.io -id 6851153a309b7049795e2300a17d9a88ecfb08c10df85c5dcb3033bd82fd53d1 -address /run/k3s/containerd/containerd.sock
             ├─7524 "/usr/local/bin/rke2 server"
             ├─7536 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
             └─8246 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --cloud-provider=external --cloud-config= --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Web>

Aug 04 09:03:19 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:19Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:24 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:24Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:29 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:29Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:34 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:34Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:39 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:39Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:44 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:44Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:44 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:44Z" level=error msg="Kubelet exited: exit status 255"
Aug 04 09:03:46 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:46Z" level=info msg="Waiting for API server to become available"
Aug 04 09:03:49 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:49Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
Aug 04 09:03:54 rke2-canal-self-signed-pool1-8cc19983-dqmbb rke2[7524]: time="2023-08-04T09:03:54Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: <https://127.0.0.1:9345/v1-rke2/readyz>: 500 Internal Server Error"
```
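Given the "Kubelet exited: exit status 255" line above, the kubelet's own log looks like the next thing to check; it lives outside journald on a default RKE2 install:
```
# Default kubelet log location on an RKE2 node:
tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log
```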
a
Can you post the output of `journalctl -u rke2-server`, please?
b
I cannot paste the output as code because it is quite long. I uploaded the file instead.
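For reference, a long unit log can be dumped to a file for sharing with standard journalctl options:
```
# --no-pager writes the whole log straight to stdout:
journalctl -u rke2-server --no-pager > rke2-server.log
```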
a
Are you using a private registry?
b
You mean a private container registry?
a
Yes
b
No, we do not use a private registry.
a
Does `/etc/rancher/rke2/registries.yaml` exist on the node?
b
Yes, and its contents are the following: `{"configs":{},"mirrors":null}`. I am checking the directory `/var/lib/rancher/rke2/agent/images` and looking at the images we try to pull from Docker Hub.
Btw, do I need to set up a private registry on the node? Additionally, I thought I did not need Docker on the nodes for RKE2 to work. Is that correct?
a
No, it's optional. And correct, you don't need Docker: RKE2 runs its own bundled containerd.
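A way to confirm the bundled runtime works without Docker, assuming the default RKE2 paths for its crictl:
```
# RKE2 ships containerd plus a crictl binary of its own:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl images   # images containerd has pulled
```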
b
Any other ideas on how to continue troubleshooting?
@agreeable-oil-87482 Thanks a lot for your help and the troubleshooting hints. My issues are resolved: the loopback DNS resolution was broken on my VM. Have a great day! :)
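For anyone hitting the same symptom on Ubuntu 22.04, a few commands that help spot loopback-resolver problems (the hostname is a placeholder):
```
# systemd-resolved runs a stub listener on 127.0.0.53; check what it
# forwards to and whether names resolve from the node:
resolvectl status
cat /etc/resolv.conf                  # usually points at 127.0.0.53
nslookup rancher.example.com          # placeholder for your Rancher hostname
```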