adamant-kite-43734
05/10/2024, 9:25 AMcreamy-pencil-82913
05/10/2024, 2:25 PMpowerful-librarian-10572
05/12/2024, 8:57 AMpowerful-librarian-10572
05/12/2024, 8:57 AMfew-coat-36487
05/13/2024, 7:53 AMfew-coat-36487
05/13/2024, 8:18 AM/usr/local/bin/rke2-uninstall.sh
.
# here are some errors
rm: cannot remove '/etc/rancher': Directory not empty
+ true
+ rm -rf /etc/cni
+ rm -rf /opt/cni/bin
+ rm -rf /var/lib/kubelet
+ rm -rf /var/lib/rancher/rke2
+ rm -d /var/lib/rancher
rm: cannot remove '/var/lib/rancher': Directory not empty
So I just removed them manually
rm -rf /var/lib/rancher
rm -rf /etc/rancher
So I get a new error when I recreate a cluster an append the node.
Rancher logs
[INFO ] configuring bootstrap node(s) custom-b02f5fcaccb8: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for agent to check in and apply initial plan
node logs rancher-system-agent
ini
# at the beginningMay 13 10:07:51 server1 rancher-system-agent[3102148]: time="2024-05-13T10:07:51+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 10:07:51 server1 rancher-system-agent[3102148]: time="2024-05-13T10:07:51+02:00" level=fatal msg="Fatal error running: unable to parse config file: error gathering file information for file /etc/rancher/agent/config.yaml: stat /etc/rancher/agen>
May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
May 13 10:07:56 server1 systemd[1]: Stopped Rancher System Agent.
May 13 10:07:56 server1 systemd[1]: Started Rancher System Agent.
# loop 3-5 time
# after a while
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Starting remote watch of plans"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Starting /v1, Kind=Secret controller"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Detected first start, force-applying one-time instruction set"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum 9b3159574e2c6bf80fca35e674677b2fbb576aee9a75f150e0a82af088b3e477"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Extracting image 192.168.137.50/nexus2/rancher/system-agent-installer-rke2:v1.27.8-rke2r1 to directory /var/lib/rancher/agent/work/20>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Pulling image 192.168.137.50/nexus2/rancher/system-agent-installer-rke2:v1.27.8-rke2r1"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Ignoring relative endpoint URL for registry 192.168.137.50: \"192.168.137.50/nexus2\""
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Failed to get image from endpoint: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doesn't c>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error while staging: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doe>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error executing instruction 0: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 beca>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240513-100756/9b3159574e2c6bf80fca35e674677b2fbb576>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 127"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: op>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading x509 client cert/key for probe kube-apiserver (/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt//var/lib/rancher/rke2/serv>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/se>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-apiserver) /var/lib/rancher/rke2/server/tls/server-ca.crt: open /var/lib/rancher/rke2/server/tls/server-ca.c>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-apiserver"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=info msg="[K8s] updated plan secret fleet-default/custom-b02f5fcaccb8-machine-plan with feedback"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading x509 client cert/key for probe kube-apiserver (/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt//var/lib/rancher/rke2/serv>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/se>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-apiserver) /var/lib/rancher/rke2/server/tls/server-ca.crt: open /var/lib/rancher/rke2/server/tls/server-ca.c>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-apiserver"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: op>
# then loop idefinitly
So is it normal ?
- May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
- the pulling seems to block
- CA error ???powerful-librarian-10572
05/13/2024, 8:18 AMpowerful-librarian-10572
05/13/2024, 9:15 AMMay 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Failed to get image from endpoint: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doesn't c>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error while staging: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doe>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error executing instruction 0: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 beca>
powerful-librarian-10572
05/13/2024, 9:15 AMpowerful-librarian-10572
05/13/2024, 10:01 AMerror gathering file information for file /etc/rancher/agent/config.yaml
powerful-librarian-10572
05/13/2024, 10:01 AMfew-coat-36487
05/13/2024, 10:11 AM[root@server1 ~]# /usr/local/bin/rancher-system-agent-uninstall.sh
Removed /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service.
Failed to reset failed state of unit rancher-system-agent.service: Unit rancher-system-agent.service not loaded.
[root@server1 ~]# systemctl stop rancher-system-agent.service
Failed to stop rancher-system-agent.service: Unit rancher-system-agent.service not loaded.
[root@server1 ~]# systemctl disable rancher-system-agent.service
Failed to disable unit: Unit file rancher-system-agent.service does not exist.
[root@server1 ~]#
[root@server1 ~]# systemctl start rancher-system-agent.service
Failed to start rancher-system-agent.service: Unit rancher-system-agent.service not found.
le journalctl -u rancher-system-agent
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Service RestartSec=5s expired, scheduling restart.
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 1819.
May 13 12:05:36 localhost.localdomain systemd[1]: Stopped Rancher System Agent.
May 13 12:05:36 localhost.localdomain systemd[1]: Started Rancher System Agent.
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Starting remote watch of plans"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=fatal msg="error while connecting to Kubernetes cluster: the server has asked for the client to provide credentials"
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
May 13 12:05:38 localhost.localdomain systemd[1]: Stopped Rancher System Agent.
powerful-librarian-10572
05/13/2024, 11:47 AMfew-coat-36487
05/13/2024, 2:29 PMfew-coat-36487
05/13/2024, 2:31 PMpowerful-librarian-10572
05/13/2024, 2:36 PMfew-coat-36487
05/13/2024, 2:37 PMpowerful-librarian-10572
05/13/2024, 2:39 PMfew-coat-36487
05/13/2024, 2:39 PM[INFO ] non-ready bootstrap machine(s) custom-4cedb4f8e960 and join url to be available on bootstrap node
powerful-librarian-10572
05/13/2024, 2:40 PMfew-coat-36487
05/13/2024, 2:40 PMpowerful-librarian-10572
05/13/2024, 2:40 PMpowerful-librarian-10572
05/13/2024, 2:41 PMfew-coat-36487
05/13/2024, 2:41 PMfew-coat-36487
05/13/2024, 2:42 PMpowerful-librarian-10572
05/13/2024, 2:42 PMlevel=error msg="error while appending ca cert to pool for probe kube-controller-manager"
is a non-error, it just happens because a component is started before another one (who loads the certs) doespowerful-librarian-10572
05/13/2024, 2:43 PMpowerful-librarian-10572
05/13/2024, 2:43 PMfew-coat-36487
05/13/2024, 2:44 PMfew-coat-36487
05/14/2024, 7:58 AMWaiting for Node
. The logs are the following:
- rancher-system-agent
It's loop every 10 minutes
ini
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240514-0946>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: Name >
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156052>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156232>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156376>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156556>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> >
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[K8s] updated plan secret fleet-default/custom-4cedb4f8e960-machine-plan with feedback"
- rke2-server
ini
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.838427+0200","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snaps>
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.845245+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.845309+0200","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"<https://127.0.0.1:2379>"}
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.01648+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.023563+0200","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"<https://127.0.0.1:2379>","size":"16 MB","to>
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.023635+0200","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-s>
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-server1-1715637604"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Reconciling ETCDSnapshotFile resources"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
May 14 00:53:13 server1 rke2[8661]: time="2024-05-14T00:53:13+02:00" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 10): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle>
May 14 01:09:47 server1 rke2[8661]: time="2024-05-14T01:09:47+02:00" level=info msg="Reconciling snapshot ConfigMap data"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="wake, now=2024-05-14T05:00:00+02:00"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="run, now=2024-05-14T05:00:00+02:00, entry=1, next=2024-05-14T10:00:00+02:00"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-server1-1715655600"
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.198445+0200","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snaps>
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.205493+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.205617+0200","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"<https://127.0.0.1:2379>"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.341891+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.350396+0200","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"<https://127.0.0.1:2379>","size":"16 MB","to>
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.350619+0200","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-s>
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-server1-1715655600"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Reconciling ETCDSnapshotFile resources"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
few-coat-36487
05/14/2024, 8:00 AMfew-coat-36487
05/14/2024, 8:56 AMfew-coat-36487
05/14/2024, 2:12 PMpowerful-librarian-10572
05/14/2024, 2:43 PM