# rke2
c
This is a Rancher provisioning issue, not RKE2. The message indicates that Rancher thinks this cluster already exists and was running at some point, but no longer has any nodes left. Have you tried making a new cluster?
p
If you run rke2-uninstall after deleting the previous cluster, this will not happen.
At least, when I was bonking on my test cluster, this worked.
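Roughly this order, assuming the stock script locations (a sketch of what I do, not gospel):
Copy code
# after deleting the cluster in the Rancher UI, on the node:
/usr/local/bin/rancher-system-agent-uninstall.sh   # remove the provisioning agent first
/usr/local/bin/rke2-uninstall.sh                   # then remove rke2 itself
# check for leftovers the scripts sometimes can't delete:
ls -la /etc/rancher /var/lib/rancher 2>/dev/null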
f
Okay, I will try to uninstall rke2 manually.
@powerful-librarian-10572 I did uninstall manually via /usr/local/bin/rke2-uninstall.sh.
Copy code
# here are some errors
rm: cannot remove '/etc/rancher': Directory not empty
+ true
+ rm -rf /etc/cni
+ rm -rf /opt/cni/bin
+ rm -rf /var/lib/kubelet
+ rm -rf /var/lib/rancher/rke2
+ rm -d /var/lib/rancher
rm: cannot remove '/var/lib/rancher': Directory not empty

So I just removed them manually:
rm -rf /var/lib/rancher
rm -rf /etc/rancher
So I get a new error when I recreate a cluster and add the node. Rancher logs:
Copy code
[INFO ] configuring bootstrap node(s) custom-b02f5fcaccb8: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for agent to check in and apply initial plan
Node logs (rancher-system-agent):
Copy code
# at the beginning
May 13 10:07:51 server1 rancher-system-agent[3102148]: time="2024-05-13T10:07:51+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 10:07:51 server1 rancher-system-agent[3102148]: time="2024-05-13T10:07:51+02:00" level=fatal msg="Fatal error running: unable to parse config file: error gathering file information for file /etc/rancher/agent/config.yaml: stat /etc/rancher/agen>
May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
May 13 10:07:56 server1 systemd[1]: Stopped Rancher System Agent.
May 13 10:07:56 server1 systemd[1]: Started Rancher System Agent.
# loops 3-5 times
# after a while
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Starting remote watch of plans"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Starting /v1, Kind=Secret controller"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Detected first start, force-applying one-time instruction set"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Applying one-time instructions for plan with checksum 9b3159574e2c6bf80fca35e674677b2fbb576aee9a75f150e0a82af088b3e477"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Extracting image 192.168.137.50/nexus2/rancher/system-agent-installer-rke2:v1.27.8-rke2r1 to directory /var/lib/rancher/agent/work/20>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Using private registry config file at /etc/rancher/agent/registries.yaml"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="Pulling image 192.168.137.50/nexus2/rancher/system-agent-installer-rke2:v1.27.8-rke2r1"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Ignoring relative endpoint URL for registry 192.168.137.50: \"192.168.137.50/nexus2\""
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Failed to get image from endpoint: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doesn't c>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error while staging: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doe>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error executing instruction 0: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 beca>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240513-100756/9b3159574e2c6bf80fca35e674677b2fbb576>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 127"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: op>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading x509 client cert/key for probe kube-apiserver (/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt//var/lib/rancher/rke2/serv>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/se>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-apiserver) /var/lib/rancher/rke2/server/tls/server-ca.crt: open /var/lib/rancher/rke2/server/tls/server-ca.c>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-apiserver"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=info msg="[K8s] updated plan secret fleet-default/custom-b02f5fcaccb8-machine-plan with feedback"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading x509 client cert/key for probe kube-apiserver (/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt//var/lib/rancher/rke2/serv>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/se>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-apiserver) /var/lib/rancher/rke2/server/tls/server-ca.crt: open /var/lib/rancher/rke2/server/tls/server-ca.c>
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error while appending ca cert to pool for probe kube-apiserver"
May 13 10:07:57 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:57+02:00" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: op>
# then loops indefinitely
So, is this normal?
- May 13 10:07:51 server1 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
- the image pull seems to be blocked
- the CA errors???
p
yep
Copy code
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=warning msg="Failed to get image from endpoint: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doesn't c>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error while staging: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 because it doe>
May 13 10:07:56 server1 rancher-system-agent[3102296]: time="2024-05-13T10:07:56+02:00" level=error msg="error executing instruction 0: all endpoints failed: Get \"<https://192.168.137.50/v2/>\": x509: cannot validate certificate for 192.168.137.50 beca>
No clue where this comes from:
error gathering file information for file /etc/rancher/agent/config.yaml
Did you run rancher-system-agent-uninstall as well, before rke2-uninstall?
f
It's because I'm offline: I use a Nexus repository that stores all my images / RPMs / pip packages etc., and if it doesn't have an image, Nexus can download it. So I added it in the registries config. I didn't remove it because I couldn't:
Copy code
[root@server1 ~]# /usr/local/bin/rancher-system-agent-uninstall.sh
Removed /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service.
Failed to reset failed state of unit rancher-system-agent.service: Unit rancher-system-agent.service not loaded.
[root@server1 ~]# systemctl stop rancher-system-agent.service
Failed to stop rancher-system-agent.service: Unit rancher-system-agent.service not loaded.
[root@server1 ~]# systemctl disable rancher-system-agent.service
Failed to disable unit: Unit file rancher-system-agent.service does not exist.
[root@server1 ~]#
[root@server1 ~]# systemctl start rancher-system-agent.service
Failed to start rancher-system-agent.service: Unit rancher-system-agent.service not found.
The journalctl -u rancher-system-agent:
Copy code
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Service RestartSec=5s expired, scheduling restart.
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 1819.
May 13 12:05:36 localhost.localdomain systemd[1]: Stopped Rancher System Agent.
May 13 12:05:36 localhost.localdomain systemd[1]: Started Rancher System Agent.
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=info msg="Starting remote watch of plans"
May 13 12:05:36 localhost.localdomain rancher-system-agent[46032]: time="2024-05-13T12:05:36+02:00" level=fatal msg="error while connecting to Kubernetes cluster: the server has asked for the client to provide credentials"
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
May 13 12:05:36 localhost.localdomain systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
May 13 12:05:38 localhost.localdomain systemd[1]: Stopped Rancher System Agent.
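If I read the x509 error right, the agent just doesn't trust the Nexus certificate, and the "relative endpoint" warning suggests the endpoint needs a full URL with a scheme. So I guess /etc/rancher/agent/registries.yaml should look something like this (the CA path here is only an example, not my real file):
Copy code
# sketch of /etc/rancher/agent/registries.yaml -- the CA path is a made-up example
mirrors:
  "192.168.137.50":
    endpoint:
      - "https://192.168.137.50"   # full URL; a bare "192.168.137.50/nexus2" is ignored as relative
configs:
  "192.168.137.50":
    tls:
      ca_file: /etc/ssl/certs/nexus-ca.pem   # hypothetical path to the Nexus CA certificate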
p
Ah, I'm unfamiliar with air-gapped installs...
f
Okay, for testing, I tried online without configuring the registries. I have a very poor connection, and I have now been waiting for the node for an hour and a half. See the logs in the attached file.
Is that normal?
p
I don't see anything wrong
f
So the CA cert errors are okay?
p
I don't know? All I know is that your two logs indicate a final success.
f
The last logs in Rancher are the following:
Copy code
[INFO ] non-ready bootstrap machine(s) custom-4cedb4f8e960 and join url to be available on bootstrap node
p
This error happens when you try to join a second node while the first one is not yet initialized.
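If you want to see what Rancher itself thinks of the machines, you can query them from the local (management) cluster, something like this (machine name and namespace taken from your logs, so treat it as a sketch):
Copy code
# run against the Rancher management cluster, not the downstream node
kubectl -n fleet-default get machines
kubectl -n fleet-default describe machine custom-4cedb4f8e960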
f
But I only have one node; it's strange.
p
Yeah, it's not looking good.
Rancher is not intuitive with its errors, always complaining about nothing of importance and not telling you when something is really wrong.
f
Maybe I should try without Rancher, only RKE2?
What do you use?
p
For example:
level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
is a non-error; it just happens because a component is started before the one that loads the certs does.
I use Rancher 2 + RKE2, as stock as possible.
Try to wait some more if possible.
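You can also watch it converge on the node itself, e.g. (assuming a server node and the default rke2 paths):
Copy code
# tail the provisioning agent and rke2 logs
journalctl -u rancher-system-agent -f
journalctl -u rke2-server -f
# once rke2 is up, check node state with the bundled kubectl and kubeconfig
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide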
f
Okay, thanks.
After a good night's sleep, the server is still stuck in the same state, Waiting for Node. The logs are the following:
- rancher-system-agent (it loops every 10 minutes)
Copy code
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240514-0946>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: Name                          >
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156052>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156232>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156376>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[a354805aaee7ac98533ad574dd55d9a4fef1970ee702f98dd40a5923424295da_0:stdout]: etcd-snapshot-server1-17156556>
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> >
May 14 09:46:10 server1 rancher-system-agent[8485]: time="2024-05-14T09:46:10+02:00" level=info msg="[K8s] updated plan secret fleet-default/custom-4cedb4f8e960-machine-plan with feedback"
- rke2-server
Copy code
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.838427+0200","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snaps>
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.845245+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
May 14 00:00:03 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:03.845309+0200","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"<https://127.0.0.1:2379>"}
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.01648+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.023563+0200","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"<https://127.0.0.1:2379>","size":"16 MB","to>
May 14 00:00:04 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T00:00:04.023635+0200","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-s>
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-server1-1715637604"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Reconciling ETCDSnapshotFile resources"
May 14 00:00:04 server1 rke2[8661]: time="2024-05-14T00:00:04+02:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
May 14 00:53:13 server1 rke2[8661]: time="2024-05-14T00:53:13+02:00" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 10): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle>
May 14 01:09:47 server1 rke2[8661]: time="2024-05-14T01:09:47+02:00" level=info msg="Reconciling snapshot ConfigMap data"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="wake, now=2024-05-14T05:00:00+02:00"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="run, now=2024-05-14T05:00:00+02:00, entry=1, next=2024-05-14T10:00:00+02:00"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-server1-1715655600"
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.198445+0200","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snaps>
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.205493+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.205617+0200","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"<https://127.0.0.1:2379>"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.341891+0200","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.350396+0200","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"<https://127.0.0.1:2379>","size":"16 MB","to>
May 14 05:00:00 server1 rke2[8661]: {"level":"info","ts":"2024-05-14T05:00:00.350619+0200","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-s>
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-server1-1715655600"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Reconciling ETCDSnapshotFile resources"
May 14 05:00:00 server1 rke2[8661]: time="2024-05-14T05:00:00+02:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Do you know where / how to find the Rancher / RKE2 devs on Slack?
New info in the dashboard
I'm trying only with rke2
p
I have no idea