# general
a
Hey all, trying to figure out exactly where to look to track this down. We're running an RKE2 downstream cluster in Rancher 2.8.2, on a local K3s cluster. Randomly in the middle of the day today I can't add nodes to my RKE2 cluster. This is the output from running the registration command. I've tried redeploying Rancher, rancher-webhook, and the capi deployment.
```
[INFO]  Label: cattle.io/os=linux
[INFO]  Role requested: worker
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Determined CA is not necessary to connect to Rancher
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher-url/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher-url/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
[ERROR]  500 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
```
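For reference, something like this run from the node shows whether basic connectivity to Rancher is even there (rancher-url is just the placeholder from the log above):
```bash
# Quick reachability check from the node being registered.
# "rancher-url" is a placeholder for the real Rancher hostname.
curl -sk -o /dev/null -w 'ping: %{http_code} in %{time_total}s\n' https://rancher-url/ping
curl -sk -o /dev/null -w 'cacerts: %{http_code} in %{time_total}s\n' https://rancher-url/cacerts
```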
I have a bunch of Machines in Cluster Management with
```
      message: waiting for agent to check in and apply initial plan
      reason: Waiting
      status: Unknown
      type: Reconciled
```
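If it helps, this is roughly how to pull that condition for every Machine; I'm assuming the usual fleet-default namespace for provisioned clusters:
```bash
# List the CAPI Machines Rancher created for this cluster and print the
# message of each Machine's Reconciled condition.
kubectl -n fleet-default get machines.cluster.x-k8s.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.conditions[?(@.type=="Reconciled")].message}{"\n"}{end}'
```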
c
agents are timing out accessing your rancher server to get their ID
Looks like they’re able to download the scripts but for some reason the ID request is timing out…
Maybe look at the rancher pod logs on the management cluster?
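e.g. something along these lines against the local cluster (assuming the standard Helm install in cattle-system):
```bash
# Tail the Rancher server pods on the local/management cluster.
kubectl -n cattle-system logs -l app=rancher --tail=200 -f
```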
a
I've made some progress. I'm not sure if this is a lingering issue from our security group missing a couple of rules for about 24 hours last week, but I cycled out one of the K3s manager nodes that was brought online during all that, and then nodes could start joining the cluster. However, of the 30-some worker nodes that joined, only a few of them are showing up properly in Cluster Management. The rest appear to be waiting for the SUC jobs to complete.
image.png
I didn't explicitly configure the SUC; it's part of the default RKE2 install, so I'm not sure what exactly it's trying to do since I haven't created an upgrade plan for this cluster yet. I have noticed that it usually takes 3 tries to complete on just about every node I join to the cluster. I've just ignored it because it doesn't appear to cause any problems.
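For what it's worth, you can list the Plans the SUC is actually acting on; as far as I can tell Rancher's provisioning creates its own system-agent upgrader plan even without a user-defined one:
```bash
# Show all upgrade.cattle.io Plans in the downstream cluster and the jobs
# the SUC has spawned for them.
kubectl get plans.upgrade.cattle.io -A
kubectl -n cattle-system get jobs
```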
The logs for those look pretty much the same, here's a snippet from the end
Copy code
+ grep -q '<http://node-role.kubernetes.io/controlplane|node-role.kubernetes.io/controlplane>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.471695494Z + '[' -z  ]
+ grep -q '<http://node-role.kubernetes.io/control-plane|node-role.kubernetes.io/control-plane>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.476827471Z + '[' -z  ]
2024-06-05T19:50:58.476984524Z + grep -q '<http://node-role.kubernetes.io/worker|node-role.kubernetes.io/worker>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.482114851Z + export 'CATTLE_AGENT_BINARY_LOCAL=true'
2024-06-05T19:50:58.482149972Z + export 'CATTLE_AGENT_UNINSTALL_LOCAL=true'
2024-06-05T19:50:58.482158152Z + export 'CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/rancher-system-agent'
+ export 'CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/rancher-system-agent-uninstall.sh'
2024-06-05T19:50:58.482170762Z + '[' -s /host/etc/systemd/system/rancher-system-agent.env ]
+ chroot /host /var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/install.sh
[FATAL]  You must select at least one role.
2024-06-05T19:50:58.510432123Z + cleanup
+ rm -rf /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY
c
yeah, rancher does that on imported clusters
```
[FATAL]  You must select at least one role.
```
that doesn’t seem right
a
It seems like the Machine resources in cluster management aren't happy until that finally runs successfully
These jobs are failing on worker nodes that are otherwise working fine: all of the standard pods are running, Longhorn is up and working, and they're even accepting normal workloads
the SUC deployment was spamming new pods every second, now it hasn't done anything for about 10 minutes
There's nothing of substance in the controller log, I redeployed the SUC about 25 minutes ago and the controller just has this in the log
```
W0605 19:50:45.333734       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-05T19:50:46.015032960Z time="2024-06-05T19:50:46Z" level=info msg="Applying CRD plans.upgrade.cattle.io"
2024-06-05T19:50:46.750979149Z E0605 19:50:46.750623       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.753723738Z time="2024-06-05T19:50:46Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2024-06-05T19:50:46Z" level=info msg="Starting /v1, Kind=Node controller"
2024-06-05T19:50:46.785056480Z E0605 19:50:46.784534       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.790929080Z time="2024-06-05T19:50:46Z" level=info msg="Starting batch/v1, Kind=Job controller"
E0605 19:50:46.826713       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.832382079Z time="2024-06-05T19:50:46Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
```
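(In case it matters, by "redeployed" I mean roughly this; the deployment name system-upgrade-controller in cattle-system is an assumption based on the default Rancher-managed install:)
```bash
# Restart the system-upgrade-controller and follow its log.
kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
kubectl -n cattle-system logs -f deploy/system-upgrade-controller
```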
just as an example, in our test cluster that isn't having any problems I added a new node. The first time the SUC apply job ran it gave that error above about having to select a node role. Then a new job immediately started and it succeeded. The difference comes after the chroot:
```
2024-06-05T20:34:23.999833095Z + chroot /host /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/install.sh
2024-06-05T20:34:24.020768795Z [INFO]  Using default agent configuration directory /etc/rancher/agent
2024-06-05T20:34:24.020789505Z [INFO]  Using default agent var directory /var/lib/rancher/agent
2024-06-05T20:34:24.105857891Z [INFO]  Determined CA is not necessary to connect to Rancher
2024-06-05T20:34:24.202566186Z [INFO]  Successfully tested Rancher connection
2024-06-05T20:34:24.241500688Z [INFO]  Rancher System Agent was detected on this host. Ensuring the rancher-system-agent is stopped.
2024-06-05T20:34:24.289151802Z [INFO]  Using local rancher-system-agent binary from /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/rancher-system-agent
2024-06-05T20:34:24.427582205Z [INFO]  Using local rancher-system-agent-uninstall.sh script from /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/rancher-system-agent-uninstall.sh
2024-06-05T20:34:24.439181254Z [INFO]  Generating Cattle ID
2024-06-05T20:34:24.440666763Z [INFO]  Cattle ID was already detected as 6d59e8ed240754e19777733c16fd6597be1b769c6b1fcd7ecb7bf4a44368bd8. Not generating a new one.
2024-06-05T20:34:25.141478170Z [INFO]  Successfully downloaded Rancher connection information
2024-06-05T20:34:25.141613322Z [INFO]  systemd: Creating service file
2024-06-05T20:34:25.146069569Z [INFO]  Creating environment file /etc/systemd/system/rancher-system-agent.env
2024-06-05T20:34:25.383220384Z [INFO]  Enabling rancher-system-agent.service
2024-06-05T20:34:25.574670219Z [INFO]  Starting/restarting rancher-system-agent.service
2024-06-05T20:34:25.615165660Z + cleanup
2024-06-05T20:34:25.615189921Z + rm -rf /host/var/lib/rancher/agent/tmp/tmp.6BacBkc7EO
```
Well, back to the same issue, which is probably all part of the above: new nodes can't join and time out trying to get a Cattle ID. I see this in my rancher deployment logs; these machines do not exist when looking in Cluster Management.
```
2024/06/05 20:56:07 [ERROR] error syncing 'fleet-default/custom-c653a9a198a3': handler rke-bootstrap: failed to delete fleet-default/custom-c653a9a198a3-machine-bootstrap /v1, Kind=ServiceAccount for rke-bootstrap fleet-default/custom-c653a9a198a3: serviceaccounts "custom-c653a9a198a3-machine-bootstrap" not found, failed to delete fleet-default/custom-c653a9a198a3-machine-plan /v1, Kind=ServiceAccount for rke-bootstrap fleet-default/custom-c653a9a198a3: serviceaccounts "custom-c653a9a198a3-machine-plan" not found, requeuing
```
there's a lot more than just that but you get the idea
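If anyone wants to poke at the same objects those errors reference, something like this on the local cluster should do it (resource names assume the stock rke.cattle.io and CAPI CRDs, and the usual fleet-default namespace):
```bash
# On the local/management cluster: list the bootstrap objects, Machines,
# and per-machine ServiceAccounts referenced by those "error syncing" messages.
kubectl -n fleet-default get machines.cluster.x-k8s.io
kubectl -n fleet-default get rkebootstraps.rke.cattle.io
kubectl -n fleet-default get serviceaccounts | grep machine
```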
whelp, I gave up and restored from a backup taken before we had the network issues; maybe something weird happened in the RDS database. One question @creamy-pencil-82913: I killed my 3 K3s manager nodes, brought new ones up connecting to the restored database, and after 5 minutes everything is up and fine. But Cluster Management is only showing the Machines that were in the downstream cluster as of 10 days ago (when the snapshot was made). How do I get that back in sync with what's actually in my RKE2 cluster?
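The mismatch I mean is basically this comparison (fleet-default assumed):
```bash
# Against the local/management cluster: what Rancher thinks exists
kubectl -n fleet-default get machines.cluster.x-k8s.io

# Against the downstream RKE2 cluster: what actually exists
kubectl get nodes
```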
c
that I’m not sure about. Are you sure the cluster agent is running and checking in with rancher?
a
I just added a few new worker nodes to verify that everything worked and they joined right away without any of the above issues. Other than that, no I'm not sure, how would I check that?
c
cluster agent logs
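i.e. something like this against the downstream cluster:
```bash
# The agent that phones home to Rancher from the downstream cluster.
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=200
```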
a
I have a bunch of Machines in Nodenotfound status from the other issue we've discussed. I have a script that clears those out while checking for healthy nodes with the same name, but I'm reluctant to run that if all of the healthy nodes aren't showing up as a Machine resource, since it will kill those https://github.com/rancher/rancher/issues/45646
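(Rough, dry-run sketch of what that script does; the NodeNotFound matching and the kubeconfig variables here are simplifications, so treat it as illustrative only:)
```bash
#!/usr/bin/env bash
# Dry-run sketch: find Machines whose node was not found and check whether a
# healthy Node with the same name still exists in the downstream cluster.
# KUBECONFIG_LOCAL / KUBECONFIG_DOWNSTREAM are placeholders for the two
# kubeconfigs; nothing is actually deleted here.
set -euo pipefail

for m in $(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
    get machines.cluster.x-k8s.io -o name); do
  status=$(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
    get "$m" -o jsonpath='{.status}')
  if grep -qi 'nodenotfound' <<<"$status"; then
    node=$(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
      get "$m" -o jsonpath='{.status.nodeRef.name}')
    if [ -n "$node" ] && kubectl --kubeconfig "$KUBECONFIG_DOWNSTREAM" \
        get node "$node" >/dev/null 2>&1; then
      echo "SKIP $m: healthy node $node still exists downstream"
    else
      echo "WOULD DELETE $m (node not found)"   # dry run only
    fi
  fi
done
```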
re: the cluster agent logs. I don't really know what to look for, but I assume this is ok. I went to one of my worker nodes that was in the downstream cluster earlier today, before I restored the local cluster from backups, and I see this as the last few lines. This is after the restore
```
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 10.114.38.114:6443"
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [10.114.30.239:6
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 10.114.38.114:9345"
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [10.114.30.239:9345 10.114.
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Connecting to proxy" url="wss://10.114.38.114:9345/v1-rke2/connect"
```
c
that’s the rke2 log, not the cluster-agent pod log
a
oh, cattle-cluster-agent pod in cattle-system namespace in the downstream cluster?
There wasn't much in there, some prometheus-node-exporter spam in one and this in the other
```
time="2024-06-05T22:45:40Z" level=error msg="Error during subscribe websocket: close sent"
2024-06-05T22:46:00.942449069Z time="2024-06-05T22:46:00Z" level=error msg="Error during subscribe websocket: close sent"
2024-06-05T23:11:57.553310379Z W0605 23:11:57.552896      56 transport.go:301] Unable to cancel request for *client.addQuery
2024-06-05T23:11:57.567532009Z W0605 23:11:57.567164      56 transport.go:301] Unable to cancel request for *client.addQuery
```
I went ahead and deleted those one at a time; guess I could've just redeployed the cattle-cluster-agent deployment
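i.e. something like:
```bash
# Equivalent to deleting the pods one at a time: bounce the whole deployment.
kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
```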