# general
a
Hey all, trying to figure out exactly where to look to track this down. We're running an RKE2 downstream cluster in Rancher 2.8.2, on a local K3s cluster. Randomly in the middle of the day today I can't add nodes to my RKE2 cluster. This is the output from running the registration command. I've tried redeploying Rancher, rancher-webhook, and the capi deployment.
```
[INFO]  Label: cattle.io/os=linux
[INFO]  Role requested: worker
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Determined CA is not necessary to connect to Rancher
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher-url/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher-url/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
[ERROR]  500 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
```
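For reference, something like this run from the node shows whether basic connectivity to Rancher is even there (rancher-url is just the placeholder from the log above):
```bash
# Quick reachability check from the node being registered.
# "rancher-url" is a placeholder for the real Rancher hostname.
curl -sk -o /dev/null -w 'ping: %{http_code} in %{time_total}s\n' https://rancher-url/ping
curl -sk -o /dev/null -w 'cacerts: %{http_code} in %{time_total}s\n' https://rancher-url/cacerts
```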
I have a bunch of Machines in Cluster Management with
```
      message: waiting for agent to check in and apply initial plan
      reason: Waiting
      status: Unknown
      type: Reconciled
```
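If it helps, this is roughly how to pull that condition for every Machine; I'm assuming the usual fleet-default namespace for provisioned clusters:
```bash
# List the CAPI Machines Rancher created for this cluster and print the
# message of each Machine's Reconciled condition.
kubectl -n fleet-default get machines.cluster.x-k8s.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.conditions[?(@.type=="Reconciled")].message}{"\n"}{end}'
```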
c
agents are timing out accessing your rancher server to get their ID
Looks like they’re able to download the scripts but for some reason the ID request is timing out…
Maybe look at the rancher pod logs on the management cluster?
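e.g. something along these lines against the local cluster (assuming the standard Helm install in cattle-system):
```bash
# Tail the Rancher server pods on the local/management cluster.
kubectl -n cattle-system logs -l app=rancher --tail=200 -f
```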
a
I've made some progress. I'm not sure if this is a lingering issue from our security group missing a couple of rules for about 24 hours last week, but I cycled out one of the K3s manager nodes that was brought online during all that, and then nodes could start joining the cluster. However, of the 30-some worker nodes that joined, only a few of them are showing up properly in Cluster Management. The rest appear to be waiting for the SUC jobs to complete.
image.png
I didn't explicitly configure the SUC; it's part of the default RKE2 install, so I'm not sure what exactly it's trying to do since I haven't created an upgrade plan for this cluster yet. I have noticed that it usually takes 3 tries to complete on just about every node I join to the cluster. I've just ignored it because it doesn't appear to cause any problems.
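For what it's worth, you can list the Plans the SUC is actually acting on; as far as I can tell Rancher's provisioning creates its own system-agent upgrader plan even without a user-defined one:
```bash
# Show all upgrade.cattle.io Plans in the downstream cluster and the jobs
# the SUC has spawned for them.
kubectl get plans.upgrade.cattle.io -A
kubectl -n cattle-system get jobs
```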
The logs for those look pretty much the same, here's a snippet from the end
Copy code
+ grep -q '<http://node-role.kubernetes.io/controlplane|node-role.kubernetes.io/controlplane>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.471695494Z + '[' -z  ]
+ grep -q '<http://node-role.kubernetes.io/control-plane|node-role.kubernetes.io/control-plane>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.476827471Z + '[' -z  ]
2024-06-05T19:50:58.476984524Z + grep -q '<http://node-role.kubernetes.io/worker|node-role.kubernetes.io/worker>: "true"' /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/node.yaml
2024-06-05T19:50:58.482114851Z + export 'CATTLE_AGENT_BINARY_LOCAL=true'
2024-06-05T19:50:58.482149972Z + export 'CATTLE_AGENT_UNINSTALL_LOCAL=true'
2024-06-05T19:50:58.482158152Z + export 'CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/rancher-system-agent'
+ export 'CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/rancher-system-agent-uninstall.sh'
2024-06-05T19:50:58.482170762Z + '[' -s /host/etc/systemd/system/rancher-system-agent.env ]
+ chroot /host /var/lib/rancher/agent/tmp/tmp.wRwguY8TlY/install.sh
[FATAL]  You must select at least one role.
2024-06-05T19:50:58.510432123Z + cleanup
+ rm -rf /host/var/lib/rancher/agent/tmp/tmp.wRwguY8TlY
c
yeah, rancher does that on imported clusters
```
[FATAL]  You must select at least one role.
```
that doesn’t seem right
a
It seems like the Machine resources in cluster management aren't happy until that finally runs successfully
These jobs are failing on worker nodes that are otherwise working fine: all of the standard pods are running, Longhorn is up and working, and they're even accepting normal workloads
the SUC deployment was spamming new pods every second, now it hasn't done anything for about 10 minutes
There's nothing of substance in the controller log, I redeployed the SUC about 25 minutes ago and the controller just has this in the log
```
W0605 19:50:45.333734       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-05T19:50:46.015032960Z time="2024-06-05T19:50:46Z" level=info msg="Applying CRD plans.upgrade.cattle.io"
2024-06-05T19:50:46.750979149Z E0605 19:50:46.750623       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.753723738Z time="2024-06-05T19:50:46Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2024-06-05T19:50:46Z" level=info msg="Starting /v1, Kind=Node controller"
2024-06-05T19:50:46.785056480Z E0605 19:50:46.784534       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.790929080Z time="2024-06-05T19:50:46Z" level=info msg="Starting batch/v1, Kind=Job controller"
E0605 19:50:46.826713       1 memcache.go:206] couldn't get resource list for custom.metrics.k8s.io/v1beta1: Got empty response for: custom.metrics.k8s.io/v1beta1
2024-06-05T19:50:46.832382079Z time="2024-06-05T19:50:46Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
```
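(In case it matters, by "redeployed" I mean roughly this; the deployment name system-upgrade-controller in cattle-system is an assumption based on the default Rancher-managed install:)
```bash
# Restart the system-upgrade-controller and follow its log.
kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
kubectl -n cattle-system logs -f deploy/system-upgrade-controller
```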
just as an example, in our test cluster that isn't having any problems I added a new node. The first time the SUC apply job ran it gave that error above about having to select a node role. Then a new job immediately started and it succeeded. The difference comes after the chroot:
```
2024-06-05T20:34:23.999833095Z + chroot /host /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/install.sh
2024-06-05T20:34:24.020768795Z [INFO]  Using default agent configuration directory /etc/rancher/agent
2024-06-05T20:34:24.020789505Z [INFO]  Using default agent var directory /var/lib/rancher/agent
2024-06-05T20:34:24.105857891Z [INFO]  Determined CA is not necessary to connect to Rancher
2024-06-05T20:34:24.202566186Z [INFO]  Successfully tested Rancher connection
2024-06-05T20:34:24.241500688Z [INFO]  Rancher System Agent was detected on this host. Ensuring the rancher-system-agent is stopped.
2024-06-05T20:34:24.289151802Z [INFO]  Using local rancher-system-agent binary from /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/rancher-system-agent
2024-06-05T20:34:24.427582205Z [INFO]  Using local rancher-system-agent-uninstall.sh script from /var/lib/rancher/agent/tmp/tmp.6BacBkc7EO/rancher-system-agent-uninstall.sh
2024-06-05T20:34:24.439181254Z [INFO]  Generating Cattle ID
2024-06-05T20:34:24.440666763Z [INFO]  Cattle ID was already detected as 6d59e8ed240754e19777733c16fd6597be1b769c6b1fcd7ecb7bf4a44368bd8. Not generating a new one.
2024-06-05T20:34:25.141478170Z [INFO]  Successfully downloaded Rancher connection information
2024-06-05T20:34:25.141613322Z [INFO]  systemd: Creating service file
2024-06-05T20:34:25.146069569Z [INFO]  Creating environment file /etc/systemd/system/rancher-system-agent.env
2024-06-05T20:34:25.383220384Z [INFO]  Enabling rancher-system-agent.service
2024-06-05T20:34:25.574670219Z [INFO]  Starting/restarting rancher-system-agent.service
2024-06-05T20:34:25.615165660Z + cleanup
2024-06-05T20:34:25.615189921Z + rm -rf /host/var/lib/rancher/agent/tmp/tmp.6BacBkc7EO
```
Well, back to the same issue, which is probably all part of the above: new nodes can't join and time out trying to get a Cattle ID. I see this in my rancher deployment logs; these machines do not exist when looking in Cluster Management.
```
2024/06/05 20:56:07 [ERROR] error syncing 'fleet-default/custom-c653a9a198a3': handler rke-bootstrap: failed to delete fleet-default/custom-c653a9a198a3-machine-bootstrap /v1, Kind=ServiceAccount for rke-bootstrap fleet-default/custom-c653a9a198a3: serviceaccounts "custom-c653a9a198a3-machine-bootstrap" not found, failed to delete fleet-default/custom-c653a9a198a3-machine-plan /v1, Kind=ServiceAccount for rke-bootstrap fleet-default/custom-c653a9a198a3: serviceaccounts "custom-c653a9a198a3-machine-plan" not found, requeuing
```
there's a lot more than just that but you get the idea
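If anyone wants to poke at the same objects those errors reference, something like this on the local cluster should do it (resource names assume the stock rke.cattle.io and CAPI CRDs, and the usual fleet-default namespace):
```bash
# On the local/management cluster: list the bootstrap objects, Machines,
# and per-machine ServiceAccounts referenced by those "error syncing" messages.
kubectl -n fleet-default get machines.cluster.x-k8s.io
kubectl -n fleet-default get rkebootstraps.rke.cattle.io
kubectl -n fleet-default get serviceaccounts | grep machine
```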
whelp, I gave up and restored from a backup taken before we had the network issues; maybe something weird happened in the RDS database. One question @creamy-pencil-82913: I killed my 3 K3s manager nodes, brought new ones up connecting to the restored database, and after 5 minutes everything is up and fine. But Cluster Management is only showing the Machines that were in the downstream cluster as of 10 days ago (when the snapshot was made). How do I get that back in sync with what's actually in my RKE2 cluster?
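The mismatch I mean is basically this comparison (fleet-default assumed):
```bash
# Against the local/management cluster: what Rancher thinks exists
kubectl -n fleet-default get machines.cluster.x-k8s.io

# Against the downstream RKE2 cluster: what actually exists
kubectl get nodes
```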
c
that I’m not sure about. Are you sure the cluster agent is running and checking in with rancher?
a
I just added a few new worker nodes to verify that everything worked and they joined right away without any of the above issues. Other than that, no I'm not sure, how would I check that?
c
cluster agent logs
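i.e. something like this against the downstream cluster:
```bash
# The agent that phones home to Rancher from the downstream cluster.
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=200
```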
a
I have a bunch of Machines in Nodenotfound status from the other issue we've discussed. I have a script that clears those out while checking for healthy nodes with the same name, but I'm reluctant to run that if all of the healthy nodes aren't showing up as a Machine resource, since it will kill those https://github.com/rancher/rancher/issues/45646
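(Rough, dry-run sketch of what that script does; the NodeNotFound matching and the kubeconfig variables here are simplifications, so treat it as illustrative only:)
```bash
#!/usr/bin/env bash
# Dry-run sketch: find Machines whose node was not found and check whether a
# healthy Node with the same name still exists in the downstream cluster.
# KUBECONFIG_LOCAL / KUBECONFIG_DOWNSTREAM are placeholders for the two
# kubeconfigs; nothing is actually deleted here.
set -euo pipefail

for m in $(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
    get machines.cluster.x-k8s.io -o name); do
  status=$(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
    get "$m" -o jsonpath='{.status}')
  if grep -qi 'nodenotfound' <<<"$status"; then
    node=$(kubectl --kubeconfig "$KUBECONFIG_LOCAL" -n fleet-default \
      get "$m" -o jsonpath='{.status.nodeRef.name}')
    if [ -n "$node" ] && kubectl --kubeconfig "$KUBECONFIG_DOWNSTREAM" \
        get node "$node" >/dev/null 2>&1; then
      echo "SKIP $m: healthy node $node still exists downstream"
    else
      echo "WOULD DELETE $m (node not found)"   # dry run only
    fi
  fi
done
```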
re: the cluster agent logs. I don't really know what to look for, but I assume this is ok. I went to one of my worker nodes that was in the downstream cluster earlier today, before I restored the local cluster from backups, and I see this as the last few lines. This is after the restore
```
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Adding server to load balancer rke2-api-server-agent-load-balancer: 10.114.38.114:6443"
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Updated load balancer rke2-api-server-agent-load-balancer server addresses -> [10.114.30.239:6
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Adding server to load balancer rke2-agent-load-balancer: 10.114.38.114:9345"
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Updated load balancer rke2-agent-load-balancer server addresses -> [10.114.30.239:9345 10.114.
Jun 05 22:52:58 ip-10-114-30-134.domain.org rke2[7804]: time="2024-06-05T22:52:58Z" level=info msg="Connecting to proxy" url="wss://10.114.38.114:9345/v1-rke2/connect"
```
c
that’s the rke2 log, not the cluster-agent pod log
a
oh, cattle-cluster-agent pod in cattle-system namespace in the downstream cluster?
There wasn't much in there, some prometheus-node-exporter spam in one and this in the other
```
time="2024-06-05T22:45:40Z" level=error msg="Error during subscribe websocket: close sent"
2024-06-05T22:46:00.942449069Z time="2024-06-05T22:46:00Z" level=error msg="Error during subscribe websocket: close sent"
2024-06-05T23:11:57.553310379Z W0605 23:11:57.552896      56 transport.go:301] Unable to cancel request for *client.addQuery
2024-06-05T23:11:57.567532009Z W0605 23:11:57.567164      56 transport.go:301] Unable to cancel request for *client.addQuery
```
I went ahead and deleted those one at a time; guess I could've just redeployed the cattle-cluster-agent deployment
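i.e. something like:
```bash
# Equivalent to deleting the pods one at a time: bounce the whole deployment.
kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
```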