# general
Hi all, I'm running Rancher 2.8.2 on a K3s v1.27 local cluster, with an RKE2 1.26 downstream cluster. Yesterday we had an issue with our EC2 security groups that dropped the rules for ports 80, 2379, and 5432. I've fixed that, but I'm still having major problems with Rancher and with adding nodes to the downstream cluster. Rancher is up, but when I restart the pods or redeploy Rancher I get this in the logs:
2024-05-30T17:09:32.206261727Z 2024/05/30 17:09:32 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller
2024-05-30T17:09:32.206801192Z 2024/05/30 17:09:32 [INFO] Starting /v1, Kind=Namespace controller
2024-05-30T17:09:35.202244878Z 2024/05/30 17:09:35 [ERROR] Failed to connect to peer <wss://10.42.4.4/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024-05-30T17:09:35.311101078Z 2024/05/30 17:09:35 [ERROR] Failed to connect to peer <wss://10.42.3.9/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024/05/30 17:09:40 [ERROR] Failed to connect to peer <wss://10.42.4.4/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024/05/30 17:09:40 [ERROR] Failed to connect to peer <wss://10.42.3.9/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024/05/30 17:09:41 [INFO] Shutting down /v1, Kind=Secret workers
Sometimes the pod will come up, but then I'll see this in the logs:
2024-05-30T17:14:12.369798110Z 2024/05/30 17:14:12 [ERROR] Failed to handle tunnel request from remote address 10.42.0.3:33394: response 400: cluster not found
2024-05-30T17:14:17.375865188Z 2024/05/30 17:14:17 [ERROR] Failed to handle tunnel request from remote address 10.42.0.3:35190: response 400: cluster not found
2024-05-30T17:14:22.386047691Z 2024/05/30 17:14:22 [ERROR] Failed to handle tunnel request from remote address 10.42.0.3:35200: response 400: cluster not found
2024-05-30T17:14:27.393378442Z 2024/05/30 17:14:27 [ERROR] Failed to handle tunnel request from remote address 10.42.0.3:45058: response 400: cluster not found
I'm wondering if the networking issues over the last 24 hours caused some unrecoverable problems? I can restore my K3s database if I need to, but I'd like to just fix this if possible.
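Since the dropped security group rules covered ports 80, 2379, and 5432, one way to confirm they are reachable again from a given node is a plain TCP probe. This is only a minimal sketch: the function names are made up for illustration, and which hosts to probe (server nodes, datastore endpoint, load balancer) depends on the setup and is not stated in this thread.

```python
import socket

# Ports whose security group rules were dropped, as listed in the message above.
PORTS = (80, 2379, 5432)


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports=PORTS, timeout: float = 3.0) -> dict:
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example (hypothetical address, run from another node in the same VPC):
# print(check_ports("10.0.0.10"))
```

A successful connect only shows that the security group and listener allow TCP traffic; it says nothing about whether the service behind the port is healthy.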
I most recently added a new K3s manager node, and the Rancher pod is in CrashLoopBackOff with this at the end of the log:
2024-05-30T17:17:01.951281245Z 2024/05/30 17:17:01 [ERROR] Failed to connect to peer <wss://10.42.4.4/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024-05-30T17:17:02.253623255Z 2024/05/30 17:17:02 [ERROR] Failed to connect to peer <wss://10.42.3.9/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024-05-30T17:17:06.955583813Z 2024/05/30 17:17:06 [ERROR] Failed to connect to peer <wss://10.42.4.4/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024-05-30T17:17:07.335815489Z 2024/05/30 17:17:07 [ERROR] Failed to connect to peer <wss://10.42.3.9/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
2024-05-30T17:17:08.350933086Z I0530 17:17:08.350733      34 trace.go:236] Trace[192623510]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231 (30-May-2024 17:16:58.348) (total time: 10002ms):
2024-05-30T17:17:08.350987927Z Trace[192623510]: ---"Objects listed" error:<nil> 9983ms (17:17:08.331)
2024-05-30T17:17:08.350996977Z Trace[192623510]: [10.00227518s] [10.00227518s] END
2024-05-30T17:17:11.567671377Z 2024/05/30 17:17:11 [ERROR] failed to start cluster controllers c-m-sxkpljbn: context canceled
2024-05-30T17:17:11.56781078 2024/05/30 17:17:11 [INFO] requested to terminate, exiting
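To see which peer addresses keep failing the handshake, the log can be tallied with a few lines of Python. This is a rough sketch that assumes the log was saved to a file and that the lines look like the ones pasted above; `failed_peers` is just an illustrative name. Comparing the result against the pod IPs from `kubectl get pods -n cattle-system -o wide` would show whether the peers are Rancher pods that still exist.

```python
import re
from collections import Counter

# Matches the peer address in lines such as:
# [ERROR] Failed to connect to peer <wss://10.42.4.4/v3/connect> [local ID=10.42.0.3]: websocket: bad handshake
PEER_RE = re.compile(r"Failed to connect to peer <?wss://([0-9.]+)/v3/connect")


def failed_peers(log_text: str) -> Counter:
    """Count how often each peer IP appears in 'Failed to connect to peer' errors."""
    return Counter(PEER_RE.findall(log_text))


# Example (hypothetical file name):
# with open("rancher.log", encoding="utf-8") as fh:
#     for ip, count in failed_peers(fh.read()).most_common():
#         print(ip, count)
```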
I think it's clearing itself up after rotating out my K3s manager nodes. It did leave over 1300 machine resources in Reconciling status, though, from all of the EC2 instances that tried to join the RKE2 cluster and were then deleted when they couldn't join...
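For sizing up the leftovers before cleaning anything, here is a sketch that lists machines that never got a node. It rests on assumptions not confirmed in this thread: that the leftover resources are Cluster API Machine objects (`machines.cluster.x-k8s.io`), that they were exported with `kubectl get machines.cluster.x-k8s.io -A -o json > machines.json`, and that a missing `status.nodeRef` is a fair proxy for "never joined". The Reconciling state shown in the UI is not read directly here.

```python
import json


def load_machines(path: str) -> dict:
    """Load a `kubectl get ... -o json` export (a List object with an 'items' array)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def machines_without_node(machine_list: dict) -> list:
    """Return (namespace, name) for every Machine that has no status.nodeRef."""
    stuck = []
    for item in machine_list.get("items", []):
        meta = item.get("metadata", {})
        status = item.get("status") or {}
        # Assumption: a Machine whose instance never joined has no nodeRef set.
        if not status.get("nodeRef"):
            stuck.append((meta.get("namespace", ""), meta.get("name", "")))
    return stuck


# Example (hypothetical file name):
# stuck = machines_without_node(load_machines("machines.json"))
# print(len(stuck), "machines without a node")
```

The script only reads the export and prints candidates; reviewing the list by hand before deleting anything seems safer than automating the cleanup.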