If you are seeing rancher artifacts deployed to th...
# general
c
If you are seeing rancher artifacts deployed to the imported cluster, then the rancher -> cluster connectivity is working, it is the cluster -> rancher path that is not.
n
Ok yeah when I run the curl, in
target
cluster, it has successfully created the rancher pieces. So the error that it’s bubbling up is not actually the issue but rather it’s from the
target
cluster to the rancher url in
source
cluster?
I’ve allowed NAT IPs for
target
for •
source
cluster rancher LB •
source
cluster node SG where the rancher pod resides The error persists 😞 It’s tough to troubleshoot if the error bubbled up is not actually the error In the source cluster I see the following logs
Copy code
2025/09/08 16:39:00 [INFO] Loaded configuration from <https://releases.rancher.com/kontainer-driver-metadata/release-v2.12/data.json> in [0x10b7ed60]
I0908 16:39:01.080445 42 warnings.go:110] "Warning: v1 ComponentStatus is deprecated in v1.19+"
2025/09/08 16:39:20 [INFO] Handling backend connection request [c-9jl5c]
2025/09/08 16:39:49 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
c-9jl5c
is the unique string i see for the registration yaml from
source
that I ran in the
target
cluster I’ll try re-running the curl but I don’t think the config has changed, just SGs
c
I’m not sure what you mean by source and target
n
source is the mgmt eks cluster that rancher is installed in and target is the cluster im trying to import
c
We call clusters managed by Rancher “downstream clusters”. The cluster rancher is running in is the “local cluster”, as it is named “local” in the rancher UI.
n
Ah ok mb. source = local cluster and target = downstream cluster
c
Downstream clusters need to be able to reach the rancher UI endpoint via tls+websockets on port 443, same as you access via your browser. They do not connect to the local cluster apiserver.
so, if the rancher UI is behind an ELB/ALB, you need to open that up to your downstream clusters. If that would require peering VPCs because it is not public, you’ll need to do that - or make it public.
n
The rancher UI is behind an internet facing ALB (public). The SG allows the NAT IP from downstream cluster. On the downstream cluster on the rancher
cattle-cluster-agent
pod I can even curl the Rancher URL and get a response
c
what do you see in the cluster agent pod logs on the downstream cluster
n
Copy code
time="2025-09-08T16:48:39Z" level=info msg="starting cattle-credential-cleanup goroutine in the background"                                      │
│ time="2025-09-08T16:48:39Z" level=info msg="Listening on /tmp/log.sock"                                                                          │
│ time="2025-09-08T16:48:39Z" level=info msg="Rancher agent version v2.12.1 is starting"                                                           │
│ time="2025-09-08T16:48:39Z" level=error msg="unable to read CA file from /etc/kubernetes/ssl/certs/serverca: open /etc/kubernetes/ssl/certs/serv │
│ time="2025-09-08T16:48:40Z" level=info msg="Connecting to <wss://rancher.MYDOMAIN.com/v3/connect/register> with token starting with k4zhc5 │
│ time="2025-09-08T16:48:40Z" level=info msg="Connecting to proxy" url="<wss://rancher.MYDOMAIN.com/v3/connect/register>"
c
are you sure that websockets works through your ALB?
curl will test basic https connectivity, but that does not mean that websockets work
n
What is the way that the docs say to test that / configure the ALB to ensure it connects? I wasn’t able to find my issue or any information in the documentation unfortunately. A lot of this has been self discovery
c
can you check the ALB log to confirm that it is allowing the websocket upgrade request?
n
I think I was able to test it
Copy code
curl -vk --http1.1 "https://${HOST}/v3/subscribe" \
  -H "Cookie: ${RSESS}; ${CSRF}" \
  -H "X-Api-Csrf: ${CSRF#CSRF=}" \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Origin: https://${HOST}"
Returns
Copy code
< HTTP/1.1 101 Switching Protocols
< Date: Mon, 08 Sep 2025 17:42:19 GMT
< Connection: upgrade
< Upgrade: websocket
< Sec-WebSocket-Accept: s3pPLMBiTa3sdfYGzaa2adK+xOo=
I regenerated the registration token on the local cluster . Still didn’t work. Tried several other things as well as adding the CA cert chain in case that error mattered (it didn’t) I fixed it by nuking all rancher resources in the downstream cluster -> deleting the import in local cluster -> retrying import. So it looks like it should have worked how I had it configured but it wasn’t retrying and required an explicit delete / import existing cluster