# rke2
c
Did you check the rancher-system-agent journald logs on the node? Is rke2 even getting installed and the cattle-cluster-agent pod deployed?
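A quick way to follow both on the node, assuming the default systemd unit names:
journalctl -eu rancher-system-agent
journalctl -eu rke2-server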
m
It is. The cluster goes through the entire setup in fact as far as I can tell. The rancher system agent goes into just waiting for updates eventually. The rke2-server is watching like normal.
The rancher-system-agent is at this now:
msg="282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: Name                                     Location                                                                >
msg="[282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: etcd-snapshot-pnf-1689638402 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-pnf>
msg="[282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: etcd-snapshot-pnf-dm-kcoap101-1689656404 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-pnfc>
msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
The only error I see is sometimes this one in rke2-server:
Jul 18 05:33:48 pnf-dm-kcoap101 rke2[803]: time="2023-07-18T05:33:48Z" level=warning msg="Failed to create Kubernetes secret: Internal error occurred: failed calling webhook \"rancher.cattle.io.se>
The webhook pod is running normally. I do notice a failed to sync schema showing up in there lately... but I did quite a bit of restarting and messing around on this poor node.
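The webhook side can also be checked directly on the downstream cluster; the rancher-webhook deployment normally lives in cattle-system:
kubectl -n cattle-system logs deploy/rancher-webhook --tail=100
kubectl get validatingwebhookconfigurations | grep -i rancher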
The cattle-cluster-agent is just sending these kinds of logs, nothing else: level=info msg="Watching metadata for /v1, Kind=PersistentVolumeClaim" There is also an updating TLS secret message in there, but it is just informational and seems to have gone through fine.
b
In our case we had to set the MTU size to 1400 because we've deployed RKE2 on Harvester & Hetzner VLANs, and big packets sometimes caused problems.
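A sketch of how that MTU can be pinned for the Calico that RKE2 bundles, via a HelmChartConfig dropped into the server manifests directory; the chart name (rke2-calico vs rke2-canal) and the exact value keys depend on the CNI and chart version, so treat this as an assumption to verify:
cat <<'EOF' > /var/lib/rancher/rke2/server/manifests/rke2-calico-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-calico
  namespace: kube-system
spec:
  valuesContent: |-
    installation:
      calicoNetwork:
        mtu: 1400
EOF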
In another environment we have the same problem as you, and I don't know what's causing it, so I'm also looking forward to finding a solution.
Also, we've noticed that the latest version of RKE2 (v1.25.11+rke2r1) is incompatible with the Harvester Cloud Provider, so we had to install the previous RKE2 version by checking
Show deprecated Kubernetes patch versions
m
Hm interesting. Our cluster is on custom nodes, so just VMs. What puzzles me is that the downstream is able to register back with the cattle-agent initially, but then the Rancher server acts like nothing is available. I'm able to curl for the cacert, for example. Something like an MTU issue would make some sense.
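For reference, the CA check from the node is usually just the /cacerts endpoint (rancher.example.com standing in for the real server URL):
curl -sk https://rancher.example.com/cacerts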
Oh also, my node is in the state Waiting for NodeRef at this point in the join process. Perhaps that is why it's non-ready.
b
if you SSH into each node, most likely you will see that RKE2 is up and running on a single node and not running on the other nodes.
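A quick per-node check, assuming server-role nodes run the rke2-server unit and worker-only nodes run rke2-agent:
systemctl status rke2-server rancher-system-agent   # on server nodes
systemctl status rke2-agent                         # on worker-only nodes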
m
On this deployment I turned it down to just all three roles on a single node. Previously I was able to join 5 nodes (two control plane and three etcd), but it would never try to join the workers. The RKE2 cluster itself was working fine in that state, but the upstream was still waiting for NodeRef.
Hm, I think the cattle agent on the downstream cluster does not trust the CA on the upstream one. I installed using the registration command and assumed that trust would be added automatically.
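One way to sanity-check that trust, assuming the standard agent env var: compare the checksum of the CA that Rancher serves with what the cluster agent was configured to expect.
curl -sk https://rancher.example.com/cacerts | sha256sum
kubectl -n cattle-system get deploy cattle-cluster-agent -o yaml | grep -A1 CATTLE_CA_CHECKSUM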
I think I've found the issue here. I have two interfaces on my nodes, and for some reason node ports are getting assigned to both.
b
yes, it's normal for node ports to be bound on all interfaces, but how would this behaviour cause the problem?
m
You're correct, removing it did not help. How does the Rancher server reach the downstream cluster? It seems unable to establish the connection after the initial one. I've been trying to find the IP it's trying to hit to make sure that connection is available.
c
it uses the hostname you configured when setting up Rancher
Rancher does not contact the cluster. The cluster agent phones home to Rancher.
all connections are outbound from the downstream cluster nodes, to rancher.
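Where those outbound connections point can be read off the agent config on both sides; CATTLE_SERVER is the env var on the cluster agent, and the node agent typically keeps its URL and token in /etc/rancher/agent/config.yaml:
kubectl -n cattle-system describe deploy cattle-cluster-agent | grep CATTLE_SERVER
cat /etc/rancher/agent/config.yaml   # rancher-system-agent connection info on the node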
m
Okay. Those connections work fine, and the cattle-cluster-agent even checks in before it goes to "non-ready bootstrap machine(s) custom-8ecee98cd4a0" and "join url to be available on bootstrap node".
c
is everything up and running on the node? Are all the pods running properly, or are there some things stuck pending or crashing?
m
Everything is healthy. There's nothing crashing, nor pending.
It's using Calico, which seems to have come online fine too.
c
what does the rancher-system-agent log in journald say?
m
Jul 18 18:17:35 pnf rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Jul 18 18:17:35 pnf-dm rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[ebe618a63d3454d696f83c85f49d0422c1bbf9ff136d16a1546f568a4fbca216_0:stdout]: Name Location Size Created"
Jul 18 18:17:35 pnf-dm rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Before that it was just "CA cert for probe doesn't exist" for a while, until everything was running.
c
seems fine
does this node have all three roles? control-plane, etcd, worker?
m
Yes
k get nodes
NAME                            STATUS   ROLES                       AGE   VERSION
pnf-dm   Ready    control-plane,etcd,master   17m   v1.25.11+rke2r1
b
k get nodes -o wide
might be useful to be able to see private and public IPs
m
This time I assigned them on the join command, so both internal and external are the same (and only) IP of the node.
c
Have you tried updating to 2.7.5? There have been some significant changes to rancher’s cluster provisioning controllers since 2.7.3 came out.
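A sketch of that upgrade, assuming Rancher was installed from the rancher-stable Helm repo into cattle-system and the existing values should be carried over:
helm repo update
helm get values rancher -n cattle-system -o yaml > rancher-values.yaml
helm upgrade rancher rancher-stable/rancher -n cattle-system --version 2.7.5 -f rancher-values.yaml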
m
I can try that.
Unfortunate. The upgrade appears to be failing with a TLS handshake timeout as the server comes online.
Same behavior on 2.7.5 unfortunately. Waiting for NodeRef
I see in my logs that I get a 401 from some /v3/connect hits after I try to join the new cluster. Ingress:
"GET /v3/connect HTTP/1.1" 400 17 "-" "Go-http-client/1.1" 2656 0.001 [cattle-system-rancher-80]
So the downstream cluster can't seem to authenticate anymore while it's in this state, but I'm not sure what kind of authentication it's supposed to pass
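The cluster agent authenticates /v3/connect with the registration credentials it stores on the downstream cluster; one way to inspect what it's using (the secret name suffix and keys can vary by Rancher version, so this is a sketch):
kubectl -n cattle-system get secrets | grep cattle-credentials
kubectl -n cattle-system get secret cattle-credentials-xxxxx -o jsonpath='{.data.url}' | base64 -d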
Bunch of these in the rancher logs now:
2023/07/18 22:04:27 [ERROR] Failed to handle tunnel request from remote address 172.16.1.89:45130 (X-Forwarded-For: 10.104.98.2): response 400: cluster not found
2023/07/18 22:04:36 [ERROR] Failed to handle tunnel request from remote address 172.16.0.136:58616 (X-Forwarded-For: 10.104.98.2): response 401: failed authentication
c
you might try deleting and re-creating the cluster? Uninstall rke2 and rancher-system-agent from the node if you’re going to reuse it, then delete and recreate the cluster, and re-register the node.
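For the node cleanup, the install scripts ship uninstall helpers; the paths can vary by install method but are typically:
/usr/local/bin/rke2-killall.sh
/usr/local/bin/rke2-uninstall.sh
/usr/local/bin/rancher-system-agent-uninstall.sh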
m
I've done that a few times to no avail. It always ends up in the same state, even with a freshly imaged node.
c
hmm. hard to guess what else might be going on then. Does the cattle-cluster-agent pod or anything else deployed to the downstream cluster have any clues?
is the rancher server URL resolvable from the node but not a pod?
or anything else weird like that?
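A quick way to compare resolution from the node versus from inside the cluster (busybox used here only as a throwaway test pod):
nslookup rancher.example.com   # on the node
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup rancher.example.com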
m
It doesn't. There are almost no messages at all inside the cattle-cluster-agent. That's the pod that makes these registration calls from the downstream, right?
I can actually curl from inside the cattle-cluster-agent pod to my Rancher server, and authenticate with my own token
c
that is what handles comms with the cluster as a whole, yeah
m
As long as I use the --insecure flag on curl, that is. But I wasn't passing a CA or anything along with my own token when I tried the call.
It's like the downstream cluster has the wrong information suddenly, but the node was completely clean before using the install command.
And the cattle-rancher-agent is able to initially connect and download the cert from Rancher.
c
did you change the cert config on Rancher after bringing it up, and miss updating something somewhere? Is Rancher using the same cert signed by the same CA that the ingress is using?
m
It is. Hmm. I uploaded the full-chain CA cert after adding the ingress TLS cert earlier.
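A way to see exactly which chain the Rancher ingress is presenting, to compare against the uploaded CA (hostname is a stand-in):
openssl s_client -connect rancher.example.com:443 -servername rancher.example.com -showcerts </dev/null | openssl x509 -noout -subject -issuer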
Hm. I just recreated a new node in the same VLAN as my RMC and it worked... so something must be going on with the network here. Maybe something is stripping some of the authentication someplace?
c
firewall doing mitm or something?
m
Hmm, still not sure actually. Our RMC has a POD CIDR of 172.16.0.0/22. Our downstream has a POD CIDR of 172.18.0.0/15 and will not work. It gets a 401 Authentication Error: Proxy Authentication Failed from the NGINX on the RMC. However, if I make a downstream in the same network with a POD CIDR of 10.10.0.0/16, it just works right away. We do use proxies, but all of them have a no_proxy in /etc/default/rke2-server that includes 10.0.0.0/8 and 172.0.0.0/8, just in case it somehow decides to proxy. We see the requests hitting the nginx ingress on the RMC though, so I don't think it's getting proxied.
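Worth confirming what proxy settings actually got picked up on the downstream, since the env file, the live unit environment, and the agent pod can disagree:
cat /etc/default/rke2-server
systemctl show rke2-server --property=Environment
kubectl -n cattle-system describe deploy cattle-cluster-agent | grep -i proxy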
b
Rancher is accessible from each node without proxy? (from what I've noticed, RKE2 agent doesn't use the proxy to connect back to Rancher)
m
It is, yes. And if we change the POD CIDR it will connect just fine back to the RMC over the same underlying network configuration.
I can access the RMC from within the cattle-agent pod also, with curl. But on the ingress logs we see 401s responding to all of the final update requests.
How being in the 172.0.0.0/8 space on the POD CIDR makes any difference to being in the 10.0.0.0/8 is really baffling to me. These IP ranges shouldn't even be visible on the traffic going from cattle-agent to RMC right?
b
tomorrow I can share step by step how we've created our RKE2 cluster behind a proxy, if you're interested. Do you use Harvester or another cloud provider?
m
No, they're custom nodes. The thing is, we have three other RMCs running with ~10 clusters downstream each without issue. The same process doesn't work for this one somehow, though.
But this is the first new RMC in a while, so something must be misconfigured somewhere....
b
do you use the same Rancher version and same Kubernetes version?
m
Newer of each, unfortunately. And those RMCs are RKE1 instead of RKE2, although the downstreams are RKE2.
c
Are you using some sort of bgp load balancer or something that exposes the pods outside the rancher cluster? There's not normally any reason that pods on either cluster would talk directly to each other or have any reason to know what the pod cidrs on other clusters are.
m
We aren't at the moment, but we will eventually use BIG-IP on the downstream. I'm not trying to set that up yet though, so they shouldn't know each other's IPs.
Okay. So I added no_proxy for the RANCHER INSTALL to include 172.0.0.0/8 and it started working. So the POD CIDR of the Calico network is now included in the no_proxy
(the downstream POD CIDR)
I did not expect the RMC to care about these IPs or to ever know them... but it seems that the ingress was trying to forward to them through the proxy.
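For reference, widening noProxy on the Rancher install side would look something like this, assuming the chart's proxy/noProxy values are in use (the list below is the chart's usual default with 172.16.0.0/12 widened to 172.0.0.0/8, per the change described above):
helm upgrade rancher rancher-stable/rancher -n cattle-system --reuse-values \
  --set noProxy="127.0.0.0/8\,10.0.0.0/8\,172.0.0.0/8\,192.168.0.0/16\,.svc\,.cluster.local"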