# rke2
c
Did you check the rancher-system-agent journald logs on the node? Is rke2 even getting installed and the cattle-cluster-agent pod deployed?
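A quick way to follow both on the node, assuming the default systemd unit names:
journalctl -eu rancher-system-agent
journalctl -eu rke2-server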
m
It is. The cluster goes through the entire setup in fact as far as I can tell. The rancher system agent goes into just waiting for updates eventually. The rke2-server is watching like normal.
The rancher-system-agent is at this now:
msg="282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: Name                                     Location                                                                >
msg="[282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: etcd-snapshot-pnf-1689638402 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-pnf>
msg="[282b70f1536068eb685b501227878f9a6f5d6015318bb67e8610ba01acc9835f_0:stdout]: etcd-snapshot-pnf-dm-kcoap101-1689656404 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-pnfc>
msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
The only error I see is sometimes this one in rke2-server:
Jul 18 05:33:48 pnf-dm-kcoap101 rke2[803]: time="2023-07-18T05:33:48Z" level=warning msg="Failed to create Kubernetes secret: Internal error occurred: failed calling webhook \"rancher.cattle.io.se>
The webhook pod is running normally. I do notice a failed to sync schema showing up in there lately... but I did quite a bit of restarting and messing around on this poor node.
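The webhook side can also be checked directly on the downstream cluster; the rancher-webhook deployment normally lives in cattle-system:
kubectl -n cattle-system logs deploy/rancher-webhook --tail=100
kubectl get validatingwebhookconfigurations | grep -i rancher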
The cattle-cluster-agent is just sending these kinds of logs, nothing else: level=info msg="Watching metadata for /v1, Kind=PersistentVolumeClaim" There is also an updating TLS secret message in there, but it is just informational and seems to have gone through fine.
b
In our case we had to set the MTU size to 1400 because we've deployed RKE2 on Harvester & Hetzner VLANs, and big packets sometimes caused problems.
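A sketch of how that MTU can be pinned for the Calico that RKE2 bundles, via a HelmChartConfig dropped into the server manifests directory; the chart name (rke2-calico vs rke2-canal) and the exact value keys depend on the CNI and chart version, so treat this as an assumption to verify:
cat <<'EOF' > /var/lib/rancher/rke2/server/manifests/rke2-calico-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-calico
  namespace: kube-system
spec:
  valuesContent: |-
    installation:
      calicoNetwork:
        mtu: 1400
EOF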
In another environment we have the same problem as you, and I don't know what's causing it, so I'm also looking forward to finding a solution.
Also, we've noticed that the latest version of RKE2 (v1.25.11+rke2r1) is incompatible with the Harvester Cloud Provider, so we had to install the previous RKE2 version by checking
Show deprecated Kubernetes patch versions
m
Hm interesting. Our cluster is on custom nodes, so just VMs. What puzzles me is that the downstream is able to register back with the cattle-agent initially, but then the Rancher server acts like nothing is available. I'm able to curl for the cacert, for example. Something like an MTU issue would make some sense.
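For reference, the CA check from the node is usually just the /cacerts endpoint (rancher.example.com standing in for the real server URL):
curl -sk https://rancher.example.com/cacerts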
Oh also, my node is in the state Waiting for NodeRef at this point in the join process. Perhaps that is why it's non-ready.
b
if you SSH into each node, most likely you will see that RKE2 is up and running on a single node and not running on the other nodes.
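A quick per-node check, assuming server-role nodes run the rke2-server unit and worker-only nodes run rke2-agent:
systemctl status rke2-server rancher-system-agent   # on server nodes
systemctl status rke2-agent                         # on worker-only nodes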
m
On this deployment I turned it down to just all three roles on a single node. Previously I was able to join 5 nodes (two control plane and three etcd), but it would never try to join the workers. The RKE2 cluster itself was working fine in that state, but the upstream was still waiting for NodeRef.
Hm, I think the cattle agent on the downstream cluster does not trust the CA on the upstream one. I installed using the registration command and assumed that trust would be added automatically.
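One way to sanity-check that trust, assuming the standard agent env var: compare the checksum of the CA that Rancher serves with what the cluster agent was configured to expect.
curl -sk https://rancher.example.com/cacerts | sha256sum
kubectl -n cattle-system get deploy cattle-cluster-agent -o yaml | grep -A1 CATTLE_CA_CHECKSUM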
I think I've found the issue here. I have two interfaces on my nodes, and for some reason node ports are getting assigned to both.
b
yes, it's normal for node ports to be bound on all interfaces, but how would this behaviour cause the problem?
m
You're correct, removing it did not help. How does the Rancher server reach the downstream cluster? It seems unable to establish the connection after the initial one. I've been trying to find the IP it's trying to hit to make sure that connection is available.
c
it uses the hostname you configured when setting up Rancher
Rancher does not contact the cluster. The cluster agent phones home to Rancher.
all connections are outbound from the downstream cluster nodes, to rancher.
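Where those outbound connections point can be read off the agent config on both sides; CATTLE_SERVER is the env var on the cluster agent, and the node agent typically keeps its URL and token in /etc/rancher/agent/config.yaml:
kubectl -n cattle-system describe deploy cattle-cluster-agent | grep CATTLE_SERVER
cat /etc/rancher/agent/config.yaml   # rancher-system-agent connection info on the node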
m
Okay. Those connections work fine, and the cattle-cluster-agent even checks in before it goes to "non-ready bootstrap machine(s) custom-8ecee98cd4a0" and "join url to be available on bootstrap node".
c
is everything up and running on the node? Are all the pods running properly, or are there some things stuck pending or crashing?
m
Everything is healthy. There's nothing crashing, nor pending.
It's using Calico, which seems to have come online fine too.
c
what does the rancher-system-agent log in journald say?
m
Jul 18 18:17:35 pnf rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Jul 18 18:17:35 pnf-dm rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[ebe618a63d3454d696f83c85f49d0422c1bbf9ff136d16a1546f568a4fbca216_0:stdout]: Name Location Size Created"
Jul 18 18:17:35 pnf-dm rancher-system-agent[159080]: time="2023-07-18T18:17:35Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Before that it was just "CA cert for probe doesn't exist" for a while, until everything was running.
c
seems fine
does this node have all three roles? control-plane, etcd, worker?
m
Yes
k get nodes
NAME                            STATUS   ROLES                       AGE   VERSION
pnf-dm   Ready    control-plane,etcd,master   17m   v1.25.11+rke2r1
b
k get nodes -o wide
might be useful to be able to see private and public IPs
m
This time I assigned them on the join command, so both internal and external are the same (and only) IP of the node.
c
Have you tried updating to 2.7.5? There have been some significant changes to rancher’s cluster provisioning controllers since 2.7.3 came out.
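A sketch of that upgrade, assuming Rancher was installed from the rancher-stable Helm repo into cattle-system and the existing values should be carried over:
helm repo update
helm get values rancher -n cattle-system -o yaml > rancher-values.yaml
helm upgrade rancher rancher-stable/rancher -n cattle-system --version 2.7.5 -f rancher-values.yaml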
m
I can try that.
Unfortunate. The upgrade appears to be failing with a TLS handshake timeout as the server comes online.
Same behavior on 2.7.5 unfortunately. Waiting for NodeRef
I see in my logs that I get a 401 from some /v3/connect hits after I try to join the new cluster. Ingress:
"GET /v3/connect HTTP/1.1" 400 17 "-" "Go-http-client/1.1" 2656 0.001 [cattle-system-rancher-80]
So the downstream cluster can't seem to authenticate anymore while it's in this state, but I'm not sure what kind of authentication it's supposed to pass
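The cluster agent authenticates /v3/connect with the registration credentials it stores on the downstream cluster; one way to inspect what it's using (the secret name suffix and keys can vary by Rancher version, so this is a sketch):
kubectl -n cattle-system get secrets | grep cattle-credentials
kubectl -n cattle-system get secret cattle-credentials-xxxxx -o jsonpath='{.data.url}' | base64 -d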
Bunch of these in the rancher logs now:
2023/07/18 22:04:27 [ERROR] Failed to handle tunnel request from remote address 172.16.1.89:45130 (X-Forwarded-For: 10.104.98.2): response 400: cluster not found
2023/07/18 22:04:36 [ERROR] Failed to handle tunnel request from remote address 172.16.0.136:58616 (X-Forwarded-For: 10.104.98.2): response 401: failed authentication
c
you might try deleting and re-creating the cluster? Uninstall rke2 and rancher-system-agent from the node if you’re going to reuse it, then delete and recreate the cluster, and re-register the node.
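For the node cleanup, the install scripts ship uninstall helpers; the paths can vary by install method but are typically:
/usr/local/bin/rke2-killall.sh
/usr/local/bin/rke2-uninstall.sh
/usr/local/bin/rancher-system-agent-uninstall.sh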
m
I've done that a few times to no avail. It always ends up in the same state, even with a freshly imaged node.
c
hmm. hard to guess what else might be going on then. Does the cattle-cluster-agent pod or anything else deployed to the downstream cluster have any clues?
is the rancher server URL resolvable from the node but not a pod?
or anything else weird like that?
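A quick way to compare resolution from the node versus from inside the cluster (busybox used here only as a throwaway test pod):
nslookup rancher.example.com   # on the node
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup rancher.example.com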
m
It doesn't. There are almost no messages at all inside the cattle-cluster-agent. That's the pod that makes these registration calls from the downstream, right?
I can actually curl from inside the cattle-cluster-agent pod to my Rancher server, and authenticate with my own token
c
that is what handles comms with the cluster as a whole, yeah
m
As long as I use the --insecure flag on curl, that is. But I wasn't passing a CA or anything along with my own token when I tried the call.
It's like the downstream cluster has the wrong information suddenly, but the node was completely clean before using the install command.
And the cattle-rancher-agent is able to initially connect and download the cert from Rancher.
c
did you change the cert config on Rancher after bringing it up, and miss updating something somewhere? Is Rancher using the same cert signed by the same CA that the ingress is using?
m
It is. Hmm. I uploaded the full-chain CA cert after adding the ingress TLS cert earlier.
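A way to see exactly which chain the Rancher ingress is presenting, to compare against the uploaded CA (hostname is a stand-in):
openssl s_client -connect rancher.example.com:443 -servername rancher.example.com -showcerts </dev/null | openssl x509 -noout -subject -issuer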
Hm. I just recreated a new node in the same VLAN as my RMC and it worked... so something must be going on with the network here. Maybe something is stripping some of the authentication someplace?
c
firewall doing mitm or something?
m
Hmm, still not sure actually. Our RMC has a POD CIDR of 172.16.0.0/22. Our downstream has a POD CIDR of 172.18.0.0/15 and will not work. It gets a 401 Authentication Error: Proxy Authentication Failed from the NGINX on the RMC. However, if I make a downstream in the same network with a POD CIDR of 10.10.0.0/16, it just works right away. We do use proxies, but all of them have a no_proxy in /etc/default/rke2-server that includes 10.0.0.0/8 and 172.0.0.0/8, just in case it somehow decides to proxy. We see the requests hitting the nginx ingress on the RMC though, so I don't think it's getting proxied.
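Worth confirming what proxy settings actually got picked up on the downstream, since the env file, the live unit environment, and the agent pod can disagree:
cat /etc/default/rke2-server
systemctl show rke2-server --property=Environment
kubectl -n cattle-system describe deploy cattle-cluster-agent | grep -i proxy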
b
Rancher is accessible from each node without proxy? (from what I've noticed, RKE2 agent doesn't use the proxy to connect back to Rancher)
m
It is, yes. And if we change the POD CIDR it will connect just fine back to the RMC over the same underlying network configuration.
I can access the RMC from within the cattle-agent pod also, with curl. But on the ingress logs we see 401s responding to all of the final update requests.
How being in the 172.0.0.0/8 space on the POD CIDR makes any difference to being in the 10.0.0.0/8 is really baffling to me. These IP ranges shouldn't even be visible on the traffic going from cattle-agent to RMC right?
b
tomorrow I can share step by step how we've created our RKE2 cluster behind a proxy, if you're interested. Do you use Harvester or another cloud provider?
m
No, they're custom nodes. The thing is, we have three other RMCs running with ~10 clusters downstream each without issue. The same process doesn't work for this one somehow, though.
But this is the first new RMC in a while, so something must be misconfigured somewhere....
b
do you use the same Rancher version and same Kubernetes version?
m
Newer of each, unfortunately. And those RMCs are RKE1 instead of RKE2, although the downstreams are RKE2.
c
Are you using some sort of bgp load balancer or something that exposes the pods outside the rancher cluster? There's not normally any reason that pods on either cluster would talk directly to each other or have any reason to know what the pod cidrs on other clusters are.
m
We aren't at the moment, but we will eventually use BIG-IP on the downstream. I'm not trying to set that up yet though, so they shouldn't know each other's IPs.
Okay. So I added no_proxy for the RANCHER INSTALL to include 172.0.0.0/8 and it started working. So the POD CIDR of the Calico network is now included in the no_proxy
(the downstream POD CIDR)
I did not expect the RMC to care about these IPs or to ever know them... but it seems that the ingress was trying to forward to them through the proxy.
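For reference, widening noProxy on the Rancher install side would look something like this, assuming the chart's proxy/noProxy values are in use (the list below is the chart's usual default with 172.16.0.0/12 widened to 172.0.0.0/8, per the change described above):
helm upgrade rancher rancher-stable/rancher -n cattle-system --reuse-values \
  --set noProxy="127.0.0.0/8\,10.0.0.0/8\,172.0.0.0/8\,192.168.0.0/16\,.svc\,.cluster.local"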