# rke2
a
a
Actual IPs
b
Yes, I have that
a
but also cluster-cidr:
and service-cidr:
all need both
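For reference, in a standalone RKE2 setup both address families go into each of these settings in /etc/rancher/rke2/config.yaml; a minimal dual-stack sketch with placeholder prefixes (not this cluster's actual values):

```yaml
# /etc/rancher/rke2/config.yaml - dual-stack sketch, placeholder prefixes
cluster-cidr: "10.42.0.0/16,fd00:42::/56"
service-cidr: "10.43.0.0/16,fd00:43::/112"
```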
b
yes
a
::1 localhost in hosts file?
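i.e. entries along these lines in /etc/hosts:

```
127.0.0.1   localhost
::1         localhost
```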
b
yes
b
@plain-byte-79620
b
reason: WaitingForNodeRef
severity: Info
status: 'False'
type: NodeHealthy
Looks like they get stuck on this
message: Cluster agent is not connected
reason: Disconnected
status: 'False'
type: Ready
a
Which rancher version
If you upgraded from 2.7.9 to 2.8.1, downgrade to 2.8.0 first, then go to 2.8.1+
b
Rancher: v2.8.2 Cluster: v1.27.10+rke2r1
It's a fresh 2.8.2 install
a
interesting
What if you downgrade either way?
😄
b
The Rancher is a manually installed RKE2 cluster on six fresh SUSE machines too.
And it's installed with FluxCD from helm chart.
a
huh
b
I am not sure how to bootstrap a Rancher environment correctly.
a
There is no right way, just many ways 🙂
p
Are you deploying them with the Rancher-generated script from the UI?
b
yes
p
I think you should configure the IP from that script; there should be a flag, I think it's named node-address or something.
b
Interesting, I tried to find some docs about that but was unable to. My Google-fu around Rancher is still in training mode, though.
p
I can check in a test UI where you can find them
b
Yes please.
"-a" | "--address") CATTLE_ADDRESS="$2"
is it this one?
"-i" | "--internal-address") CATTLE_INTERNAL_ADDRESS="$2"
or this?
p
--address
you have to add it manually with ipv4,ipv6
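For example, composed from the node's own static addresses (the IPs below are placeholders):

```shell
#!/bin/sh
# Build the comma-separated IPv4,IPv6 value for the agent install
# script's --address flag. Both addresses stand in for the node's
# own static IPs.
NODE_IPV4="172.16.135.51"
NODE_IPV6="2a07:beef:5:2002:be24:11ff:fe60:115d"
NODE_ADDRESS="${NODE_IPV4},${NODE_IPV6}"
echo "$NODE_ADDRESS"   # prints 172.16.135.51,2a07:beef:5:2002:be24:11ff:fe60:115d
```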
b
2024-03-05T09:14:17.629601+00:00 ranch1 rke2[1662]: time="2024-03-05T09:14:17Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
2024-03-05T09:14:21.167593+00:00 ranch1 rke2[1662]: time="2024-03-05T09:14:21Z" level=info msg="Waiting for API server to become available"
2024-03-05T09:14:22.359715+00:00 ranch1 rke2[1662]: time="2024-03-05T09:14:22Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Here are the logs from the machine now; it would be interesting to know what the 500 error is.
p
you should wait until the node is up, I think
b
I'm giving it a few minutes then - fetching a cup of hot java
p
you can check the status with kubectl inside the node
b
ranch1:~ # /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/kubectl --kubeconfig /var/lib/rancher/rke2/agent/kubelet.kubeconfig get nodes
NAME     STATUS   ROLES                       AGE     VERSION
ranch1   Ready    control-plane,etcd,master   3h15m   v1.27.10+rke2r1
After a long while this is showing on the first node, but in the Rancher GUI it still says "Waiting for node".
Seems like the ClusterCIDR/ServiceCIDR is ignored.
This is the cluster CIDR configured in Rancher for the cluster: fd7cb6d3041a5064:/64,10.60.0.0/16. And this is the IP of one of the pods in kube-system, which looks like the "default" CIDR when you don't provide one: IP: 10.42.0.18
And the same with the service IPs: they are from 10.43.0.0/16, but I configured 10.61.0.0/16 in Rancher.
p
where did you configure the CIDR?
b
In the rancher ui for the new cluster.
There is a node driver notice. But I checked those and nothing feels applicable.
Sorry for the mobile UI, I am out and about today.
p
But it's IPv6 only; you have to specify ipv4,ipv6, and I think there is an "enable IPv6" flag near the box where you configure the CNI
b
No, there are both an IPv4 and an IPv6 CIDR in both boxes.
And the ipv6 box is ticked.
p
are you sure that when you run the install script on the node there isn't any older RKE2 setup left? could you try to run
rke2-uninstall.sh
before installing the new setup through rancher?
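If it helps, a minimal sketch of what to look for afterwards -- the helper function and the path list are assumptions for illustration, not an official check:

```shell
#!/bin/sh
# Hypothetical helper: list well-known RKE2 state directories still
# present under a given root. On a real node you would pass "/" after
# running rke2-uninstall.sh; anything printed suggests stale state
# that could confuse a fresh registration.
leftover_rke2_state() {
  root="$1"
  for d in etc/rancher/rke2 var/lib/rancher/rke2 var/lib/kubelet; do
    if [ -e "$root/$d" ]; then
      echo "$d"
    fi
  done
}

# Demo against a throwaway root so the sketch runs anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/var/lib/rancher/rke2"
leftover_rke2_state "$demo"   # prints var/lib/rancher/rke2
rm -rf "$demo"
```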
b
Yes, I can do that. I'm away all day today; will check back here when done.
status:
  bootstrapReady: true
  conditions:
    - lastTransitionTime: '2024-03-05T09:08:01Z'
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-03-05T09:08:01Z'
      status: 'True'
      type: BootstrapReady
    - lastTransitionTime: '2024-03-05T09:07:59Z'
      status: 'True'
      type: InfrastructureReady
    - lastTransitionTime: '2024-03-05T09:07:59Z'
      reason: WaitingForNodeRef
      severity: Info
      status: 'False'
      type: NodeHealthy
  lastUpdated: '2024-03-05T09:08:01Z'
  observedGeneration: 2
  phase: Provisioning
Same effect, it gets stuck like this.
curl -fL https://rancher.domain.tld/system-agent-install.sh | sudo sh -s - --server https://rancher.domain.tld --label 'cattle.io/os=linux' --token <jfr...> --address "172.16.135.51,2a07:beef:5:2002:be24:11ff:fe60:115d" --etcd --controlplane --worker
This is the command line for the install, and the addresses are the static node IPs for IPv6 and IPv4.
p
the IPs are from the same interface? Could you check the RKE2 logs on the node?
b
yes, they are from the same interface
What specific log should I look in?
p
you could check the status of the server with
systemctl status rke2-server.service
to check if there are any errors
b
Mar 06 20:24:52 ranch1 rke2[19110]: time="2024-03-06T20:24:52Z" level=error msg="Failed to process config: lstat /var/lib/rancher/rke2/server/manifests: no such file or directory"
Mar 06 20:25:05 ranch1 rke2[19110]: {"level":"warn","ts":"2024-03-06T20:25:05.078178Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc000d12>>
Mar 06 20:25:05 ranch1 rke2[19110]: time="2024-03-06T20:25:05Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Mar 06 20:25:07 ranch1 rke2[19110]: time="2024-03-06T20:25:07Z" level=error msg="Failed to process config: lstat /var/lib/rancher/rke2/server/manifests: no such file or directory"
Mar 06 20:25:14 ranch1 rke2[19110]: {"level":"warn","ts":"2024-03-06T20:25:14.947945Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc000d12>>
Mar 06 20:25:14 ranch1 rke2[19110]: {"level":"info","ts":"2024-03-06T20:25:14.948009Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Mar 06 20:25:20 ranch1 rke2[19110]: {"level":"warn","ts":"2024-03-06T20:25:20.078694Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"<etcd-endpoints://0xc000d12>>
Mar 06 20:25:20 ranch1 rke2[19110]: time="2024-03-06T20:25:20Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Mar 06 20:25:20 ranch1 rke2[19110]: time="2024-03-06T20:25:20Z" level=fatal msg="leaderelection lost for rke2-etcd"
Seems like etcd is nowhere to be found. I have only started (or rather installed) one node of three so far.
p
How many nodes are up now?
b
1
but 0 from Rancher's point of view; that one is stuck in WaitingForNodeRef
p
but if you check with kubectl inside the node, is it ready or not?
b
Hmm, it's looking strange now; I will tear it all down and rebuild the machines, I think.
I rebuilt the machines and deployed an identical cluster without any IPv6, and it just worked directly. I will try to rebuild them again, but now set a static IPv6 address instead of the permanent SLAAC address I used before, to see if it makes a difference.
I also defined a custom ClusterCIDR and ServiceCIDR and it got configured as expected.
p
so did it work?
b
For IPv4 it worked.
I have not yet had time to do it again with IPv6 and a static address for all nodes.