# amazon
c
EKS using EC2 instances, or an EKS cluster built with Fargate?
c
What containers do you see running on your EKS node(s)? When the nodes get stuck in "waiting", that usually means the agent is trying to poll or resolve access to something and it will just run in a loop. Sometimes I've found the logs on the Rancher server side, sometimes on the agent side, and sometimes in the kubelet logs.
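A few concrete places to look, as a rough sketch (paths assume RKE2's default install locations on the downstream node):
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs deploy/cattle-cluster-agent   # agent-side logs
journalctl -u rke2-server -f   # RKE2 service logs (use rke2-agent on worker-only nodes)
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log   # kubelet logs as written by RKE2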
w
I reinstalled Rancher directly on top of RKE2 (on 2 EC2 instances) and tried to create a cluster, but am running into the same problem (Rancher says: "[Waiting] configuring bootstrap node(s) custom-93e446a91a4b: waiting for probes: kubelet"). Looking at the node where I'm trying to deploy the cluster, I can see that the cluster has been created, but 2 pods are unhappy:
NAMESPACE       NAME                                    READY   STATUS              RESTARTS       AGE
cattle-system   cattle-cluster-agent-6988b48fd5-gzbkv   0/1     ContainerCreating   0              101m
cattle-system   cattle-cluster-agent-7c887c6f7b-pm5wl   0/1     CrashLoopBackOff    6 (105m ago)   112m
and "/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml logs cattle-cluster-agent-6988b48fd5-gzbkv -n cattle-system" displays:
Error from server: Get "https://10.252.12.47:10250/containerLogs/cattle-system/cattle-cluster-agent-6988b48fd5-gzbkv/cluster-register": dial tcp 127.0.0.1:9345: connect: connection refused
(10.252.12.47 is the IP of the node)
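A quick way to check what is actually listening on the node (a sketch; 6443 is the kube-apiserver, 9345 the RKE2 supervisor, 10250 the kubelet):
ss -tlnp | grep -E '6443|9345|10250'
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps   # containers actually running under RKE2's containerd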
c
Do you have all the correct ports open in your AWS security group attached to the instances?
and the EKS nodes?
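For reference, these are roughly the ports Rancher/RKE2 need open between the nodes, as a sketch using the AWS CLI (sg-0123456789abcdef0 is a placeholder security group ID):
for p in 443 6443 9345 10250 2379 2380; do
  aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port "$p" --source-group sg-0123456789abcdef0
done
# Canal/Flannel VXLAN overlay traffic
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol udp --port 8472 --source-group sg-0123456789abcdef0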
w
yes, the security group attached to the EC2 instances allows ALL (ingress/egress).
on the node where I tried to install RKE2 (using the CLI provided by Rancher), I have:
# /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml cluster-info
E0710 19:20:53.244086   97898 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:20:53.262613   97898 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:20:53.267281   97898 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:20:53.271121   97898 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Kubernetes control plane is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/rke2-coredns-rke2-coredns:udp-53/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
E0710 19:21:44.809057   98261 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:21:44.827734   98261 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:21:44.831533   98261 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0710 19:21:44.834755   98261 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME              STATUS     ROLES                       AGE     VERSION
ip-10-252-12-47   NotReady   control-plane,etcd,master   3h31m   v1.26.6+rke2r1
but it cannot join Rancher
c
What's the size of the instance you run RKE2 on?
w
they are t2.medium, 20GB storage
I managed to make it work by running this on the node on which I'm deploying the cluster from Rancher:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system patch deployments cattle-cluster-agent --patch '{"spec": {"template": {"spec": {"hostAliases": [{"hostnames":["fab-rancher.local"],"ip": "10.252.12.18"}]}}}}'
I'm using self-signed certficates, and somehow coreDns is not working
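For reference, that hostAliases patch just injects an /etc/hosts entry (fab-rancher.local → 10.252.12.18) into the agent pods, so they can reach the Rancher URL even while in-cluster DNS is unhealthy; one way to check it took effect:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system get deploy cattle-cluster-agent -o jsonpath='{.spec.template.spec.hostAliases}{"\n"}'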
c
For a 1-node cluster, a t2.medium feels a bit tight for running Rancher. I personally run Rancher on a 1-node K3s cluster on a t3a.large instance.
c
If you are using a self-signed cert for Rancher, then that may be your issue. Generally I put an ALB in front of Rancher with an ACM cert, or use the Let's Encrypt integration to generate an LE cert. I'm not aware of any way to tell the Rancher agent to ignore the CA or accept any cert.
You could also try this project to get a signed LE cert on a local address/node: https://www.getlocalcert.net/
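For the Let's Encrypt route, the Rancher Helm chart can request the certificate itself; a rough sketch, assuming cert-manager is already installed and using placeholder hostname/email values:
helm upgrade --install rancher rancher-latest/rancher \
  --namespace cattle-system --create-namespace \
  --set hostname=rancher.example.com \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=admin@example.com \
  --set letsEncrypt.ingress.class=nginx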
w
Thanks for the replies. I re-installed Rancher on RKE2 on EC2, behind a network load balancer, using the FQDN of the NLB in the Rancher config for the time being, and so far so good: deploying a K8s cluster on AWS EC2 from Rancher works (with the right user :-) )