shy-zebra-53074
02/28/2023, 4:20 PM
etcd user, etc….
I’m deploying in a 3-master HA configuration. What I am seeing is that sometimes (not always) one of the master nodes (could be A, B or C) will hang. I am unable to SSH in or run any commands against the node, and it never recovers. I have to log in to the AWS console and manually stop and start the node, at which point it joins and I can add workers. I am trying to make sure I comply exactly with the supported runtime and requirements, as this will be a large production cluster.
Has anyone else seen this before? My process is: 1) Start Master A, wait 5m 2) Add the join token to B/C 3) Start Master B, wait 5m 4) Start Master C, wait 5m… The entire process is automated with Ansible. As I stated, this doesn’t happen every time.
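For comparison with the fixed 5m waits in that process, here is a minimal sketch of gating each step on node readiness instead. It assumes `kubectl` is on the PATH of Master A with a working kubeconfig. The function names `not_ready_nodes` and `wait_for_all_ready`, the 10-second poll interval, and the default timeout are hypothetical and not part of the original playbook.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, poll interval, and timeout are illustrative assumptions.

# Reads `kubectl get nodes` output on stdin and prints the names of nodes
# whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Polls until every registered node is Ready or the timeout (seconds) expires.
# Returns 0 when all nodes are Ready, 1 on timeout.
wait_for_all_ready() {
  local timeout="${1:-600}" interval=10 elapsed=0 out pending
  while [ "$elapsed" -lt "$timeout" ]; do
    # Require kubectl to succeed with non-empty output, so an unreachable
    # API server is never mistaken for "no NotReady nodes".
    if out="$(kubectl get nodes 2>/dev/null)" && [ -n "$out" ]; then
      pending="$(printf '%s\n' "$out" | not_ready_nodes)"
      [ -z "$pending" ] && return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

Something like `wait_for_all_ready 900 || exit 1` could then run on Master A before the next master is started. It only sees nodes that have already registered, so a node that hangs before joining would still need a separate check.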
One thing I was doing was SSHing into Master A and running watch -n1 "kubectl get nodes", so I’m not sure if querying the API server while things are getting going would have this effect. I also provided all of the logs to AWS support for them to review for OOM errors, but nothing looked out of the ordinary.
RKE2 Version: 1.26.0+rke2r2
OS: RHEL 8.5 - minimal install
Should I maybe add more time (more than 5m) between the start of each Master node?
creamy-pencil-82913
02/28/2023, 4:58 PM
shy-zebra-53074
02/28/2023, 5:01 PM
ls does nothing
[root@ip-192-168-0-10 ~]# kubectl get nodes
NAME                                                STATUS     ROLES                       AGE     VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready      control-plane,etcd,master   33m     v1.26.0+rke2r2
ip-192-168-112-162.us-gov-east-1.compute.internal   Ready      worker                      9m14s   v1.26.0+rke2r2
ip-192-168-134-112.us-gov-east-1.compute.internal   Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   27m     v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready      control-plane,etcd,master   20m     v1.26.0+rke2r2
ip-192-168-53-32.us-gov-east-1.compute.internal     Ready      worker                      9m24s   v1.26.0+rke2r2
ip-192-168-76-222.us-gov-east-1.compute.internal    Ready      worker                      9m32s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal    NotReady   <none>                      9m12s   v1.26.0+rke2r2
ip-192-168-98-61.us-gov-east-1.compute.internal     Ready      worker                      9m15s   v1.26.0+rke2r2
ip-192-168-85-181.us-gov-east-1.compute.internal
[root@ip-192-168-0-10 ~]# ssh admin@192.168.85.181 -i /home/admin/.ssh/aws.us-gov-east-1.dev
creamy-pencil-82913
02/28/2023, 5:25 PM
shy-zebra-53074
02/28/2023, 5:26 PM
creamy-pencil-82913
02/28/2023, 5:47 PM
shy-zebra-53074
02/28/2023, 5:49 PM
creamy-pencil-82913
02/28/2023, 5:52 PM
shy-zebra-53074
02/28/2023, 5:57 PM
creamy-pencil-82913
02/28/2023, 5:57 PM
shy-zebra-53074
02/28/2023, 5:57 PM
creamy-pencil-82913
02/28/2023, 5:58 PM
shy-zebra-53074
02/28/2023, 5:58 PM
[root@ip-192-168-0-10 ~]# cat /proc/sys/vm/panic_on_oom
0
creamy-pencil-82913
02/28/2023, 5:58 PM
shy-zebra-53074
02/28/2023, 5:59 PM
[root@ip-192-168-0-10 ~]# swapon -s
Filename    Type        Size      Used   Priority
/dev/dm-1   partition   1023996   0      -2
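The two checks above could be collected into one script and run on each node. This is only a minimal sketch: the function names `swap_device_count` and `report_oom_swap` are hypothetical, and it reports the values without judging whether they are appropriate for RKE2.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative assumptions.

# Reads `swapon -s` output on stdin and prints the number of active
# swap devices (every non-empty line after the header row).
swap_device_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Prints the kernel panic_on_oom setting and the active swap device count
# for the node the script runs on.
report_oom_swap() {
  printf 'panic_on_oom=%s\n' "$(cat /proc/sys/vm/panic_on_oom)"
  printf 'swap_devices=%s\n' "$(swapon -s | swap_device_count)"
}
```

Calling `report_oom_swap` on each master and worker gives one line per setting, which makes differences between nodes easier to spot.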
creamy-pencil-82913
02/28/2023, 6:00 PM
shy-zebra-53074
02/28/2023, 6:01 PM
creamy-pencil-82913
02/28/2023, 6:01 PM
shy-zebra-53074
02/28/2023, 6:01 PM
kubectl while things are standing up have any effect? I know I’m probably reaching here.
creamy-pencil-82913
02/28/2023, 6:05 PM
shy-zebra-53074
02/28/2023, 6:06 PM
Every 3.0s: kubectl get nodes    ip-192-168-0-10.us-gov-east-1.compute.internal: Tue Feb 28 13:32:54 2023
NAME                                                STATUS   ROLES                       AGE   VERSION
ip-192-168-0-10.us-gov-east-1.compute.internal      Ready    control-plane,etcd,master   44m   v1.26.0+rke2r2
ip-192-168-105-44.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-118-161.us-gov-east-1.compute.internal   Ready    worker                      19m   v1.26.0+rke2r2
ip-192-168-134-49.us-gov-east-1.compute.internal    Ready    worker                      20m   v1.26.0+rke2r2
ip-192-168-16-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   37m   v1.26.0+rke2r2
ip-192-168-32-10.us-gov-east-1.compute.internal     Ready    control-plane,etcd,master   31m   v1.26.0+rke2r2
ip-192-168-56-154.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-64-223.us-gov-east-1.compute.internal    Ready    worker                      21m   v1.26.0+rke2r2
ip-192-168-94-18.us-gov-east-1.compute.internal     Ready    worker                      21m   v1.26.0+rke2r2
hundreds-airport-66196
03/02/2023, 1:41 AM