# rke2
c
RKE2 supports ServiceLB (klipper-lb); it's just disabled by default. You can enable it easily enough.
That issue you found is quite old.
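(For reference, a minimal sketch of enabling it, assuming the standard RKE2 config file path and that the rke2-server service is restarted afterwards so the flag takes effect:)
```yaml
# /etc/rancher/rke2/config.yaml
# Sketch: turn on the bundled ServiceLB (klipper-lb) controller, which is disabled by default.
enable-servicelb: true
```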
m
Awesome, I had not seen this PR. Thanks!
I also do not see it in the latest server configuration docs, FWIW: https://docs.rke2.io/reference/server_config
But it does show up in the output of `rke2 server --help | grep enable-servicelb`.
Hey @creamy-pencil-82913, if I enable both the AWS cloud provider and ServiceLB, like so:
cloud-provider-name: aws
cloud-provider-config: "/etc/rancher/rke2/cloud.conf" # only EC2 autoscale is setup
enable-servicelb: true
Then `cloud-controller-manager` goes into a restart loop:
NAME                                                           READY   STATUS      RESTARTS       AGE
cloud-controller-manager-ip-x-x-x-x.region.compute.internal    0/1     Running     3 (64s ago)    13m
cloud-controller-manager-ip-y-y-y-y.region.compute.internal    0/1     Running     3 (57s ago)    12m
cloud-controller-manager-ip-z-z-z-z.region.compute.internal    0/1     Running     6 (90s ago)    25m
LoadBalancer services can still be created, however, and I am able to connect to those services. But this pod never transitions to a Ready state. When I disable the cloud provider, it goes into a Ready state. Is that expected? I also tried setting
disable-cloud-controller: true
but `cloud-controller-manager` still runs with the restart behavior. The events and logs don't seem to indicate much except for this, and I can't find much online about it. So I was wondering if you might know what the problem is?
Warning  Unhealthy  72s (x30 over 6m2s)   kubelet  Startup probe failed: Get "https://localhost:10258/healthz": dial tcp [::1]:10258: connect: connection refused
c
What release of RKE2 are you running? It must be a little old, as the AWS cloud provider is gone from newer releases of Kubernetes.
This might be a bit of an odd configuration, since you're trying to use parts of two different cloud controllers: the in-tree AWS cloud provider and the out-of-tree RKE2 cloud provider. Your best bet is probably to not use the in-tree AWS cloud provider, but instead deploy the AWS cloud provider via a HelmChart.
What sorts of log messages do you see in the cloud controller pod logs?
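(A rough sketch of the HelmChart approach mentioned above, not taken from this thread: RKE2's bundled Helm controller applies HelmChart manifests dropped into the server's manifests directory. The chart repo URL, chart name, and values shown here are assumptions based on the upstream cloud-provider-aws chart and on the values quoted later in this conversation, so they would need to be verified against that chart's documentation.)
```yaml
# Assumed path: /var/lib/rancher/rke2/server/manifests/aws-ccm.yaml
# Sketch: deploy the out-of-tree AWS cloud controller manager through the
# HelmChart CRD that RKE2's Helm controller reconciles.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  repo: https://kubernetes.github.io/cloud-provider-aws   # assumed chart repository
  chart: aws-cloud-controller-manager                     # assumed chart name
  targetNamespace: kube-system
  valuesContent: |-
    # Mirrors the values used later in this thread; adjust for your environment.
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"
    args:
      - --configure-cloud-routes=false
      - --cloud-provider=aws
```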
m
I see. Yeah, I figured there was something odd in this configuration. We're trying to balance rapid dev iteration on AWS, but our end goal is to deploy at the edge (no AWS). I wonder if the out-of-tree AWS cloud provider would still conflict with LB creation. Each cloud controller pod shows only these two log lines:
I0829 16:39:51.868208       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0829 16:39:51.868715       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
Thanks for the tips. I'll look into whether using the out-of-tree AWS cloud provider solves this.
c
Check the other pods; that one's not active, there's another one holding the lease.
m
Oh, you're right. Two of them are stuck acquiring the lease, while one of them looks OK. The RKE2 version is 1.25.12, but interestingly the k3s version 1.26.3 in the logs is mismatched?
I0829 16:48:55.909440       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0829 16:48:55.909885       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
I0829 16:49:11.826660       1 leaderelection.go:258] successfully acquired lease kube-system/rke2-cloud-controller-manager
I0829 16:49:11.826838       1 event.go:294] "Event occurred" object="kube-system/rke2-cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip.compute.internal_3be410ca-ed7d-42cf-a20e-5705884b82d7 became leader"
time="2023-08-29T16:49:11Z" level=info msg="Creating service-controller event broadcaster"
time="2023-08-29T16:49:12Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting /v1, Kind=Pod controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting <http://discovery.k8s.io/v1|discovery.k8s.io/v1>, Kind=EndpointSlice controller"
I0829 16:49:12.446502       1 controllermanager.go:311] Started "service"
W0829 16:49:12.446517       1 controllermanager.go:288] "route" is disabled
W0829 16:49:12.446521       1 controllermanager.go:288] "cloud-node" is disabled
W0829 16:49:12.446525       1 controllermanager.go:288] "cloud-node-lifecycle" is disabled
I0829 16:49:12.446622       1 controller.go:227] Starting service controller
I0829 16:49:12.446639       1 shared_informer.go:273] Waiting for caches to sync for service
I0829 16:49:12.547574       1 shared_informer.go:280] Caches are synced for service
I0829 16:49:12.547735       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0829 16:49:12.551680       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="AppliedDaemonSet" message="Applied LoadBalancer DaemonSet kube-system/svclb-hello-9329e302"
They're each in a CrashLoop now since yesterday. Once the leader restarts, one of the other replicas acquires the lease. There doesn't seem to be a log indicating what caused the previous crash.
c
You might use `kubectl logs --previous` to look at the logs from the crashed pod.
We update the cloud controller version asynchronously and it doesn't change much, so it's expected that the versions won't match 1:1.
m
Yeah, `--previous` doesn't have additional logs.
c
Hmm, is it just getting killed due to the health check failing?
What error are you getting on the health check when you look at the pod status?
m
I was just able to get back to this now. Yes, the pods are stuck not Ready:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    message: 'containers with unready status: [cloud-controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    message: 'containers with unready status: [cloud-controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    status: "True"
    type: PodScheduled
We have 3 control plane nodes; here are the logs from each. (I obfuscated the IPs for privacy reasons.)
$ kubectl -n kube-system logs cloud-controller-manager-a
I0830 23:33:33.630303       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:33:33.630806       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...

$ kubectl -n kube-system logs cloud-controller-manager-b
I0830 23:32:41.339963       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:32:41.340429       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
I0830 23:32:58.320255       1 leaderelection.go:258] successfully acquired lease kube-system/rke2-cloud-controller-manager
I0830 23:32:58.320324       1 event.go:294] "Event occurred" object="kube-system/rke2-cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip-_0328c001-e000-45b2-9200-b49b119cdb71 became leader"
time="2023-08-30T23:32:58Z" level=info msg="Creating service-controller event broadcaster"
time="2023-08-30T23:32:58Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-30T23:32:59Z" level=info msg="Starting /v1, Kind=Pod controller"
time="2023-08-30T23:32:59Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
W0830 23:32:59.691042       1 controllermanager.go:288] "cloud-node-lifecycle" is disabled
time="2023-08-30T23:32:59Z" level=info msg="Starting <http://discovery.k8s.io/v1|discovery.k8s.io/v1>, Kind=EndpointSlice controller"
I0830 23:32:59.691346       1 controllermanager.go:311] Started "service"
W0830 23:32:59.691356       1 controllermanager.go:288] "route" is disabled
W0830 23:32:59.691360       1 controllermanager.go:288] "cloud-node" is disabled
I0830 23:32:59.691485       1 controller.go:227] Starting service controller
I0830 23:32:59.691500       1 shared_informer.go:273] Waiting for caches to sync for service
I0830 23:32:59.791739       1 shared_informer.go:280] Caches are synced for service
I0830 23:32:59.791908       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0830 23:32:59.795705       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="AppliedDaemonSet" message="Applied LoadBalancer DaemonSet kube-system/svclb-hello-9c549e63"
$ kubectl -n kube-system logs cloud-controller-manager-c
I0830 23:33:46.427698       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:33:46.428187       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
c
It seems OK; I wonder why the health checks are failing.
m
Yeah, I get the exact same behavior when running the out-of-tree aws-cloud-controller-manager 🙁. Worth noting, I need to add these values to the Helm chart for it to work:
nodeSelector:
  node-role.kubernetes.io/control-plane: "true"
args:
- --configure-cloud-routes=false
- --cloud-provider=aws
It's a bit unfortunate, as we wanted to have EC2 autoscaling alongside ServiceLB for LB services, but for that I think we would have to maintain a custom controller manager. We'll figure out our path forward from here. Thanks for your help, Brad! Learned a lot from this.