# rke2
c
RKE2 supports ServiceLB (klipper-lb); it's just disabled by default. You can enable it easily enough.
That issue you found is quite old.
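(For reference, a minimal sketch of enabling it, assuming the standard RKE2 config file path and that the rke2-server service is restarted afterwards so the flag takes effect:)
```yaml
# /etc/rancher/rke2/config.yaml
# Sketch: turn on the bundled ServiceLB (klipper-lb) controller, which is disabled by default.
enable-servicelb: true
```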
m
Awesome, I had not seen this PR. Thanks!
I also do not see it in the latest server configuration docs, FWIW: https://docs.rke2.io/reference/server_config
But it does show up in the output of `rke2 server --help | grep enable-servicelb`.
Hey @creamy-pencil-82913, if I enable both the AWS cloud provider and ServiceLB, like so:
cloud-provider-name: aws
cloud-provider-config: "/etc/rancher/rke2/cloud.conf" # only EC2 autoscale is setup
enable-servicelb: true
Then `cloud-controller-manager` goes into a restart loop:
NAME                                                           READY   STATUS      RESTARTS       AGE
cloud-controller-manager-ip-x-x-x-x.region.compute.internal    0/1     Running     3 (64s ago)    13m
cloud-controller-manager-ip-y-y-y-y.region.compute.internal    0/1     Running     3 (57s ago)    12m
cloud-controller-manager-ip-z-z-z-z.region.compute.internal    0/1     Running     6 (90s ago)    25m
LoadBalancer services can still be created, however, and I am able to connect to those services. But this pod never transitions to a Ready state. When I disable the cloud provider, it goes into a Ready state. Is that expected? I also tried setting
disable-cloud-controller: true
but `cloud-controller-manager` still runs with the restart behavior. The events and logs don't seem to indicate much except for this, and I can't find much online about it. So I was wondering if you might know what the problem is?
Warning  Unhealthy  72s (x30 over 6m2s)   kubelet  Startup probe failed: Get "https://localhost:10258/healthz": dial tcp [::1]:10258: connect: connection refused
c
What release of RKE2 are you running? It must be a little old, as the AWS cloud provider is gone from newer releases of Kubernetes.
This might be a bit of an odd configuration, since you're trying to use parts of two different cloud controllers: the in-tree AWS cloud provider and the out-of-tree RKE2 cloud provider. Your best bet is probably to not use the in-tree AWS cloud provider, but instead deploy the AWS cloud provider via a HelmChart.
What sorts of log messages do you see in the cloud controller pod logs?
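(A rough sketch of the HelmChart approach mentioned above, not taken from this thread: RKE2's bundled Helm controller applies HelmChart manifests dropped into the server's manifests directory. The chart repo URL, chart name, and values shown here are assumptions based on the upstream cloud-provider-aws chart and on the values quoted later in this conversation, so they would need to be verified against that chart's documentation.)
```yaml
# Assumed path: /var/lib/rancher/rke2/server/manifests/aws-ccm.yaml
# Sketch: deploy the out-of-tree AWS cloud controller manager through the
# HelmChart CRD that RKE2's Helm controller reconciles.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: aws-cloud-controller-manager
  namespace: kube-system
spec:
  repo: https://kubernetes.github.io/cloud-provider-aws   # assumed chart repository
  chart: aws-cloud-controller-manager                     # assumed chart name
  targetNamespace: kube-system
  valuesContent: |-
    # Mirrors the values used later in this thread; adjust for your environment.
    nodeSelector:
      node-role.kubernetes.io/control-plane: "true"
    args:
      - --configure-cloud-routes=false
      - --cloud-provider=aws
```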
m
I see. Yeah, I figured there was something odd in this configuration. We're trying to balance rapid dev iteration on AWS, but our end goal is to deploy at the edge (no AWS). I wonder if the out-of-tree AWS cloud provider would still conflict with LB creation. Each cloud controller pod shows only these two log lines:
I0829 16:39:51.868208       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0829 16:39:51.868715       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
Thanks for the tips. I'll look into whether using the out-of-tree AWS cloud provider solves this.
c
Check the other pods; that one's not active, there's another one holding the lease.
m
Oh, you're right. Two of them are stuck acquiring the lease, while one of them looks OK. The RKE2 version is 1.25.12, but interestingly the k3s version 1.26.3 in the logs is mismatched?
I0829 16:48:55.909440       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0829 16:48:55.909885       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
I0829 16:49:11.826660       1 leaderelection.go:258] successfully acquired lease kube-system/rke2-cloud-controller-manager
I0829 16:49:11.826838       1 event.go:294] "Event occurred" object="kube-system/rke2-cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip.compute.internal_3be410ca-ed7d-42cf-a20e-5705884b82d7 became leader"
time="2023-08-29T16:49:11Z" level=info msg="Creating service-controller event broadcaster"
time="2023-08-29T16:49:12Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting /v1, Kind=Pod controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
time="2023-08-29T16:49:12Z" level=info msg="Starting <http://discovery.k8s.io/v1|discovery.k8s.io/v1>, Kind=EndpointSlice controller"
I0829 16:49:12.446502       1 controllermanager.go:311] Started "service"
W0829 16:49:12.446517       1 controllermanager.go:288] "route" is disabled
W0829 16:49:12.446521       1 controllermanager.go:288] "cloud-node" is disabled
W0829 16:49:12.446525       1 controllermanager.go:288] "cloud-node-lifecycle" is disabled
I0829 16:49:12.446622       1 controller.go:227] Starting service controller
I0829 16:49:12.446639       1 shared_informer.go:273] Waiting for caches to sync for service
I0829 16:49:12.547574       1 shared_informer.go:280] Caches are synced for service
I0829 16:49:12.547735       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0829 16:49:12.551680       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="AppliedDaemonSet" message="Applied LoadBalancer DaemonSet kube-system/svclb-hello-9329e302"
They're each in a CrashLoop now since yesterday. Once the leader restarts, one of the other replicas acquires the lease. There doesn't seem to be a log indicating what caused the previous crash.
c
You might use `kubectl logs --previous` to look at the logs from the crashed pod.
We update the cloud controller version asynchronously and it doesn't change much, so it's expected that the versions won't match 1:1.
m
Yeah, `--previous` doesn't have additional logs.
c
Hmm, is it just getting killed due to the health check failing?
What error are you getting on the health check when you look at the pod status?
m
I was just able to get back to this now. Yes, the pods are stuck not Ready:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    message: 'containers with unready status: [cloud-controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    message: 'containers with unready status: [cloud-controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-08-30T23:28:39Z"
    status: "True"
    type: PodScheduled
We have 3 control plane nodes; here are the logs from each. (I obfuscated the IPs for privacy reasons.)
$ kubectl -n kube-system logs cloud-controller-manager-a
I0830 23:33:33.630303       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:33:33.630806       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...

$ kubectl -n kube-system logs cloud-controller-manager-b
I0830 23:32:41.339963       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:32:41.340429       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
I0830 23:32:58.320255       1 leaderelection.go:258] successfully acquired lease kube-system/rke2-cloud-controller-manager
I0830 23:32:58.320324       1 event.go:294] "Event occurred" object="kube-system/rke2-cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip-_0328c001-e000-45b2-9200-b49b119cdb71 became leader"
time="2023-08-30T23:32:58Z" level=info msg="Creating service-controller event broadcaster"
time="2023-08-30T23:32:58Z" level=info msg="Starting /v1, Kind=Node controller"
time="2023-08-30T23:32:59Z" level=info msg="Starting /v1, Kind=Pod controller"
time="2023-08-30T23:32:59Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
W0830 23:32:59.691042       1 controllermanager.go:288] "cloud-node-lifecycle" is disabled
time="2023-08-30T23:32:59Z" level=info msg="Starting <http://discovery.k8s.io/v1|discovery.k8s.io/v1>, Kind=EndpointSlice controller"
I0830 23:32:59.691346       1 controllermanager.go:311] Started "service"
W0830 23:32:59.691356       1 controllermanager.go:288] "route" is disabled
W0830 23:32:59.691360       1 controllermanager.go:288] "cloud-node" is disabled
I0830 23:32:59.691485       1 controller.go:227] Starting service controller
I0830 23:32:59.691500       1 shared_informer.go:273] Waiting for caches to sync for service
I0830 23:32:59.791739       1 shared_informer.go:280] Caches are synced for service
I0830 23:32:59.791908       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0830 23:32:59.795705       1 event.go:294] "Event occurred" object="default/hello" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="AppliedDaemonSet" message="Applied LoadBalancer DaemonSet kube-system/svclb-hello-9c549e63"
$ kubectl -n kube-system logs cloud-controller-manager-c
I0830 23:33:46.427698       1 controllermanager.go:152] Version: v1.26.3-k3s1
I0830 23:33:46.428187       1 leaderelection.go:248] attempting to acquire leader lease kube-system/rke2-cloud-controller-manager...
c
It seems OK; I wonder why the health checks are failing.
m
Yeah, I get the exact same behavior when running the out-of-tree aws-cloud-controller-manager 🙁. Worth noting, I need to add these values to the Helm chart for it to work:
nodeSelector:
  node-role.kubernetes.io/control-plane: "true"
args:
- --configure-cloud-routes=false
- --cloud-provider=aws
It's a bit unfortunate, as we wanted to have EC2 autoscaling alongside ServiceLB for LB services, but for that I think we would have to maintain a custom controller manager. We'll figure out our path forward from here. Thanks for your help, Brad! Learned a lot from this.