s

silly-solstice-24970

11/29/2022, 3:58 PM
Hi all, We have a 12-node (bare-metal) cluster using rke2. During the weekend, we lost 1 node and lost the connection with Rancher UI. I was able to find that, from the cluster end, there are 2 hooks into Rancher, but I haven't been able to find hooks on Rancher's side. - Rancher UI runs in a 3-node (VMs) cluster. After discarding everything I could think of, my theory is that from Rancher there was only 1 hook/connection point. Is that the case? If so, where can I find it? (what sort of resource is it and in which ns does it run). If not, any theory of why 1 node going down would disconnect the cluster from Rancher UI? Thanks!
a

agreeable-actor-89488

11/29/2022, 4:19 PM
I cannot answer this question unfortunately, since we are only using RKE1 and haven't had an issue where 1 node going down caused the cluster to be disconnected from the UI; the only thing it did was make the cluster show as unhealthy in the Rancher UI until we fixed the issue with the one node. However, we are interested in RKE2 as well, but the documentation seems a bit confusing. We have only used RKE1. It seems like the install is different for Rancher Manager using RKE2: based on some documentation we found, we must install the RKE2 server first and then install Rancher Manager on top of that, not just create a cluster with RKE2. Is that correct? We can't just use the basic Rancher install to use RKE2?
a

agreeable-oil-87482

11/29/2022, 4:31 PM
How are you exposing the rancher service? I assume ingress; therefore, how is your ingress controller service configured? You should be able to reach the rancher UI by hitting any of the rancher pods, assuming it's exposed correctly
@agreeable-actor-89488 - the install process for rancher is the same on rke1 or rke2. They're just k8s distributions
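For reference, a rough way to trace that exposure chain, assuming a standard Helm install of Rancher in the cattle-system namespace and an nginx ingress controller (object names and namespaces may differ in your setup):
kubectl -n cattle-system get ingress rancher        # hostname and the backing service
kubectl -n cattle-system get svc rancher            # ClusterIP service behind the ingress
kubectl get svc -A | grep -i ingress                # how the ingress controller itself is exposed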
s

silly-solstice-24970

11/29/2022, 4:46 PM
my apologies, I meant to say Rancher UI is 2.6.6 and rke is at v1.3.12
@agreeable-oil-87482 that's my question, because I know the exposure point is an ingress (via nginx-ingress-controller), but I don't see how, or which node of the cluster, it is using as a proxy
a

agreeable-oil-87482

11/29/2022, 4:50 PM
How are you exposing the ingress controller service? i.e. NodePort or LoadBalancer, etc.
s

silly-solstice-24970

11/29/2022, 4:52 PM
we are using metallb, and the ingress-controller service is configured as LoadBalancer
it’s pretty much the standard rke installation
what I don’t know is how Rancher communicates to the cluster
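As a minimal sketch of how to confirm that setup, assuming MetalLB in its default metallb-system namespace and the ingress controller in ingress-nginx (both assumptions; adjust to your install):
kubectl -n ingress-nginx get svc                    # TYPE should be LoadBalancer, EXTERNAL-IP the MetalLB VIP
kubectl -n metallb-system get pods -o wide          # controller pod plus one speaker pod per node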
a

agreeable-oil-87482

11/29/2022, 4:53 PM
And your rancher hostname resolves to the VIP of your ingress load balancer service?
s

silly-solstice-24970

11/29/2022, 4:53 PM
yes
a

agreeable-oil-87482

11/29/2022, 4:54 PM
So I would check the metallb logs. If a node went down the VIP should have moved over to another node assuming you're using metallb in layer 2 mode.
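If it helps, something like the following should show whether the VIP announcement moved to a surviving node around the failure (label and namespace assumed from a default MetalLB manifest install):
kubectl -n metallb-system logs -l component=speaker --tail=500 | grep -i announc
# in layer 2 mode the speaker on the node that owns the VIP logs the announcement events,
# so you can see whether ownership moved when the node went down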
s

silly-solstice-24970

11/29/2022, 4:55 PM
rancher has several other (4) clusters, none of them failed
a

agreeable-oil-87482

11/29/2022, 4:55 PM
Downstream clusters communicate with Rancher by establishing a websocket connection to the rancher URL address
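A quick way to sanity-check that path from the downstream side (hostname here is whatever your agents have as CATTLE_SERVER; /ping is Rancher's health-check endpoint):
curl -sk https://<rancher-hostname>/ping            # should return "pong" if the Rancher URL is reachable
nslookup <rancher-hostname>                         # confirms the hostname still resolves to the ingress VIP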
s

silly-solstice-24970

11/29/2022, 4:55 PM
so it’s not bidirectional?
a

agreeable-oil-87482

11/29/2022, 4:55 PM
There's an agent pod that manifests in the downstream clusters that facilitates this
s

silly-solstice-24970

11/29/2022, 4:56 PM
Yes, the agent running on the cluster was the one I was able to see
my theory (which now I realize doesn't apply) was that there was an agent on rancher's end as well
a

agreeable-oil-87482

11/29/2022, 4:57 PM
It's initiated by the agent pod, but once established it's bidirectional
s

silly-solstice-24970

11/29/2022, 4:58 PM
the weird part is that I can see there are 2 agents on the cluster side, but only one failed and communication was lost 😕
a

agreeable-oil-87482

11/29/2022, 4:58 PM
There's both node agents and cluster agents
You should only have one cluster agent pod per cluster
s

silly-solstice-24970

11/29/2022, 4:59 PM
yes, you are correct, there's only one fleet-agent
we have 2 cluster agents
so, theoretically, if the node that is running fleet-agent goes down (out of the blue), the connection can be lost?
a

agreeable-oil-87482

11/29/2022, 5:04 PM
Ignore the fleet agent, the cluster agent is the focus for your issue
If the cluster agent pod goes down it'll be rescheduled
It may show as disconnected from rancher for a few seconds
But check its logs for more info
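For example (deployment and label names taken from a standard agent install, cattle-system namespace assumed):
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=200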
s

silly-solstice-24970

11/29/2022, 5:05 PM
but we have 2 cluster agents:
└> kg pod -n cattle-system
NAME                                    READY   STATUS    RESTARTS         AGE
cattle-cluster-agent-776d9c5484-6z5tk   1/1     Running   153 (4d5h ago)   88d
cattle-cluster-agent-776d9c5484-whqln   1/1     Running   147 (4d4h ago)   88d
a

agreeable-oil-87482

11/29/2022, 5:07 PM
Did someone scale up the deployment or something?
s

silly-solstice-24970

11/29/2022, 5:07 PM
let me check
it's set to replica 2 in the deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "7"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"management.cattle.io/scale-available":"2"},"name":"cattle-cluster-agent","namespace":"cattle-system"},"spec":{"selector":{"matchLabels":{"app":"cattle-cluster-agent"}},"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"},"template":{"metadata":{"labels":{"app":"cattle-cluster-agent"}},"spec":{"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/controlplane","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/master","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"cattle.io/cluster-agent","operator":"In","values":["true"]}]},"weight":1}],"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"beta.kubernetes.io/os","operator":"NotIn","values":["windows"]}]}]}},"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["cattle-cluster-agent"]}]},"topologyKey":"kubernetes.io/hostname"},"weight":100}]}},"containers":[{"env":[{"name":"CATTLE_FEATURES","value":"embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false"},{"name":"CATTLE_IS_RKE","value":"false"},{"name":"CATTLE_SERVER","value":"https://REDACTED"},{"name":"CATTLE_CA_CHECKSUM","value":""},{"name":"CATTLE_CLUSTER","value":"true"},{"name":"CATTLE_K8S_MANAGED","value":"true"},{"name":"CATTLE_CLUSTER_REGISTRY","value":""},{"name":"CATTLE_SERVER_VERSION","value":"v2.6.6"},{"name":"CATTLE_INSTALL_UUID","value":"8418bc7f-8261-4caf-bac1-8c5498e6e22a"},{"name":"CATTLE_INGRESS_IP_DOMAIN","value":"sslip.io"}],"image":"rancher/rancher-agent:v2.6.6","imagePullPolicy":"IfNotPresent","name":"cluster-register","volumeMounts":[{"mountPath":"/cattle-credentials","name":"cattle-credentials","readOnly":true}]}],"serviceAccountName":"cattle","tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/controlplane","value":"true"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}],"volumes":[{"name":"cattle-credentials","secret":{"defaultMode":320,"secretName":"cattle-credentials-7781fce"}}]}}}}
    management.cattle.io/scale-available: "2"
  creationTimestamp: "2021-12-01T14:18:44Z"
  generation: 8
  name: cattle-cluster-agent
  namespace: cattle-system
  resourceVersion: "221850422"
  uid: 0096bb08-6124-4eea-92f4-ba59964dea7b
spec:
  progressDeadlineSeconds: 600
  replicas: 2
.....
a bit cleaner…
└> kubectl describe deployments.apps -n cattle-system cattle-cluster-agent
Name:                   cattle-cluster-agent
Namespace:              cattle-system
CreationTimestamp:      Wed, 01 Dec 2021 11:18:44 -0300
Labels:                 <none>
Annotations:            deployment.kubernetes.io/revision: 7
                        management.cattle.io/scale-available: 2
Selector:               app=cattle-cluster-agent
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
Pod Template:
  Labels:           app=cattle-cluster-agent
  Service Account:  cattle
  Containers:
   cluster-register:
    Image:      rancher/rancher-agent:v2.6.6
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             REDACTED
      CATTLE_CA_CHECKSUM:        
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:   
      CATTLE_SERVER_VERSION:     v2.6.6
      CATTLE_INSTALL_UUID:       8418bc7f-8261-4caf-bac1-8c5498e6e22a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
  Volumes:
   cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  REDACTED
    Optional:    false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   cattle-cluster-agent-776d9c5484 (2/2 replicas created)
Events:          <none>
a

agreeable-oil-87482

11/29/2022, 5:43 PM
Actually, my bad, we do run a 2-replica deployment of the cluster agent in recent versions
I'd still check the logs though
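Given the restart counts in the output above, the previous container's logs and recent events are probably the most useful places to look, e.g. (pod names taken from your own output):
kubectl -n cattle-system logs cattle-cluster-agent-776d9c5484-whqln --previous
kubectl -n cattle-system get events --sort-by=.lastTimestamp | grep -i cluster-agent
# note: events are only retained for a short window, so this mainly helps with recent restarts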
s

silly-solstice-24970

11/29/2022, 5:44 PM
but which logs? The M.2 NVMe unit died….
I’m trying to figure out which was the point of failure
a

agreeable-oil-87482

11/29/2022, 5:45 PM
And this is still purely with not being able to access the Rancher UI?
Or was there a downstream cluster connection issue too?
s

silly-solstice-24970

11/29/2022, 5:46 PM
yes, in the kube_config file I just manually changed the proxy endpoint and it worked
but users are using Rancher UI for accessing the clusters
a

agreeable-oil-87482

11/29/2022, 5:48 PM
A kubeconfig for a downstream cluster?
s

silly-solstice-24970

11/29/2022, 5:50 PM
After rke creates the cluster, it generates a kubeconfig file. In that file, you can manually edit the endpoint under 'server' (any server in the cluster will be listening on 6443)
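A sketch of that edit using kubectl itself rather than hand-editing the YAML (the file name kube_config_cluster.yml and the cluster name inside it are the RKE defaults here and may differ):
kubectl --kubeconfig kube_config_cluster.yml config get-clusters
kubectl --kubeconfig kube_config_cluster.yml config set-cluster <cluster-name> --server=https://<healthy-controlplane-node>:6443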
a

agreeable-oil-87482

11/29/2022, 5:50 PM
We have a 12 nodes (bare-metal) cluster using rke2. During the weekend, we lost 1 node and lost the connection with Rancher UI. I was able to find, that from the cluster end, there are 2 hooks into rancher, but I haven’t been able to find hooks on rancher. - Rancher UI runs in a 3 node (VMs) cluster.
I don't quite follow this, you have a 12 node bare metal cluster and a 3 node vm cluster. The three node VM cluster runs Rancher and you're saying when one of the bare metal nodes went down you couldn't access Rancher?
s

silly-solstice-24970

11/29/2022, 5:51 PM
no, we couldn’t access the cluster through Rancher
a

agreeable-oil-87482

11/29/2022, 5:51 PM
The failed bare metal node won't have caused you issues with the Rancher management cluster, or the Rancher UI
You couldn't access the cluster, or you couldn't access the Rancher UI?
s

silly-solstice-24970

11/29/2022, 5:52 PM
the UI was completely functional
a

agreeable-oil-87482

11/29/2022, 5:52 PM
During the weekend, we lost 1 node and lost the connection with Rancher UI.
s

silly-solstice-24970

11/29/2022, 5:53 PM
users access our Kubernetes clusters through the Rancher web interface
a

agreeable-oil-87482

11/29/2022, 5:54 PM
So the users could log into the Rancher UI but they couldn't "explore" their clusters?
i.e. clicking the explore button in the cluster list
s

silly-solstice-24970

11/29/2022, 5:54 PM
ok, I can see why that is confusing. We lost the connection between rancher UI and the cluster
just the one cluster, the one in which one node failed
the error was "Cluster agent is not connected"
a

agreeable-oil-87482

11/29/2022, 5:56 PM
Ok, that makes sense. That functionality depends on the cluster agent being available in the downstream cluster. When your node failed, if it was running one of the cluster agent pods, that pod should have been rescheduled like any other K8s pod in that cluster and re-established the connection
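One caveat worth sketching here (default Kubernetes behaviour assumed): when a node dies abruptly, its pods sit in an Unknown/Terminating state until the node is marked NotReady and the eviction timeout (roughly 5 minutes by default) expires, so the replacement agent pod is not instant. Something like this shows where things stand:
kubectl get nodes                                   # the dead node should show NotReady
kubectl -n cattle-system get pods -o wide           # which nodes the agent replicas are (re)scheduled on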
s

silly-solstice-24970

11/29/2022, 5:56 PM
that’s the behavior we expect
and now, if we kill the pod, it gets rescheduled
a

agreeable-oil-87482

11/29/2022, 5:57 PM
Is it currently still stating the agent is not connected?
s

silly-solstice-24970

11/29/2022, 5:57 PM
after cleaning the Terminating pods, it was rescheduled
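For the record, pods stuck in Terminating on a node that is physically gone won't clear on their own; the usual cleanup (only safe once you're sure the machine is really down) is a force delete, roughly:
kubectl -n cattle-system delete pod <stuck-pod-name> --grace-period=0 --force
kubectl delete node <dead-node-name>    # removing the Node object also lets controllers garbage-collect its pods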
a

agreeable-oil-87482

11/29/2022, 5:58 PM
And once rescheduled Rancher reported it as connected?
s

silly-solstice-24970

11/29/2022, 5:58 PM
yeap
a

agreeable-oil-87482

11/29/2022, 5:59 PM
How many control plane and etcd nodes in this cluster?
s

silly-solstice-24970

11/29/2022, 6:00 PM
All of them have the roles controlplane,etcd,worker
a

agreeable-oil-87482

11/29/2022, 6:00 PM
All 12 nodes?
s

silly-solstice-24970

11/29/2022, 6:00 PM
yes
we have 1 node down and 2 that are scheduled to be added:
kubectl get nodes | grep "controlplane,etcd,worker" | wc -l
9
bottom line, at the time of failure there were 10 servers
a

agreeable-oil-87482

11/29/2022, 6:04 PM
That's a lot of replication between etcd nodes. Sounds like there may have been a lack of agreement on the state of some pods when it happened. I'd dig out the etcd logs for around that time
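On RKE1 nodes etcd runs as a plain Docker container, so a rough way to pull those logs for the failure window would be something like this (timestamp is an example; adjust to when the node went down):
docker logs etcd --since "2022-11-25T11:00:00" 2>&1 | grep -iE "leader|election|timed out|slow"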
s

silly-solstice-24970

11/29/2022, 6:06 PM
I agree (not my call). So you're saying that a failure to sync etcd between the nodes might have caused the failure to respawn the cluster-agent?
a

agreeable-oil-87482

11/29/2022, 6:08 PM
It almost sounds like there weren't any functioning cluster agent pods when the issue occurred. I can't say for sure, but I'd check that, and the scheduler logs too, to see
Do you happen to recall if there were any cluster agent pods running when that node went down?
s

silly-solstice-24970

11/29/2022, 6:08 PM
yes, they definitely were
2 actually
a

agreeable-oil-87482

11/29/2022, 6:09 PM
Hmmm
s

silly-solstice-24970

11/29/2022, 6:09 PM
one of them was running on the failed node, and the other on a completely functional one
a

agreeable-oil-87482

11/29/2022, 6:09 PM
Do you have the logs of the functional one?
s

silly-solstice-24970

11/29/2022, 6:10 PM
I should, give me a sec
these logs started about 12 hours earlier:
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:24.476289    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:24.476320217Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:37.476090    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:37.476131071Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:52.476107    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:52.476147078Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:03.477807    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:03.477859351Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:16.488822    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:16.488849233Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:31.475653    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:31.475695139Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:42.476356    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:42.476391026Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:53.477920    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:53.47796685Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:19:06.475478    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:19:06.475507785Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:19:19.476385    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:19:19.476424614Z"}
At the time of the failure, we got these for the service:
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:49\",\"event\":\"startUpdate\",\"msg\":\"start of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.633976521Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63399913Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"service.go:33\",\"event\":\"clearAssignment\",\"msg\":\"not a LoadBalancer\",\"reason\":\"notLoadBalancer\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.633985011Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63401395Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:75\",\"event\":\"noChange\",\"msg\":\"service converged, no change\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.63400606Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63403026Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:76\",\"event\":\"endUpdate\",\"msg\":\"end of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.63401418Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63404437Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:49\",\"event\":\"startUpdate\",\"msg\":\"start of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.639997663Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640017403Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"service.go:33\",\"event\":\"clearAssignment\",\"msg\":\"not a LoadBalancer\",\"reason\":\"notLoadBalancer\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.640005163Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640028863Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:75\",\"event\":\"noChange\",\"msg\":\"service converged, no change\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.640027133Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640052362Z"}
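Those kubelet entries suggest the other replica (whqln) had already been crash-looping for hours before the node failure, which would explain why no agent could take over. A simple way to spot that condition ahead of time (plain kubectl, no extra tooling assumed):
kubectl -n cattle-system get pods -l app=cattle-cluster-agent    # a climbing RESTARTS column is the early warning
kubectl -n cattle-system logs <crash-looping-pod> --previous | tail -50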
c

creamy-pencil-82913

11/29/2022, 6:27 PM
12 etcd nodes is way, WAY too many
You should not have all the roles on all 12 servers, that is a recipe for disaster
s

silly-solstice-24970

11/29/2022, 6:28 PM
no arguments there….. I'll forward the suggestion
c

creamy-pencil-82913

11/29/2022, 6:29 PM
Also, you should always have an odd number of etcd servers
note that the etcd docs don't even cover more than 9 members. Older versions of the docs used to explicitly say that the recommended etcd cluster size is 3, 5 or 7; I personally don't see the case for having more than 3 servers; the rest should all be agents.
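For comparison, a hypothetical RKE cluster.yml layout along those lines (addresses and user made up) would dedicate three nodes to controlplane/etcd and leave the rest as workers, with role changes applied via rke up:
nodes:
  - address: 10.0.0.1
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.2
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.3
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.4
    user: rancher
    role: [worker]
  # ...remaining nodes as worker only; removing the etcd role from existing members should be done gradually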
s

silly-solstice-24970

11/29/2022, 6:30 PM
it was configured that way
I believe it can be remediated through the rke file though, right?
but, besides the recommendation, do you believe it might be related? Can a corruption in etcd prevent the cluster agent from respawning?
c

creamy-pencil-82913

11/29/2022, 6:35 PM
I thought you were using RKE2, not RKE?
s

silly-solstice-24970

11/29/2022, 6:37 PM
I couldn't edit the message…. we are using rke version v1.3.12
c

creamy-pencil-82913

11/29/2022, 6:37 PM
The agent is a pod like any other; it should eventually be rescheduled onto another node if the node it's running on goes down and becomes NotReady.
s

silly-solstice-24970

11/29/2022, 6:37 PM
and Rancher UI at 2.6.6
c

creamy-pencil-82913

11/29/2022, 6:37 PM
Ah ok yeah then.
But if you have too many etcd members and it’s affecting the performance of the cluster, that could make the pod take longer to be rescheduled to another node.
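If you want to check how etcd is coping, RKE1's etcd container ships etcdctl with its environment preconfigured, so something like this (container name etcd is the RKE default) gives a quick health and latency view:
docker exec etcd etcdctl endpoint health --cluster
docker exec etcd etcdctl endpoint status --cluster -w table    # DB size, leader and raft term per member
# if the certificates aren't picked up automatically you may need to pass --cacert/--cert/--key explicitly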
s

silly-solstice-24970

11/29/2022, 6:38 PM
but then why didn't the second replica kick in?
close to the event, I found this log:
{"log":"time=\"2022-11-25T12:21:18Z\" level=error msg=\"error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/ranc
her-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 40d20d4f3eaafabad953bd8d150ebcf7c1ecc3cb error: exit status 128, detail: fatal: una
ble to access '<https://git.rancher.io/partner-charts/>': Could not resolve host: <http://git.rancher.io|git.rancher.io>\\n, requeuing\"\n","stream":"stdout","time":"2022-11-25T12:21:18.469552085Z"}
{"log":"time=\"2022-11-25T12:21:28Z\" level=error msg=\"error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/ranc
her-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 40d20d4f3eaafabad953bd8d150ebcf7c1ecc3cb error: exit status 128, detail: fatal: una
ble to access '<https://git.rancher.io/partner-charts/>': Could not resolve host: <http://git.rancher.io|git.rancher.io>\\n, requeuing\"\n","stream":"stdout","time":"2022-11-25T12:21:28.499309911Z"}
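That last error is a DNS failure from the Rancher server pod itself (it could not resolve git.rancher.io), which is separate from the agent issue but worth ruling out. A throwaway pod is an assumption-light way to test cluster DNS on whichever cluster produced that log:
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup git.rancher.io
kubectl -n kube-system get pods -l k8s-app=kube-dns    # CoreDNS pods; check whether any were on the failed node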