s

silly-solstice-24970

11/29/2022, 3:58 PM
Hi all, We have a 12-node (bare-metal) cluster using rke2. During the weekend, we lost 1 node and lost the connection with Rancher UI. I was able to find that, from the cluster end, there are 2 hooks into Rancher, but I haven't been able to find hooks on Rancher's side. - Rancher UI runs in a 3-node (VMs) cluster. After discarding everything I could think of, my theory is that from Rancher there was only 1 hook/connection point. Is that the case? If so, where can I find it? (what sort of resource is it and in which ns does it run). If not, any theory of why 1 node going down would disconnect the cluster from Rancher UI? Thanks!
a

agreeable-actor-89488

11/29/2022, 4:19 PM
I cannot answer this question unfortunately, since we are only using RKE1 and haven't had an issue where 1 node going down caused the cluster to be disconnected from the UI; the only thing it did was make the cluster show as unhealthy in the Rancher UI until we fixed the issue with the one node. However, we are interested in RKE2 as well, but the documentation seems a bit confusing. We have only used RKE1. It seems like the install is different for Rancher Manager using RKE2: based on some documentation we found, we must install the RKE2 server first and then install Rancher Manager on top of that, not just create a cluster with RKE2. Is that correct? We can't just use the basic Rancher install to use RKE2?
a

agreeable-oil-87482

11/29/2022, 4:31 PM
How are you exposing the rancher service? I assume ingress; therefore, how is your ingress controller service configured? You should be able to reach the rancher UI by hitting any of the rancher pods, assuming it's exposed correctly
@agreeable-actor-89488 - the install process for rancher is the same on rke1 or rke2. They're just k8s distributions
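For reference, a rough way to trace that exposure chain, assuming a standard Helm install of Rancher in the cattle-system namespace and an nginx ingress controller (object names and namespaces may differ in your setup):
kubectl -n cattle-system get ingress rancher        # hostname and the backing service
kubectl -n cattle-system get svc rancher            # ClusterIP service behind the ingress
kubectl get svc -A | grep -i ingress                # how the ingress controller itself is exposed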
s

silly-solstice-24970

11/29/2022, 4:46 PM
my apologies, I meant to say Rancher UI is 2.6.6 and rke is at v1.3.12
@agreeable-oil-87482 that's my question, because I know the exposure point is an ingress (via nginx-ingress-controller), but I don't see how, or which node of the cluster, it is using as a proxy
a

agreeable-oil-87482

11/29/2022, 4:50 PM
How are you exposing the ingress controller service? i.e. NodePort or LoadBalancer, etc.
s

silly-solstice-24970

11/29/2022, 4:52 PM
we are using metallb, and the ingress-controller service is configured as LoadBalancer
it’s pretty much the standard rke installation
what I don’t know is how Rancher communicates to the cluster
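As a minimal sketch of how to confirm that setup, assuming MetalLB in its default metallb-system namespace and the ingress controller in ingress-nginx (both assumptions; adjust to your install):
kubectl -n ingress-nginx get svc                    # TYPE should be LoadBalancer, EXTERNAL-IP the MetalLB VIP
kubectl -n metallb-system get pods -o wide          # controller pod plus one speaker pod per node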
a

agreeable-oil-87482

11/29/2022, 4:53 PM
And your rancher hostname resolves to the VIP of your ingress load balancer service?
s

silly-solstice-24970

11/29/2022, 4:53 PM
yes
a

agreeable-oil-87482

11/29/2022, 4:54 PM
So I would check the metallb logs. If a node went down the VIP should have moved over to another node assuming you're using metallb in layer 2 mode.
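If it helps, something like the following should show whether the VIP announcement moved to a surviving node around the failure (label and namespace assumed from a default MetalLB manifest install):
kubectl -n metallb-system logs -l component=speaker --tail=500 | grep -i announc
# in layer 2 mode the speaker on the node that owns the VIP logs the announcement events,
# so you can see whether ownership moved when the node went down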
s

silly-solstice-24970

11/29/2022, 4:55 PM
rancher has several other (4) clusters, none of them failed
a

agreeable-oil-87482

11/29/2022, 4:55 PM
Downstream clusters communicate with Rancher by establishing a websocket connection to the rancher URL address
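A quick way to sanity-check that path from the downstream side (hostname here is whatever your agents have as CATTLE_SERVER; /ping is Rancher's health-check endpoint):
curl -sk https://<rancher-hostname>/ping            # should return "pong" if the Rancher URL is reachable
nslookup <rancher-hostname>                         # confirms the hostname still resolves to the ingress VIP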
s

silly-solstice-24970

11/29/2022, 4:55 PM
so it’s not bidirectional?
a

agreeable-oil-87482

11/29/2022, 4:55 PM
There's an agent pod that manifests in the downstream clusters that facilitates this
s

silly-solstice-24970

11/29/2022, 4:56 PM
Yes, the agent running on the cluster was the one I was able to see
my theory (which now I realize doesn't apply) was that there was an agent on rancher's end as well
a

agreeable-oil-87482

11/29/2022, 4:57 PM
It's initiated by the agent pod, but once established it's bidirectional
s

silly-solstice-24970

11/29/2022, 4:58 PM
the weird part is that I can see there are 2 agents on the cluster side, but only one failed and communication was lost 😕
a

agreeable-oil-87482

11/29/2022, 4:58 PM
There's both node agents and cluster agents
You should only have one cluster agent pod per cluster
s

silly-solstice-24970

11/29/2022, 4:59 PM
yes, you are correct, there's only one fleet-agent
we have 2 cluster agents
so, theoretically, if the node that is running fleet-agent goes down (out of the blue), the connection can be lost?
a

agreeable-oil-87482

11/29/2022, 5:04 PM
Ignore the fleet agent, the cluster agent is the focus for your issue
If the cluster agent pod goes down it'll be rescheduled
It may show as disconnected from rancher for a few seconds
But check its logs for more info
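For example (deployment and label names taken from a standard agent install, cattle-system namespace assumed):
kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=200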
s

silly-solstice-24970

11/29/2022, 5:05 PM
but we have 2 cluster agents:
└> kg pod -n cattle-system
NAME                                    READY   STATUS    RESTARTS         AGE
cattle-cluster-agent-776d9c5484-6z5tk   1/1     Running   153 (4d5h ago)   88d
cattle-cluster-agent-776d9c5484-whqln   1/1     Running   147 (4d4h ago)   88d
a

agreeable-oil-87482

11/29/2022, 5:07 PM
Did someone scale up the deployment or something?
s

silly-solstice-24970

11/29/2022, 5:07 PM
let me check
it's set to replica 2 in the deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "7"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"management.cattle.io/scale-available":"2"},"name":"cattle-cluster-agent","namespace":"cattle-system"},"spec":{"selector":{"matchLabels":{"app":"cattle-cluster-agent"}},"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"},"template":{"metadata":{"labels":{"app":"cattle-cluster-agent"}},"spec":{"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/controlplane","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"node-role.kubernetes.io/master","operator":"In","values":["true"]}]},"weight":100},{"preference":{"matchExpressions":[{"key":"cattle.io/cluster-agent","operator":"In","values":["true"]}]},"weight":1}],"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"beta.kubernetes.io/os","operator":"NotIn","values":["windows"]}]}]}},"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app","operator":"In","values":["cattle-cluster-agent"]}]},"topologyKey":"kubernetes.io/hostname"},"weight":100}]}},"containers":[{"env":[{"name":"CATTLE_FEATURES","value":"embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false"},{"name":"CATTLE_IS_RKE","value":"false"},{"name":"CATTLE_SERVER","value":"https://REDACTED"},{"name":"CATTLE_CA_CHECKSUM","value":""},{"name":"CATTLE_CLUSTER","value":"true"},{"name":"CATTLE_K8S_MANAGED","value":"true"},{"name":"CATTLE_CLUSTER_REGISTRY","value":""},{"name":"CATTLE_SERVER_VERSION","value":"v2.6.6"},{"name":"CATTLE_INSTALL_UUID","value":"8418bc7f-8261-4caf-bac1-8c5498e6e22a"},{"name":"CATTLE_INGRESS_IP_DOMAIN","value":"sslip.io"}],"image":"rancher/rancher-agent:v2.6.6","imagePullPolicy":"IfNotPresent","name":"cluster-register","volumeMounts":[{"mountPath":"/cattle-credentials","name":"cattle-credentials","readOnly":true}]}],"serviceAccountName":"cattle","tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/controlplane","value":"true"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}],"volumes":[{"name":"cattle-credentials","secret":{"defaultMode":320,"secretName":"cattle-credentials-7781fce"}}]}}}}
    management.cattle.io/scale-available: "2"
  creationTimestamp: "2021-12-01T14:18:44Z"
  generation: 8
  name: cattle-cluster-agent
  namespace: cattle-system
  resourceVersion: "221850422"
  uid: 0096bb08-6124-4eea-92f4-ba59964dea7b
spec:
  progressDeadlineSeconds: 600
  replicas: 2
.....
a bit cleaner…
└> kubectl describe deployments.apps -n cattle-system cattle-cluster-agent
Name:                   cattle-cluster-agent
Namespace:              cattle-system
CreationTimestamp:      Wed, 01 Dec 2021 11:18:44 -0300
Labels:                 <none>
Annotations:            deployment.kubernetes.io/revision: 7
                        management.cattle.io/scale-available: 2
Selector:               app=cattle-cluster-agent
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
Pod Template:
  Labels:           app=cattle-cluster-agent
  Service Account:  cattle
  Containers:
   cluster-register:
    Image:      rancher/rancher-agent:v2.6.6
    Port:       <none>
    Host Port:  <none>
    Environment:
      CATTLE_FEATURES:           embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
      CATTLE_IS_RKE:             false
      CATTLE_SERVER:             REDACTED
      CATTLE_CA_CHECKSUM:        
      CATTLE_CLUSTER:            true
      CATTLE_K8S_MANAGED:        true
      CATTLE_CLUSTER_REGISTRY:   
      CATTLE_SERVER_VERSION:     v2.6.6
      CATTLE_INSTALL_UUID:       8418bc7f-8261-4caf-bac1-8c5498e6e22a
      CATTLE_INGRESS_IP_DOMAIN:  sslip.io
    Mounts:
      /cattle-credentials from cattle-credentials (ro)
  Volumes:
   cattle-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  REDACTED
    Optional:    false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   cattle-cluster-agent-776d9c5484 (2/2 replicas created)
Events:          <none>
a

agreeable-oil-87482

11/29/2022, 5:43 PM
Actually, my bad, we do run a 2-replica deployment of the cluster agent in recent versions
I'd still check the logs though
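Given the restart counts in the output above, the previous container's logs and recent events are probably the most useful places to look, e.g. (pod names taken from your own output):
kubectl -n cattle-system logs cattle-cluster-agent-776d9c5484-whqln --previous
kubectl -n cattle-system get events --sort-by=.lastTimestamp | grep -i cluster-agent
# note: events are only retained for a short window, so this mainly helps with recent restarts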
s

silly-solstice-24970

11/29/2022, 5:44 PM
but which logs? The M.2 NVMe unit died….
I’m trying to figure out which was the point of failure
a

agreeable-oil-87482

11/29/2022, 5:45 PM
And this is still purely with not being able to access the Rancher UI?
Or was there a downstream cluster connection issue too?
s

silly-solstice-24970

11/29/2022, 5:46 PM
yes, in the kube_config file I just manually changed the proxy endpoint and it worked
but users are using Rancher UI for accessing the clusters
a

agreeable-oil-87482

11/29/2022, 5:48 PM
A kubeconfig for a downstream cluster?
s

silly-solstice-24970

11/29/2022, 5:50 PM
After rke creates the cluster, it generates a kubeconfig file. In that file, you can manually edit the endpoint under 'server' (any server in the cluster will be listening on 6443)
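A sketch of that edit using kubectl itself rather than hand-editing the YAML (the file name kube_config_cluster.yml and the cluster name inside it are the RKE defaults here and may differ):
kubectl --kubeconfig kube_config_cluster.yml config get-clusters
kubectl --kubeconfig kube_config_cluster.yml config set-cluster <cluster-name> --server=https://<healthy-controlplane-node>:6443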
a

agreeable-oil-87482

11/29/2022, 5:50 PM
We have a 12 nodes (bare-metal) cluster using rke2. During the weekend, we lost 1 node and lost the connection with Rancher UI. I was able to find, that from the cluster end, there are 2 hooks into rancher, but I haven’t been able to find hooks on rancher. - Rancher UI runs in a 3 node (VMs) cluster.
I don't quite follow this, you have a 12 node bare metal cluster and a 3 node vm cluster. The three node VM cluster runs Rancher and you're saying when one of the bare metal nodes went down you couldn't access Rancher?
s

silly-solstice-24970

11/29/2022, 5:51 PM
no, we couldn’t access the cluster through Rancher
a

agreeable-oil-87482

11/29/2022, 5:51 PM
The failed bare metal node won't have caused you issues with the Rancher management cluster, or the Rancher UI
You couldn't access the cluster, or you couldn't access the Rancher UI?
s

silly-solstice-24970

11/29/2022, 5:52 PM
the UI was completely functional
a

agreeable-oil-87482

11/29/2022, 5:52 PM
During the weekend, we lost 1 node and lost the connection with Rancher UI.
s

silly-solstice-24970

11/29/2022, 5:53 PM
users access our Kubernetes clusters through the Rancher web interface
a

agreeable-oil-87482

11/29/2022, 5:54 PM
So the users could log into the Rancher UI but they couldn't "explore" their clusters?
i.e. clicking the explore button in the cluster list
s

silly-solstice-24970

11/29/2022, 5:54 PM
ok, I can see why that is confusing. We lost the connection between rancher UI and the cluster
just the one cluster, the one in which one node failed
the error was "Cluster agent is not connected"
a

agreeable-oil-87482

11/29/2022, 5:56 PM
Ok, that makes sense. That functionality depends on the cluster agent being available in the downstream cluster. When your node failed, if it was running one of the cluster agent pods, that pod should have been rescheduled like any other K8s pod in that cluster and re-established the connection
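One caveat worth sketching here (default Kubernetes behaviour assumed): when a node dies abruptly, its pods sit in an Unknown/Terminating state until the node is marked NotReady and the eviction timeout (roughly 5 minutes by default) expires, so the replacement agent pod is not instant. Something like this shows where things stand:
kubectl get nodes                                   # the dead node should show NotReady
kubectl -n cattle-system get pods -o wide           # which nodes the agent replicas are (re)scheduled on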
s

silly-solstice-24970

11/29/2022, 5:56 PM
that’s the behavior we expect
and now, if we kill the pod, it gets rescheduled
a

agreeable-oil-87482

11/29/2022, 5:57 PM
Is it currently still stating the agent is not connected?
s

silly-solstice-24970

11/29/2022, 5:57 PM
after cleaning the Terminating pods, it was rescheduled
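For the record, pods stuck in Terminating on a node that is physically gone won't clear on their own; the usual cleanup (only safe once you're sure the machine is really down) is a force delete, roughly:
kubectl -n cattle-system delete pod <stuck-pod-name> --grace-period=0 --force
kubectl delete node <dead-node-name>    # removing the Node object also lets controllers garbage-collect its pods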
a

agreeable-oil-87482

11/29/2022, 5:58 PM
And once rescheduled Rancher reported it as connected?
s

silly-solstice-24970

11/29/2022, 5:58 PM
yeap
a

agreeable-oil-87482

11/29/2022, 5:59 PM
How many control plane and etcd nodes in this cluster?
s

silly-solstice-24970

11/29/2022, 6:00 PM
All of them have the roles controlplane,etcd,worker
a

agreeable-oil-87482

11/29/2022, 6:00 PM
All 12 nodes?
s

silly-solstice-24970

11/29/2022, 6:00 PM
yes
we have 1 node down and 2 that are scheduled to be added:
kubectl get nodes | grep "controlplane,etcd,worker" | wc -l
9
bottom line, at the time of failure there were 10 servers
a

agreeable-oil-87482

11/29/2022, 6:04 PM
That's a lot of replication between etcd nodes. Sounds like there may have been a lack of agreement on the state of some pods when it happened. I'd dig out the etcd logs for around that time
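On RKE1 nodes etcd runs as a plain Docker container, so a rough way to pull those logs for the failure window would be something like this (timestamp is an example; adjust to when the node went down):
docker logs etcd --since "2022-11-25T11:00:00" 2>&1 | grep -iE "leader|election|timed out|slow"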
s

silly-solstice-24970

11/29/2022, 6:06 PM
I agree (not my call). So you're saying that a failure to sync etcd between the nodes might have caused the failure to respawn the cluster-agent?
a

agreeable-oil-87482

11/29/2022, 6:08 PM
It almost sounds like there weren't any functioning cluster agent pods when the issue occurred. I can't say for sure, but I'd check that, and the scheduler logs too, to see
Do you happen to recall if there were any cluster agent pods running when that node went down?
s

silly-solstice-24970

11/29/2022, 6:08 PM
yes, they definitely were
2 actually
a

agreeable-oil-87482

11/29/2022, 6:09 PM
Hmmm
s

silly-solstice-24970

11/29/2022, 6:09 PM
one of them was running on the failed node, and the other on a completely functional one
a

agreeable-oil-87482

11/29/2022, 6:09 PM
Do you have the logs of the functional one?
s

silly-solstice-24970

11/29/2022, 6:10 PM
I should, give me a sec
these logs started about 12 hours earlier:
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:24.476289    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:24.476320217Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:37.476090    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:37.476131071Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:17:52.476107    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:17:52.476147078Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:03.477807    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:03.477859351Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:16.488822    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:16.488849233Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:31.475653    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:31.475695139Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:42.476356    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:42.476391026Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:18:53.477920    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:18:53.47796685Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:19:06.475478    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:19:06.475507785Z"}
5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0/5b9405eb75a062d8ab82b9ceb5c0cadc13904ddd9672224bb1e143feae867bc0-json.log:{"log":"E1125 12:19:19.476385    7819 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"cluster-register\\\" with CrashLoopBackOff: \\\"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-776d9c5484-whqln_cattle-system(f08d26d4-d212-4200-a200-7de428e83db7)\\\"\" pod=\"cattle-system/cattle-cluster-agent-776d9c5484-whqln\" podUID=f08d26d4-d212-4200-a200-7de428e83db7\n","stream":"stderr","time":"2022-11-25T12:19:19.476424614Z"}
At the time of the failure, we got these for the service:
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:49\",\"event\":\"startUpdate\",\"msg\":\"start of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.633976521Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63399913Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"service.go:33\",\"event\":\"clearAssignment\",\"msg\":\"not a LoadBalancer\",\"reason\":\"notLoadBalancer\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.633985011Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63401395Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:75\",\"event\":\"noChange\",\"msg\":\"service converged, no change\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.63400606Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63403026Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:76\",\"event\":\"endUpdate\",\"msg\":\"end of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:27:32.63401418Z\"}\n","stream":"stdout","time":"2022-11-03T15:27:32.63404437Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:49\",\"event\":\"startUpdate\",\"msg\":\"start of service update\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.639997663Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640017403Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"service.go:33\",\"event\":\"clearAssignment\",\"msg\":\"not a LoadBalancer\",\"reason\":\"notLoadBalancer\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.640005163Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640028863Z"}
8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3/8531818b4ba4c9f483e32a16ed38d6c6cfcec194fd0fcc6870f086323b7f47b3-json.log:{"log":"{\"caller\":\"main.go:75\",\"event\":\"noChange\",\"msg\":\"service converged, no change\",\"service\":\"cattle-system/cattle-cluster-agent\",\"ts\":\"2022-11-03T15:35:56.640027133Z\"}\n","stream":"stdout","time":"2022-11-03T15:35:56.640052362Z"}
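Those kubelet entries suggest the other replica (whqln) had already been crash-looping for hours before the node failure, which would explain why no agent could take over. A simple way to spot that condition ahead of time (plain kubectl, no extra tooling assumed):
kubectl -n cattle-system get pods -l app=cattle-cluster-agent    # a climbing RESTARTS column is the early warning
kubectl -n cattle-system logs <crash-looping-pod> --previous | tail -50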
c

creamy-pencil-82913

11/29/2022, 6:27 PM
12 etcd nodes is way, WAY too many
You should not have all the roles on all 12 servers, that is a recipe for disaster
s

silly-solstice-24970

11/29/2022, 6:28 PM
no arguments there….. I'll forward the suggestion
c

creamy-pencil-82913

11/29/2022, 6:29 PM
Also, you should always have an odd number of etcd servers
note that the etcd docs don't even cover more than 9 members. Older versions of the docs used to explicitly say that the recommended etcd cluster size is 3, 5 or 7; I personally don't see the case for having more than 3 servers; the rest should all be agents.
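For comparison, a hypothetical RKE cluster.yml layout along those lines (addresses and user made up) would dedicate three nodes to controlplane/etcd and leave the rest as workers, with role changes applied via rke up:
nodes:
  - address: 10.0.0.1
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.2
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.3
    user: rancher
    role: [controlplane, etcd]
  - address: 10.0.0.4
    user: rancher
    role: [worker]
  # ...remaining nodes as worker only; removing the etcd role from existing members should be done gradually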
s

silly-solstice-24970

11/29/2022, 6:30 PM
it was configured that way
I believe it can be remediated through the rke file though, right?
but, besides the recommendation, do you believe it might be related? Can a corruption in etcd prevent the cluster agent from respawning?
c

creamy-pencil-82913

11/29/2022, 6:35 PM
I thought you were using RKE2, not RKE?
s

silly-solstice-24970

11/29/2022, 6:37 PM
I couldn't edit the message…. we are using rke version v1.3.12
c

creamy-pencil-82913

11/29/2022, 6:37 PM
The agent is a pod like any other; it should eventually be rescheduled onto another node if the node it's running on goes down and becomes NotReady.
s

silly-solstice-24970

11/29/2022, 6:37 PM
and Rancher UI at 2.6.6
c

creamy-pencil-82913

11/29/2022, 6:37 PM
Ah ok yeah then.
But if you have too many etcd members and it’s affecting the performance of the cluster, that could make the pod take longer to be rescheduled to another node.
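If you want to check how etcd is coping, RKE1's etcd container ships etcdctl with its environment preconfigured, so something like this (container name etcd is the RKE default) gives a quick health and latency view:
docker exec etcd etcdctl endpoint health --cluster
docker exec etcd etcdctl endpoint status --cluster -w table    # DB size, leader and raft term per member
# if the certificates aren't picked up automatically you may need to pass --cacert/--cert/--key explicitly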
s

silly-solstice-24970

11/29/2022, 6:38 PM
but then why didn't the second replica kick in?
close to the event, I found this log:
{"log":"time=\"2022-11-25T12:21:18Z\" level=error msg=\"error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/ranc
her-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 40d20d4f3eaafabad953bd8d150ebcf7c1ecc3cb error: exit status 128, detail: fatal: una
ble to access '<https://git.rancher.io/partner-charts/>': Could not resolve host: <http://git.rancher.io|git.rancher.io>\\n, requeuing\"\n","stream":"stdout","time":"2022-11-25T12:21:18.469552085Z"}
{"log":"time=\"2022-11-25T12:21:28Z\" level=error msg=\"error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/ranc
her-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 40d20d4f3eaafabad953bd8d150ebcf7c1ecc3cb error: exit status 128, detail: fatal: una
ble to access '<https://git.rancher.io/partner-charts/>': Could not resolve host: <http://git.rancher.io|git.rancher.io>\\n, requeuing\"\n","stream":"stdout","time":"2022-11-25T12:21:28.499309911Z"}
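That last error is a DNS failure from the Rancher server pod itself (it could not resolve git.rancher.io), which is separate from the agent issue but worth ruling out. A throwaway pod is an assumption-light way to test cluster DNS on whichever cluster produced that log:
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup git.rancher.io
kubectl -n kube-system get pods -l k8s-app=kube-dns    # CoreDNS pods; check whether any were on the failed node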