# rke
f
I have also used the below command:
helm upgrade rancher rancher-latest/rancher --namespace cattle-system --set ingress.enabled=true --version 2.6.3 --set hostname=omkar.com
f
You are disabling ingress, so no ingress resource will be created. How are you configuring access to Rancher?
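One way to confirm how ingress is currently configured for the deployed chart (a sketch; note that helm get values only shows user-supplied overrides):
# Show the values used for the rancher release
helm get values rancher -n cattle-system
# Check whether an ingress object exists for Rancher at all
kubectl -n cattle-system get ingress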
f
Hi, I am using this command; this enables ingress, right?
f
Sorry I missed the upgrade part, you can start by following https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/rancher-ha#check-ingress and see what it reports
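A few of the checks from that page, adapted to this setup (a sketch assuming the default cattle-system namespace and the RKE nginx ingress controller):
# Does the Rancher ingress exist and does it have an address?
kubectl -n cattle-system get ingress rancher -o wide
# Are the ingress controller pods healthy, and what do they log?
kubectl -n ingress-nginx get pods -l app=ingress-nginx
kubectl -n ingress-nginx logs -l app=ingress-nginx --tail=50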
f
I will check this and let you know the result.
I am getting the below logs while accessing the Rancher UI.
[rke-admin@poclphusamaster ~]$ kubectl -n ingress-nginx logs -l app=ingress-nginx
I1006 11:38:29.158073       7 main.go:101] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I1006 11:38:29.209486       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"388086", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W1006 11:38:31.447094       7 controller.go:1076] Service "cattle-system/rancher" does not have any active Endpoint.
I1006 11:38:50.127347       7 leaderelection.go:258] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I1006 11:38:50.127471       7 status.go:84] "New leader elected" identity="nginx-ingress-controller-2hj82"
I1006 11:38:50.143367       7 status.go:300] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP:172.27.16.66 Hostname: Ports:[]}]
I1006 11:38:50.148472       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"388260", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I1006 11:38:50.199425       7 admission.go:149] processed ingress via admission controller {testedIngressLength:1 testedIngressTime:0.047s renderingIngressLength:1 renderingIngressTime:0s admissionTime:17.9kBs testedConfigurationSize:0.047}
I1006 11:38:50.199471       7 main.go:101] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I1006 11:38:50.203551       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"388267", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
f
You can use the other steps to get to the root cause;
Service "cattle-system/rancher" does not have any active Endpoint.
is probably a good lead. What EC2 instance type(s) are you using?
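One way to follow that lead (a sketch; the service name comes from the log line above) is to check whether the rancher service has any endpoints backing it and, if not, why:
# No addresses here means no Ready rancher pods match the service selector
kubectl -n cattle-system get endpoints rancher
kubectl -n cattle-system describe svc rancher
kubectl -n cattle-system get pods -l app=rancher -o wide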
f
I am not using an EC2 machine. These are servers provided by the office. We are using RHEL 8.8. Regarding "you can use the other steps to get to the root cause": what are the other steps?
f
Regarding "where omkar.com is pointing to one of the public IPs of an EC2 instance": are you using it as a load balancer or something? You need to be more specific about your setup so we can match your intentions with your setup and figure out where this is causing issues.
There are a lot of steps on https://ranchermanager.docs.rancher.com/troubleshooting/other-troubleshooting-tips/rancher-ha and the page you linked; that information is valuable for finding out why you can't access the UI. It starts with the HTTP error you get when you try to access Rancher; that error already reveals a lot about what is going on. Then you traverse the path a request takes to get to Rancher: why do you get that error? Are you hitting the ingress controller, and what does it log when it is hit? Why does it return the error? What does the Rancher ingress look like, how does the service look, and how do the pods look? Go all the way down until you hit whatever is causing the issue. If you don't know what is causing it, please supply the commands used and their outputs so others can take a look.
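A rough outline of that path as commands (a sketch; replace the hostname placeholder with your Rancher URL and add -k if the certificate is self-signed):
# 1. Entry point: which HTTP error do you actually get?
curl -kv https://<rancher-hostname>/
# 2. Ingress controller: does the request show up in its logs?
kubectl -n ingress-nginx logs -l app=ingress-nginx --tail=100
# 3. The Rancher ingress, service and pods behind it
kubectl -n cattle-system describe ingress rancher
kubectl -n cattle-system get svc,endpoints,pods -o wide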
f
Sure, I will share the output.
[rke-admin@poclphusamaster ~]$ kubectl -n ingress-nginx logs -l app=ingress-nginx
W1009 06:12:29.622770       7 controller.go:1076] Service "cattle-system/rancher" does not have any active Endpoint.
I1009 06:12:29.764479       7 admission.go:149] processed ingress via admission controller {testedIngressLength:1 testedIngressTime:0.142s renderingIngressLength:1 renderingIngressTime:0s admissionTime:17.9kBs testedConfigurationSize:0.142}
I1009 06:12:29.764581       7 main.go:101] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I1009 06:12:29.787702       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"1722126", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W1009 06:12:29.790538       7 controller.go:1076] Service "cattle-system/rancher" does not have any active Endpoint.
I1009 06:12:50.021381       7 status.go:300] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP:172.27.16.66 Hostname: Ports:[]}]
I1009 06:12:50.035999       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"1722296", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I1009 06:12:50.168049       7 admission.go:149] processed ingress via admission controller {testedIngressLength:1 testedIngressTime:0.108s renderingIngressLength:1 renderingIngressTime:0.016s admissionTime:17.9kBs testedConfigurationSize:0.124}
I1009 06:12:50.168149       7 main.go:101] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I1009 06:12:50.178851       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"2b8db858-dfe2-4f42-acac-4576ff93a400", APIVersion:"networking.k8s.io/v1", ResourceVersion:"1722303", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
To get the svc details
[rke-admin@poclphusamaster ~]$ kubectl get svc -A
NAMESPACE             NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
cattle-fleet-system   gitjob                               ClusterIP   10.43.141.158   <none>        80/TCP                   4d18h
cattle-system         rancher                              ClusterIP   10.43.53.160    <none>        80/TCP,443/TCP           4d18h
cattle-system         rancher-webhook                      ClusterIP   10.43.167.160   <none>        443/TCP                  4d18h
cattle-system         webhook-service                      ClusterIP   10.43.95.239    <none>        443/TCP                  4d18h
cert-manager          cert-manager                         ClusterIP   10.43.51.208    <none>        9402/TCP                 172m
cert-manager          cert-manager-webhook                 ClusterIP   10.43.88.69     <none>        443/TCP                  172m
default               kubernetes                           ClusterIP   10.43.0.1       <none>        443/TCP                  4d18h
ingress-nginx         ingress-nginx-controller-admission   ClusterIP   10.43.102.5     <none>        443/TCP                  4d18h
kube-system           kube-dns                             ClusterIP   10.43.0.10      <none>        53/UDP,53/TCP,9153/TCP   4d18h
kube-system           metrics-server                       ClusterIP   10.43.18.165    <none>        443/TCP                  4d18h
These are the pods:
[rke-admin@poclphusamaster ~]$ kubectl get pods -A
NAMESPACE                   NAME                                         READY   STATUS      RESTARTS         AGE
cattle-fleet-local-system   fleet-agent-78f694664b-cwvcr                 1/1     Running     8 (160m ago)     4d18h
cattle-fleet-system         fleet-controller-6666887949-xjsrf            1/1     Running     216 (160m ago)   4d18h
cattle-fleet-system         gitjob-7b97c9c7fd-hsl5z                      1/1     Running     8 (160m ago)     4d18h
cattle-system               rancher-6bcbdd6cb7-77n4w                     1/1     Running     148 (160m ago)   4d18h
cattle-system               rancher-6bcbdd6cb7-jcx5k                     1/1     Running     248 (160m ago)   4d18h
cattle-system               rancher-6bcbdd6cb7-xvvtz                     1/1     Running     149 (160m ago)   4d18h
cattle-system               rancher-webhook-5d4f5b7f6d-thvnf             1/1     Running     6 (160m ago)     4d18h
cert-manager                cert-manager-57d89b9548-m2j4c                1/1     Running     1 (160m ago)     173m
cert-manager                cert-manager-cainjector-5bcf77b697-69nhk     1/1     Running     1 (160m ago)     173m
cert-manager                cert-manager-webhook-9cb88bd6d-hr7w7         1/1     Running     1 (160m ago)     173m
ingress-nginx               nginx-ingress-controller-2hj82               1/1     Running     1 (160m ago)     2d21h
kube-system                 calico-kube-controllers-5685fbd9f7-4xx52     1/1     Running     1 (160m ago)     2d21h
kube-system                 canal-shx8c                                  2/2     Running     2 (160m ago)     2d21h
kube-system                 coredns-8578b6dbdd-f9wch                     1/1     Running     1 (160m ago)     2d21h
kube-system                 coredns-autoscaler-f7b68ccb7-sfm9s           1/1     Running     1 (160m ago)     2d21h
kube-system                 kube-vip-ds-t2ltj                            1/1     Running     11 (160m ago)    4d18h
kube-system                 metrics-server-6bc7854fb5-44m94              1/1     Running     1 (160m ago)     2d21h
kube-system                 rke-coredns-addon-deploy-job--1-zkfq7        0/1     Completed   0                4d18h
kube-system                 rke-ingress-controller-deploy-job--1-bjgvc   0/1     Completed   0                4d18h
kube-system                 rke-metrics-addon-deploy-job--1-6j72t        0/1     Completed   0                4d18h
kube-system                 rke-network-plugin-deploy-job--1-qk4j7       0/1     Completed   0                4d18h
To get the ingress
[rke-admin@poclphusamaster ~]$ kubectl -n cattle-system get ingress
NAME      CLASS   HOSTS                   ADDRESS        PORTS     AGE
rancher   nginx   hurancher.omkar.org   172.27.16.66   80, 443   4d18h
Here hurancher.omkar.org (previously given by mistake as omkar.com) is pointing to the VIP.
Hi
This is the current error I am getting in the app=rancher logs
[rke-admin@poclphusamaster ~]$ kubectl -n cattle-system logs -l app=rancher
2023/10/09 14:12:00 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:12:41 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:13:26 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:14:11 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:14:55 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:15:41 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:16:24 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:17:06 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:17:47 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 14:19:52 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/09 13:59:27 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:32 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:37 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:42 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:47 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:52 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:57 [ERROR] Failed to connect to peer wss://10.42.0.41/v3/connect [local ID=10.42.0.39]: dial tcp 10.42.0.41:443: connect: connection refused
2023/10/09 13:59:59 [INFO] Handling backend connection request [10.42.0.41]
W1009 14:05:59.523895      33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W1009 14:15:32.527095      33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023/10/09 14:17:31 [ERROR] error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/rancher-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 39cf64a120af1737af61b201312012815ce4c252 error: exit status 128, detail: error: Server does not allow request for unadvertised object 39cf64a120af1737af61b201312012815ce4c252
, requeuing
2023/10/09 14:18:05 [INFO] Stopping cluster agent for c-mtn95
2023/10/09 14:18:05 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/09 14:19:31 [ERROR] error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/rancher-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 39cf64a120af1737af61b201312012815ce4c252 error: exit status 128, detail: error: Server does not allow request for unadvertised object 39cf64a120af1737af61b201312012815ce4c252
, requeuing
2023/10/09 14:20:16 [INFO] Stopping cluster agent for c-mtn95
2023/10/09 14:20:16 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/09 14:21:32 [ERROR] error syncing 'rancher-partner-charts': handler helm-clusterrepo-ensure: git -C /var/lib/rancher-data/local-catalogs/v2/rancher-partner-charts/8f17acdce9bffd6e05a58a3798840e408c4ea71783381ecd2e9af30baad65974 fetch origin 39cf64a120af1737af61b201312012815ce4c252 error: exit status 128, detail: error: Server does not allow request for unadvertised object 39cf64a120af1737af61b201312012815ce4c252
, requeuing
I am stuck on this for a long time now. Requesting you to guide me on this.
Is it possible to get on a call and identify the root cause? Let me know if that works for you. We're willing to compensate you for your time and assistance in resolving this problem.
f
Given the amount of restarts, it looks like the machines you are using are underpowered. What are the specs of the machines that you are using (CPU/memory/disk type/disk IOPS etc)?
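A quick way to check that (a sketch; kubectl top needs metrics-server, which is present in the pod list above) could be:
# Node-level CPU/memory pressure
kubectl top nodes
# Which pods are restarting and why (look for OOMKilled or failing probes)
kubectl -n cattle-system get pods
kubectl -n cattle-system describe pod -l app=rancher | grep -A5 'Last State'
# Logs of the previous (crashed/restarted) rancher containers
kubectl -n cattle-system logs -l app=rancher --previous --tail=50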
f
The machine has 8 cores, 16 GB RAM, and a 200 GB disk.
Now it is giving the context canceled error. Is there a way to stop these clusters from getting registered?
2023/10/10 18:04:59 [ERROR] failed to start cluster controllers c-mtn95: context canceled
W1010 18:05:33.515067      33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023/10/10 18:07:18 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:07:18 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:09:10 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:09:10 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:11:14 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:11:14 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:13:27 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:13:27 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:07:32 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:07:32 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:09:33 [INFO] Stopping cluster agent for c-mtn95
2023/10/10 18:09:33 [ERROR] failed to start cluster controllers c-mtn95: context canceled
2023/10/10 18:10:01 [ERROR] error syncing 'c-mtn95': handler cluster-deploy: Get "<https://172.27.16.68:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent>": cluster agent disconnected, requeuing
2023/10/10 18:11:38 [INFO] Stopping cluster agent for c-mtn95
Hi, sorry for the confusion. I would like to give you some background on how I ended up with this issue.
1. I had set up Rancher on RKE and was able to access it via the URL, but the setup used a VIP for the domain, which is where my confusion at the start of this thread came from.
2. I was trying a scenario where I could back up the cluster and restore it from a snapshot. For that I deleted the etcd data from /var/lib/etcd, stopped Docker, and kept the cluster in that state for around a day.
3. After that I was not able to access the UI, so I cleared the cluster using the rke remove command and used the rke etcd snapshot-restore command with the snapshot I had created. All the kubectl commands that had stopped working after I deleted /var/lib/etcd run again now, but the UI is still not accessible.
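For reference, the sequence described above would look roughly like the following (a sketch; it assumes the RKE v1 CLI with cluster.yml and the snapshot available on the node, and depending on the RKE version an rke up may be needed after the restore):
# Tear down the broken cluster (destructive)
rke remove --config cluster.yml
# Restore etcd from a previously taken snapshot (snapshot name is a placeholder)
rke etcd snapshot-restore --config cluster.yml --name <snapshot-name>
# Bring the cluster back up if the restore did not already do so
rke up --config cluster.yml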
f
Please don't post every message to the channel; it kind of defeats the purpose of using threads. You are adding quite a lot of new context each time, which makes the situation quite difficult to diagnose. You won't be able to reach the UI if the rancher pods keep restarting; if you want to stabilize the environment, you should look at why that happens first (it might just be the continuous requeueing, but it should show in the CPU/memory usage on the nodes). There are a few things that come up from the details you provided:
• Did you restore the snapshot to a new cluster without Rancher installed first? Was it the same Kubernetes version you created the snapshot on? Are you installing the same Rancher version? Did you configure the same Rancher URL or a different one?
• If you go to the created cluster(s), what is the cattle-cluster-agent pod container logging? What is it pointing to (the CATTLE_SERVER environment variable)? See the sketch after this list.
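A minimal way to gather that from the downstream cluster (a sketch; it assumes the default cattle-cluster-agent deployment name and a working kubeconfig for that cluster):
# Agent logs, run against the downstream cluster
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=50
# The Rancher URL the agent was registered against
kubectl -n cattle-system get deployment cattle-cluster-agent -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CATTLE_SERVER")].value}'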
f
Hi, I have restored on the same VM where I had already set up Rancher before, but before restoring I removed anything left over using the rke remove command. Yes, the Kubernetes version used was the same. Yes, I am installing the same Rancher version. I have configured the same Rancher URL. I am not able to run any command in the downstream cluster; no kubectl commands are working there, nor can I see any pod with that name on the Rancher node.
172.27.16.66 -> I have done the Rancher setup on this server.
The below servers were used to provision a cluster from the Rancher UI while it was accessible earlier:
172.27.16.68 -> K8s master (controlplane, etcd)
172.27.16.69 -> k8s worker
These are the pods that are running on the Rancher node:
[rke-admin@poclphusamaster ~]$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-5c7775ccd-lqjv5 1/1 Running 5 (3h45m ago) 11h
cattle-fleet-system fleet-controller-65d9f467d8-jg2ph 1/1 Running 5 (3h45m ago) 11h
cattle-fleet-system gitjob-d74ff755b-7vc86 1/1 Running 5 (3h45m ago) 11h
cattle-system rancher-6bcbdd6cb7-5dm6x 1/1 Running 0 11h
cattle-system rancher-6bcbdd6cb7-7qhdg 1/1 Running 1 (7h29m ago) 11h
cattle-system rancher-6bcbdd6cb7-pplq6 1/1 Running 4 (3h45m ago) 11h
cattle-system rancher-webhook-ccf8c9784-km9cs 1/1 Running 0 11h
cert-manager cert-manager-57d89b9548-kfswf 1/1 Running 5 (3h45m ago) 7d3h
cert-manager cert-manager-cainjector-5bcf77b697-5bff9 1/1 Running 4 (7h11m ago) 7d3h
cert-manager cert-manager-webhook-9cb88bd6d-9hl7f 1/1 Running 0 7d3h
ingress-nginx nginx-ingress-controller-5r9s8 1/1 Running 0 12h
kube-system calico-kube-controllers-5685fbd9f7-n8hn5 1/1 Running 1 (7h11m ago) 12h
kube-system canal-jtz8t 2/2 Running 0 12h
kube-system coredns-8578b6dbdd-d8hqp 1/1 Running 0 12h
kube-system coredns-autoscaler-f7b68ccb7-rwp9s 1/1 Running 0 12h
kube-system kube-vip-ds-t2ltj 1/1 Running 23 (3h46m ago) 7d3h
kube-system metrics-server-6bc7854fb5-6f762 1/1 Running 0 12h
kube-system rke-coredns-addon-deploy-job--1-zkfq7 0/1 Completed 0 7d3h
kube-system rke-ingress-controller-deploy-job--1-bjgvc 0/1 Completed 0 7d3h
kube-system rke-metrics-addon-deploy-job--1-6j72t 0/1 Completed 0 7d3h
kube-system rke-network-plugin-deploy-job--1-qk4j7 0/1 Completed 0 7d3h
Any update on this?
Hi, can someone please help?
Also, let me know if there are any consultation services we could engage for support.
f
The rancher pods seem to have settled (only a few restarts), so accessing Rancher should be possible. The troubleshooting remains the same: start with the entry point (the Rancher URL) and go from there. Does it still point to the EC2 instance, or to something else? Does the ingress controller log the request when you try to visit the Rancher URL? If it does, what does it log? Can it reach the rancher pods to forward the request? What is kube-vip doing on this cluster? Is it interfering with the route I just described? The way to get to the root cause is to eliminate every possible issue until it is found, so if there is no reason to have kube-vip running, you should remove it for now until you can access the UI (see the sketch below). Support options can be found on https://www.suse.com/support/
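If you decide to take kube-vip out of the picture for now, something like this could do it (a sketch; it assumes kube-vip was deployed as the kube-vip-ds DaemonSet suggested by the pod name in the earlier list, so verify before deleting):
# Inspect how kube-vip is deployed and which IP it advertises
kubectl -n kube-system get daemonset kube-vip-ds -o yaml
# Remove it temporarily while troubleshooting UI access
kubectl -n kube-system delete daemonset kube-vip-ds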
f
I cleaned up the servers since I thought I wouldn't be able to get a response. Anyway, I still have this doubt. Our setup is an intranet accessible only within our organization. We have three servers: 172.27.16.66 (where the RKE single-node Rancher is installed) and two downstream cluster servers (172.27.16.68, the k8s worker, and 172.27.16.69, the k8s master). We use the domain hurancher.omkar.org, which maps to the VIP (virtual IP) 172.27.16.223, and that is why kube-vip is present in this cluster. The VIP internally redirects to 172.27.16.66. Is a VIP required for a single-node Rancher setup?
f
Single node is never recommended, for obvious HA reasons. For an HA RKE1 setup, you can refer to https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/infrastructure-setup/ha-rke1-kubernetes-cluster. It uses a load balancer, but of course any implementation you wish can be used to achieve HA. You can use a virtual IP for it and fail it over to other nodes (if they get added in the future), as long as it doesn't interfere with the components in between.
f
I am using a single node for a POC setup; I don't feel the need to use six servers for a small demo. What should be done if the UI keeps going down? And is there a quick way to restore the UI? I tried restoring from a snapshot, but it did not work every time.
f
There is still not enough information about why it "keeps going down"; the initial problem was that the UI was not reachable at all, but if it is intermittently unavailable, that is a different story. We still need the answers to the questions above to say why it is not working in your environment. For a POC/demo, you should be able to set this up quite quickly and prove that it works. The documentation describes disaster recovery pretty well, for example https://ranchermanager.docs.rancher.com/pages-for-subheaders/backup-restore-and-disaster-recovery
f
Hello, I have set up a new cluster again and I am not able to access the UI again. These are the ingress logs:
[rke-admin@poclphusamaster ~]$ kubectl -n ingress-nginx logs -l app=ingress-nginx
I1019 14:15:23.369826       7 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"nginx-ingress-controller-n46l2", UID:"4c6f592e-728e-4a08-b853-c66f4f2c8712", APIVersion:"v1", ResourceVersion:"1097", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
I1019 14:15:52.990390       7 status.go:300] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[] newValue=[{IP:172.27.16.66 Hostname: Ports:[]}]
I1019 14:15:53.000590       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"c70fc68a-cd4c-440a-a2c9-bf27ee9fdab5", APIVersion:"networking.k8s.io/v1", ResourceVersion:"5553", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I1019 14:16:39.547441       7 admission.go:149] processed ingress via admission controller {testedIngressLength:1 testedIngressTime:0.128s renderingIngressLength:1 renderingIngressTime:0s admissionTime:17.9kBs testedConfigurationSize:0.128}
I1019 14:16:39.547518       7 main.go:101] "successfully validated configuration, accepting" ingress="rancher/cattle-system"
I1019 14:16:39.582317       7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"c70fc68a-cd4c-440a-a2c9-bf27ee9fdab5", APIVersion:"networking.k8s.io/v1", ResourceVersion:"5770", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
E1019 15:32:09.288687       7 leaderelection.go:330] error retrieving resource lock ingress-nginx/ingress-controller-leader-nginx: Get "https://10.43.0.1:443/api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx": context deadline exceeded
I1019 15:32:09.288829       7 leaderelection.go:283] failed to renew lease ingress-nginx/ingress-controller-leader-nginx: timed out waiting for the condition
I1019 15:32:09.288986       7 leaderelection.go:248] attempting to acquire leader lease ingress-nginx/ingress-controller-leader-nginx...
I1019 15:32:17.376462       7 leaderelection.go:258] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I see all the cattle-system pods are running:
[rke-admin@poclphusamaster ~]$ k get pods -n cattle-system
NAME READY STATUS RESTARTS AGE
rancher-6bcbdd6cb7-k7lgg 1/1 Running 0 13h
rancher-6bcbdd6cb7-rpmjl 1/1 Running 0 13h
rancher-6bcbdd6cb7-sjqkc 1/1 Running 0 13h
rancher-webhook-5d4f5b7f6d-cptfk 1/1 Running 0 13h
Also the ingress admission jobs are completed and the controller is running:
ingress-nginx ingress-nginx-admission-create--1-rjz2p 0/1 Completed 0 13h
ingress-nginx ingress-nginx-admission-patch--1-hlxxz 0/1 Completed 0 13h
ingress-nginx nginx-ingress-controller-n46l2 1/1 Running 0 13h
This is the ingress URL:
[rke-admin@poclphusamaster ~]$ kubectl -n cattle-system get ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
rancher nginx hurancher.zeomega.org 172.27.16.66 80, 443 13h
What is wrong here?
f
It doesn't seem like it is being hit at all. Can you share the output of
curl -v https://hurancher.zeomega.org
You might need to add -k if it's not a valid certificate.
f
Please find the below output.
[rke-admin@poclphusamaster ~]$ curl -v https://hurancher.zeomega.org -k
* Rebuilt URL to: https://hurancher.zeomega.org/
*   Trying 172.27.250.78...
* TCP_NODELAY set
* Connected to hurancher.zeomega.org (172.27.250.78) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=*.zeomega.org
*  start date: Nov 18 17:39:39 2022 GMT
*  expire date: Nov 18 17:39:39 2023 GMT
*  issuer: C=US; ST=Texas; L=Houston; O=SSL Corp; CN=SSL.com SSL Intermediate CA ECC R2
*  SSL certificate verify ok.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/1.1
> Host: hurancher.zeomega.org
> User-Agent: curl/7.61.1
> Accept: */*
>
* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
* Closing connection 0
* TLSv1.3 (OUT), TLS alert, [no content] (0):
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
Here I see some errors:
[rke-admin@poclphusamaster ~]$ kubectl -n cattle-system logs -l app=rancher
2023/10/20 18:29:25 [ERROR] error syncing 'cattle-fleet-system/helm-operation-sfz8k': handler helm-operation: an error on the server ("container not found (\"proxy\")") has prevented the request from succeeding (get pods helm-operation-sfz8k), requeuing
2023/10/20 18:29:27 [INFO] kontainerdriver googlekubernetesengine listening on address 127.0.0.1:34535
2023/10/20 18:29:27 [INFO] kontainerdriver amazonelasticcontainerservice listening on address 127.0.0.1:40211
2023/10/20 18:29:27 [INFO] kontainerdriver azurekubernetesservice listening on address 127.0.0.1:46119
2023/10/20 18:29:27 [INFO] kontainerdriver googlekubernetesengine stopped
2023/10/20 18:29:27 [INFO] dynamic schema for kontainerdriver googlekubernetesengine updating
2023/10/20 18:29:27 [INFO] kontainerdriver amazonelasticcontainerservice stopped
2023/10/20 18:29:27 [INFO] dynamic schema for kontainerdriver amazonelasticcontainerservice updating
2023/10/20 18:29:27 [INFO] kontainerdriver azurekubernetesservice stopped
2023/10/20 18:29:27 [INFO] dynamic schema for kontainerdriver azurekubernetesservice updating
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=Namespace controller
2023/10/20 18:29:07 [INFO] Starting apiregistration.k8s.io/v1, Kind=APIService controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=ResourceQuota controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=ServiceAccount controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=Node controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=Role controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller
2023/10/20 18:29:07 [INFO] Starting cluster agent for local [owner=true]
2023/10/20 18:29:06 [INFO] Starting cluster controllers for local
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=Secret controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=ServiceAccount controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller
2023/10/20 18:29:07 [INFO] Starting /v1, Kind=Namespace controller
2023/10/20 18:29:07 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=Role controller
2023/10/20 18:29:09 [INFO] Handling backend connection request [10.42.0.27]
f
What is 172.27.250.78, and what is the network path / which component is in between from the machine running curl to 172.27.250.78, and from there to 172.27.16.66?
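One way to answer that from the machine running curl (a sketch using standard tools; the hostname and IPs are the ones from this thread):
# What the hostname resolves to from this machine
dig +short hurancher.zeomega.org
# The network path towards the resolved address and towards the Rancher node
traceroute -n 172.27.250.78
traceroute -n 172.27.16.66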
f
As I said earlier in the conversation, I'm using a VIP and mapping the domain to that VIP. The VIP points to 172.27.16.66 only.
f
Remove the VIP in between and query the ingress on the node directly to see if that works. Like I said earlier, eliminate components one by one until the root cause is found. For example:
curl https://hurancher.zeomega.org --resolve hurancher.zeomega.org:443:172.27.16.66
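As a follow-up, comparing the direct node path with the VIP path can show where the connection is dropped (a sketch; 172.27.16.223 is the VIP mentioned earlier, and -k skips certificate verification):
# Directly against the node running the ingress controller
curl -kv https://hurancher.zeomega.org --resolve hurancher.zeomega.org:443:172.27.16.66
# Through the VIP, for comparison
curl -kv https://hurancher.zeomega.org --resolve hurancher.zeomega.org:443:172.27.16.223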
f
Hello, I have started getting this error; I can see it on the Rancher server.
[rke-admin@poclphusamaster ~]$ kubectl -n cattle-system logs -l app=rancher
2023/10/23 14:36:39 [INFO] Starting apiregistration.k8s.io/v1, Kind=APIService controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=LimitRange controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=Namespace controller
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=Secret controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=ServiceAccount controller
2023/10/23 14:36:39 [INFO] Starting cluster agent for local [owner=true]
2023/10/23 14:37:53 [INFO] Stopping cluster agent for c-hpwz8
2023/10/23 14:37:53 [ERROR] failed to start cluster controllers c-hpwz8: context canceled
time="2023-10-23 14:37:02" level=error msg="Failed to get HPA for project c-hpwz8:p-4q9vw err=Unknown schema type [horizontalPodAutoscaler]"
time="2023-10-23 14:37:02" level=error msg="Failed to get Pod for project c-hpwz8:p-4q9vw err=Unknown schema type [pod]"
time="2023-10-23 14:37:02" level=error msg="Failed to get Namespaces for project c-hpwz8:p-fmkr5 err=Unknown schema type [namespace]"
time="2023-10-23 14:37:02" level=error msg="Failed to get Workload for project c-hpwz8:p-fmkr5 err=Unknown schema type [workload]"
time="2023-10-23 14:37:02" level=error msg="Failed to get HPA for project c-hpwz8:p-fmkr5 err=Unknown schema type [horizontalPodAutoscaler]"
time="2023-10-23 14:37:02" level=error msg="Failed to get Pod for project c-hpwz8:p-fmkr5 err=Unknown schema type [pod]"
2023/10/23 14:37:38 [ERROR] error syncing 'c-hpwz8': handler cluster-deploy: Get "https://172.27.16.69:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/23 14:37:59 [INFO] Stopping cluster agent for c-hpwz8
2023/10/23 14:37:59 [ERROR] failed to start cluster controllers c-hpwz8: context canceled
2023/10/23 14:38:17 [ERROR] error syncing 'c-hpwz8': handler cluster-deploy: Get "https://172.27.16.69:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": cluster agent disconnected, requeuing
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller
2023/10/23 14:36:39 [INFO] Starting rbac.authorization.k8s.io/v1, Kind=Role controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=Namespace controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=Secret controller
2023/10/23 14:36:39 [INFO] Starting /v1, Kind=ServiceAccount controller
2023/10/23 14:36:44 [INFO] Handling backend connection request [10.42.0.28]
2023/10/23 14:38:22 [INFO] Stopping cluster agent for c-hpwz8
2023/10/23 14:38:22 [ERROR] failed to start cluster controllers c-hpwz8: context canceled
In the downstream cluster I can see that there are 2 snapshots
[rke-admin@Poclphusanode2 etcd-snapshots]$ ll
total 2192
-rw------- 1 root root 1423633 Oct 23 09:35 c-hpwz8-rl-6p2wh_2023-10-23T14:34:50Z.zip
-rw------- 1 root root  815925 Oct 20 23:07 c-hpwz8-rl-75hmd_2023-10-21T04:07:04Z.zip
I am aware of the command I can run on the Rancher node, i.e. rke etcd snapshot-restore --name snapshot.db --config cluster.yml, and I know that if I want to recover the downstream cluster I can do it through the UI, but the UI is not accessible now. How can I restore the downstream cluster if I do not have the cluster.yml file there?
f
Rancher is not accessible, so the error about cluster agents not being able to connect is confirming that. What is the outcome of the command above? Making Rancher accessible again is the first step.
f
That is what I want to know. I have a snapshot in the downstream cluster and I want to recover it. What is the command to be run there, as we do not have the cluster.yml file in that downstream cluster?