narrow-noon-75604
09/01/2022, 6:13 PM
limited-motherboard-41807
09/02/2022, 3:36 PM
/var/lib/rancher/rke2/server/logs
but the directory is empty.
Do you know where I should look?
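A short sketch of where RKE2 server logs normally end up when the service runs under systemd; the paths below are the RKE2 defaults rather than something taken from this thread:
# rke2-server itself logs to journald, not to /var/lib/rancher/rke2/server/logs
journalctl -u rke2-server -f
# kubelet and containerd keep separate log files under the agent directory
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
tail -f /var/lib/rancher/rke2/agent/containerd/containerd.log
# control-plane components run as static pods, so their output lands under the pod log directory
ls /var/log/pods/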
billions-easter-91774
09/04/2022, 2:56 PM
billions-easter-91774
09/04/2022, 9:00 PM
silly-jordan-81965
09/05/2022, 5:42 AM
freezing-teacher-93828
09/05/2022, 8:49 AM
token: masked
server: https://cluster.example.com:9345
tls-san:
- server2
- server2.example.com
- cluster.example.com
- 12.34.56.78
disable: rke2-ingress-nginx
disable-kube-proxy: true
cni:
- cilium
(I masked the kube-vip IP address and wrote 12.34.56.78 instead)
freezing-teacher-93828
09/05/2022, 9:06 AM
In a high-availability RKE2 cluster (using kube-vip and 3 servers), should the file /etc/rancher/rke2/config.yaml be identical on the three servers (server1, server2, server3)?
I meant identical except for the tls-san section, where the files can differ. For example:
server1:
tls-san:
- server1
- server1.example.com
- cluster.example.com
- 12.34.56.78
server2:
tls-san:
- server2
- server2.example.com
- cluster.example.com
- 12.34.56.78
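For reference, a minimal sketch of how the per-server /etc/rancher/rke2/config.yaml files can differ in an RKE2 HA setup behind a kube-vip VIP; the values reuse the examples above, and the exact layout is an assumption based on the RKE2 HA documentation rather than something confirmed in this thread:
# server1 (bootstraps the cluster): typically no "server:" line
token: masked
tls-san:
- server1
- server1.example.com
- cluster.example.com
- 12.34.56.78

# server2 / server3 (join through the registration address, here the VIP)
token: masked
server: https://cluster.example.com:9345
tls-san:
- server2
- server2.example.com
- cluster.example.com
- 12.34.56.78
Apart from tls-san (and the server: line on the bootstrap node), options such as disable, disable-kube-proxy and cni are generally kept identical on all three servers so the control-plane components are configured the same way.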
narrow-noon-75604
09/06/2022, 9:25 AM
bored-rain-98291
09/06/2022, 6:23 PM
acoustic-motherboard-98931
09/07/2022, 12:55 PM
~$ sudo curl -sfL https://get.rke2.io | sh -
And after that, I receive this error with the second command:
~$ sudo systemctl enable rke2-server.service
Failed to enable unit: Unit file rke2-server.service does not exist.
Any hint on this?
Thanks
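A hedged checklist for the missing rke2-server.service unit. One thing to note about the install command above is that sudo applies to curl, not to the shell the script is piped into, so the installer may have run without root and been unable to write the unit files; INSTALL_RKE2_TYPE is the installer's documented way to pick the role:
# was a unit file installed at all? (tarball installs use /usr/local, RPM installs /usr)
ls -l /usr/local/lib/systemd/system/rke2-server.service /usr/lib/systemd/system/rke2-server.service
# re-run the installer as root, explicitly requesting the server role
curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_TYPE="server" sh -
# then reload systemd and enable the service
sudo systemctl daemon-reload
sudo systemctl enable --now rke2-server.service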
polite-breakfast-84569
09/08/2022, 10:37 AM
The kubeconfig for this new cluster has my rancher-server as the server endpoint, so I suppose Rancher is balancing the connections between me and the master nodes. Is that correct?
Additionally, I did not see any configuration on the worker nodes for the kubelet to talk to the masters in HA. So it seems to me that e.g. worker-1 can talk only to master-1. I have seen people set up an ha-proxy on the worker nodes so they are able to communicate with any of the masters, but here I do not see any setup like that by default.
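As far as I know, RKE2 agents point at a single fixed registration address and maintain their own local client-side load balancer toward all known servers, so an external ha-proxy on the workers is normally not needed. A sketch of a worker's /etc/rancher/rke2/config.yaml, reusing the VIP hostname from earlier in this thread purely as an example:
# /etc/rancher/rke2/config.yaml on a worker (agent) node
server: https://cluster.example.com:9345   # fixed registration address (VIP or round-robin DNS)
token: masked                              # shared cluster join token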
hundreds-airport-66196
09/08/2022, 2:15 PM
bright-fireman-42144
09/08/2022, 4:59 PM
bright-fireman-42144
09/09/2022, 12:48 AM
bright-fireman-42144
09/09/2022, 12:48 AM
bright-fireman-42144
09/09/2022, 12:53 AM
bright-fireman-42144
09/09/2022, 11:54 AM
magnificent-vr-88571
09/11/2022, 8:26 PM
>> kubectl get pod -n kube-system
cilium-4xc5q 1/1 Running 0 8h
cilium-89vrg 1/1 Running 0 8h
cilium-cg8gn 1/1 Running 6 8h
cilium-gbbl7 1/1 Running 1 8h
cilium-j8s9t 1/1 Running 3 8h
cilium-jfs9f 1/1 Running 1 179m
cilium-ld9fc 1/1 Running 0 8h
cilium-lz2hj 1/1 Running 0 8h
cilium-node-init-7ltcv 1/1 Running 0 8h
cilium-node-init-gzhvc 1/1 Running 0 8h
cilium-node-init-hqnrk 1/1 Running 0 179m
cilium-node-init-j2ffd 1/1 Running 0 8h
cilium-node-init-j5q52 1/1 Running 3 8h
cilium-node-init-mmbjj 1/1 Running 0 8h
cilium-node-init-qk6pj 1/1 Running 1 8h
cilium-node-init-w87qb 1/1 Running 3 8h
cilium-node-init-zfrt9 1/1 Running 0 8h
cilium-nxqxb 1/1 Running 0 8h
cilium-operator-fccb67dc5-srt76 1/1 Running 5 8h
cilium-operator-fccb67dc5-wsr5m 1/1 Running 3 8h
cloud-controller-manager-sv-svr1 1/1 Running 3 9h
cloud-controller-manager-sv-svr2 1/1 Running 3 8h
cloud-controller-manager-sv-svr3 1/1 Running 3 8h
etcd-sv-svr1 1/1 Running 8 9h
etcd-sv-svr2 1/1 Running 3 8h
etcd-sv-svr3 1/1 Running 3 147m
external-dns-dc9dd7d74-h6dqw 1/1 Running 1 90d
helm-install-rke2-metrics-server-cmgjc 0/1 CrashLoopBackOff 72 5h40m
kube-apiserver-sv-svr1 1/1 Running 1 9h
kube-apiserver-sv-svr2 1/1 Running 3 8h
kube-apiserver-sv-svr3 1/1 Running 3 140m
kube-controller-manager-sv-svr1 1/1 Running 3 9h
kube-controller-manager-sv-svr2 1/1 Running 3 8h
kube-controller-manager-sv-svr3 1/1 Running 3 8h
kube-proxy-sv-agent3 1/1 Running 0 7h40m
kube-proxy-sv-agent4 1/1 Running 0 8h
kube-proxy-sv-agent5 1/1 Running 0 8h
kube-proxy-sv-agent6 1/1 Running 0 8h
kube-proxy-sv-svr1 1/1 Running 1 9h
kube-proxy-sv-svr2 1/1 Running 3 8h
kube-proxy-sv-svr3 1/1 Running 3 8h
kube-proxy-sv-agent1 1/1 Running 0 8h
kube-proxy-sv-agent2 1/1 Running 0 3h
kube-scheduler-sv-svr1 1/1 Running 3 9h
kube-scheduler-sv-svr2 1/1 Running 3 8h
kube-scheduler-sv-svr3 1/1 Running 3 8h
kube-vip-cloud-provider-0 1/1 Running 3 8h
kube-vip-ds-5q5qw 1/1 Running 3 8h
kube-vip-ds-fw8zv 1/1 Running 3 8h
kube-vip-ds-rmqhc 1/1 Running 4 8h
metrics-server-8bbfb4bdb-rzpnp 1/1 Running 5 7h33m
rke2-coredns-rke2-coredns-855c5d9879-9fwhx 1/1 Running 0 5h40m
rke2-coredns-rke2-coredns-855c5d9879-j7wbc 0/1 CrashLoopBackOff 41 3h3m
rke2-coredns-rke2-coredns-autoscaler-7c77dcfb76-hm78m 1/1 Running 3 8h
rke2-ingress-nginx-controller-4kvdx 1/1 Running 2 8h
rke2-ingress-nginx-controller-8k5z8 1/1 Running 0 8h
rke2-ingress-nginx-controller-c6r5q 1/1 Running 0 179m
rke2-ingress-nginx-controller-cx88s 1/1 Running 0 8h
rke2-ingress-nginx-controller-jl74q 1/1 Running 1 8h
rke2-ingress-nginx-controller-nr2qp 1/1 Running 8 8h
rke2-ingress-nginx-controller-p6sfq 1/1 Running 3 8h
rke2-ingress-nginx-controller-qmbzn 1/1 Running 0 8h
rke2-ingress-nginx-controller-wj54z 1/1 Running 0 8h
rke2-metrics-server-5df7d77b5b-b4qlw 1/1 Running 20 74d
The following errors are observed, and the volumes are not being mounted.
E0911 20:16:38.965933 17195 kubelet.go:1701] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[data], unattached volumes=[data kube-api-access-ztp4j dshm]: timed out waiting for the condition" pod="cvat/cvat-postgresql-0"
E0911 20:23:07.393663 16782 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"container\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=container pod=metadata-grpc-deployment-f8d68f687-5fvbs_kubeflow(d72591f7-e2c4-475f-ad83-fc59c996219a)\"" pod="kubeflow/metadata-grpc-deployment-f8d68f687-5fvbs" podUID=d72591f7-e2c4-475f-ad83-fc59c996219a
I0911 20:23:08.718940 16782 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-62552b22-3e99-4b63-8a56-69519573ae1d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-62552b22-3e99-4b63-8a56-69519573ae1d\") pod \"loki-0\" (UID: \"8aef7574-fb66-415f-a130-6b8ec9091672\") "
E0911 20:23:08.724147 16782 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-62552b22-3e99-4b63-8a56-69519573ae1d podName: nodeName:}" failed. No retries permitted until 2022-09-11 20:25:10.724134581 +0000 UTC m=+21624.816950484 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-62552b22-3e99-4b63-8a56-69519573ae1d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-62552b22-3e99-4b63-8a56-69519573ae1d\") pod \"loki-0\" (UID: \"8aef7574-fb66-415f-a130-6b8ec9091672\") "
I0911 20:23:09.829046 16782 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-c6597566-f0c6-40b3-be5b-9d670f51748d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-c6597566-f0c6-40b3-be5b-9d670f51748d\") pod \"harbor-redis-0\" (UID: \"912226dd-12cf-4cb5-a54b-fb831b4e7e73\") "
E0911 20:23:09.831850 16782 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-c6597566-f0c6-40b3-be5b-9d670f51748d podName: nodeName:}" failed. No retries permitted until 2022-09-11 20:25:11.831837052 +0000 UTC m=+21625.924652956 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-c6597566-f0c6-40b3-be5b-9d670f51748d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-c6597566-f0c6-40b3-be5b-9d670f51748d\") pod \"harbor-redis-0\" (UID: \"912226dd-12cf-4cb5-a54b-fb831b4e7e73\") "
Any input on how to recover?
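A rough checklist for the "Volume not attached according to node status" errors; the longhorn-system namespace and the cvat pod namespace below are assumptions based on the log lines above:
# is the CSI attach recorded at the Kubernetes level?
kubectl get volumeattachments | grep pvc-62552b22
# are the Longhorn manager / CSI pods healthy?
kubectl -n longhorn-system get pods
# what state does Longhorn itself report for the volume?
kubectl -n longhorn-system get volumes.longhorn.io | grep pvc-62552b22
# pod events usually show why the mount keeps timing out
kubectl -n cvat describe pod cvat-postgresql-0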
echoing-oxygen-99290
09/13/2022, 3:16 PM
cert-manager and kube-vip. I have created two tars:
cert-manager: docker save quay.io/jetstack/cert-manager-cainjector:v1.9.1 \
  quay.io/jetstack/cert-manager-controller:v1.9.1 \
  quay.io/jetstack/cert-manager-webhook:v1.9.1 \
  quay.io/jetstack/cert-manager-ctl:v1.9.1 | gzip > cert-manager.tar.gz
kube-vip: docker save ghcr.io/kube-vip/kube-vip:v0.5.0 | gzip > kube-vip.tar.gz
I have copied both into the images directory:
root@rke-test-cluster-node-0:~# ls /var/lib/rancher/rke2/agent/images/
cert-manager.tar.gz kube-vip.tar.gz rke2-images.linux-amd64.tar.zst
cert-manager is able to come up without issue, but I run into issues with kube-vip.
Failed to pull image "ghcr.io/kube-vip/kube-vip:v0.5.0": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/kube-vip/kube-vip:v0.5.0": failed to resolve reference "ghcr.io/kube-vip/kube-vip:v0.5.0": failed to do request: Head "https://ghcr.io/v2/kube-vip/kube-vip/manifests/v0.5.0": dial tcp 140.82.112.33:443: i/o timeout
When listing my available images, the kube-vip image seems to be available:
root@rke-test-cluster-node-0:~# /var/lib/rancher/rke2/bin/crictl images | grep -e kube-vip -e cert-manager
ghcr.io/kube-vip/kube-vip                  v0.5.0   09067696476ff   37.9MB
quay.io/jetstack/cert-manager-cainjector   v1.9.1   11778d29f8cc2   39.2MB
quay.io/jetstack/cert-manager-controller   v1.9.1   8eaca4249b016   57.2MB
quay.io/jetstack/cert-manager-ctl          v1.9.1   0a3af10d53674   50.2MB
quay.io/jetstack/cert-manager-webhook      v1.9.1   d3348bcdc1e7e   45.8MB
It seems it is trying to reach out to the internet for the image rather than using the one available locally. Could someone provide any insight into what settings, if any, I can look into, or steps I can take to debug this further?
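One possible cause, offered as a guess rather than a confirmed diagnosis: a pre-loaded image is only used when the kubelet is not forced to re-pull it, so an imagePullPolicy of Always in the kube-vip manifest would still try to resolve ghcr.io even though crictl shows the image locally. A fragment of what the container spec could look like:
containers:
- name: kube-vip
  image: ghcr.io/kube-vip/kube-vip:v0.5.0
  imagePullPolicy: IfNotPresent   # "Always" forces the network pull that times out when air-gapped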
magnificent-vr-88571
09/13/2022, 9:05 PM
Sep 13 17:41:21 svmaster rke2[14824]: time="2022-09-13T17:41:21+09:00" level=info msg="Latest etcd manifest deployed"
Sep 13 17:41:22 svmaster rke2[14824]: {"level":"warn","ts":"2022-09-13T17:41:22.837+0900","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Sep 13 17:43:23 svmaster rke2[21854]: time="2022-09-13T17:43:23+09:00" level=info msg="Stopped tunnel to 127.0.0.1:9345"
Sep 13 17:43:23 svmaster rke2[21854]: time="2022-09-13T17:43:23+09:00" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:9345/v1-rke2/connect"
Sep 13 17:43:23 svmaster rke2[21854]: time="2022-09-13T17:43:23+09:00" level=info msg="Connecting to proxy" url="wss://192.168.7.15:9345/v1-rke2/connect"
Sep 13 17:43:23 svmaster rke2[21854]: time="2022-09-13T17:43:23+09:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Sep 13 17:43:23 svmaster rke2[21854]: time="2022-09-13T17:43:23+09:00" level=info msg="Handling backend connection request [svmaster]"
And the agent journalctl logs show the following:
Sep 13 18:03:02 svagent rke2[2659960]: W0913 18:03:02.914417 2659960 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 13 18:03:06 svagent rke2[2659960]: time="2022-09-13T18:03:06+09:00" level=debug msg="Wrote ping"
Any input on how to resolve this?
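A hedged way to confirm whether etcd on the servers is healthy at all; the pod name and certificate paths follow the usual RKE2 layout and would need to be adjusted to the node in question:
# list the etcd container via RKE2's bundled crictl
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps | grep etcd
# query etcd health from inside the etcd static pod (adjust the pod name to the node)
kubectl -n kube-system exec etcd-svmaster -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health --cluster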
rapid-toddler-64209
09/14/2022, 7:52 AM
freezing-wolf-83208
09/14/2022, 10:10 AM
prehistoric-solstice-99854
09/15/2022, 8:58 PM
kubectl. However, when I tried to access Rancher after a successful install, the site never fully loads. I looked through the logs and have determined that DNS isn’t working, and that is causing the problem. I got a shell inside a container and confirmed that I can ping an IP but not a domain name.
I’ve disabled firewalld, temporarily disabled SELinux, and updated NetworkManager to ignore CNI traffic on all RKE2 nodes. The 3 management nodes and 3 worker nodes have no DNS issues; only the pods do. I’m not sure what to try next. It appears the issue is communication between pods. Could anyone point me in the right direction on this? I’ve looked for generic coredns troubleshooting and nothing has helped me find the problem yet.
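A standard in-cluster DNS check, sketched with the upstream dnsutils test image; the CoreDNS service name and the 10.43.0.10 cluster DNS IP are the usual RKE2 defaults and may differ in this cluster:
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -- sleep 3600
kubectl exec dnsutils -- nslookup kubernetes.default              # via the pod's configured resolver
kubectl exec dnsutils -- nslookup kubernetes.default 10.43.0.10   # directly against the CoreDNS service IP
kubectl -n kube-system get endpoints rke2-coredns-rke2-coredns    # does the service have ready endpoints?
If the direct query against the service IP only fails from pods on nodes that do not host a CoreDNS replica, that usually points at cross-node pod traffic being blocked (CNI/VXLAN ports) rather than at CoreDNS itself.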
rapid-toddler-64209
09/19/2022, 8:55 AM
rapid-toddler-64209
09/19/2022, 9:01 AM
bright-whale-83501
09/19/2022, 6:04 PM
shy-megabyte-75492
09/21/2022, 1:00 AM
shy-megabyte-75492
09/21/2022, 1:00 AM
swift-zebra-42479
09/21/2022, 6:35 AM
bright-whale-83501
09/21/2022, 9:02 AM