# rke2
m
@creamy-pencil-82913, for the above situation, if I bring up a new server with a different IP and node name, can I follow https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-new-nodes to bring back the cluster? Currently, after moving the existing single-node master from 10.21.100.19 to 172.18.14.15, it is not starting services and fails with the following error.
Copy code
msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7ba99b6a=<https://10.21.100.19:2380>], expect: svmaster-7ba99b6a=172.18.14.15"
c
If it’s a single-server cluster, you need to do a --cluster-reset if you change the server’s IP. If you change the IP address of a server in a multi-server cluster, you should delete that node from the cluster and then rejoin it. Node IPs are expected to be static for the life of the node in the cluster.
m
it’s actually a 6-node cluster with 1 master + 5 workers. By single node, I meant single master.
c
RKE2 doesn’t have masters or workers, it has servers and agents. What you’re describing is single-server.
m
yes 1 server and 5 agents
so I need to:
• run “--cluster-reset” on the server node
• set advertise-address: to the new IP in /etc/rancher/rke2/config.yaml on the server node, and start the service
• update the config file of the agent nodes to point at the new “IP address of server node”
Am I right? (rough sketch of those steps below)
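A minimal sketch of those steps, assuming the default rke2-server/rke2-agent systemd units and the standard supervisor port 9345 (the agent-side setting is the server: key in config.yaml):
Copy code
# on the server node
systemctl stop rke2-server
rke2 server --cluster-reset    # re-forms etcd as a single-member cluster on this node
systemctl start rke2-server

# on each agent node: point config.yaml at the server's new IP, then restart
# /etc/rancher/rke2/config.yaml
#   server: https://172.18.14.15:9345
systemctl restart rke2-agent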
c
you shouldn’t need to set the advertise address, unless it’s detecting the wrong IP for some reason
Stop the service, run
rke2 server --cluster-reset
and wait for it to finish, then start the service again.
m
noticing many errors in the process after executing
rke2 server --cluster-reset
Copy code
WARN[0305] Unable to watch for tunnel endpoints: Get "https://127.0.0.1:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=0&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
INFO[0305] Waiting for kubelet to be ready on node sv13: Get "https://127.0.0.1:6443/api/v1/nodes/svmaster": dial tcp 127.0.0.1:6443: connect: connection refused
INFO[0310] Waiting for API server to become available
INFO[0310] Waiting for etcd server to become available
INFO[0310] Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7d4b12a0=https://10.21.100.19:2380], expect: svmaster-7d4b12a0=172.18.14.15
INFO[0310] Cluster-Http-Server 2022/09/03 01:42:09 http: TLS handshake error from 127.0.0.1:42616: remote error: tls: bad certificate
INFO[0310] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
It’s an air-gapped environment, so there is no internet connectivity for now.
Looks like the following message is repeating. As you mentioned, I have removed “advertise-address:” from /etc/rancher/rke2/config.yaml
Copy code
INFO[0310] Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7d4b12a0=https://10.21.100.19:2380], expect: svmaster-7d4b12a0=172.18.14.15
c
does the host have multiple interfaces? Did you want to use --node-ip instead of --advertise-address?
m
Copy code
IPv4 address for cilium_host:     10.42.0.111
IPv4 address for eno1:            172.18.14.15
IPv4 address for enxb03af2b6059f: 169.254.3.1
I can see the above IPs on the node
Copy code
eno1:            172.18.14.15
is the only physical network
Do I need to set “advertise-address:” back in /etc/rancher/rke2/config.yaml with “172.18.14.15”? Earlier I had set it to the previous IP.
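For reference, a minimal config.yaml sketch using the node-ip option mentioned above to pin RKE2 to the physical interface (whether node-ip, advertise-address, or neither is actually needed here is still an open question at this point):
Copy code
# /etc/rancher/rke2/config.yaml on the server (sketch only)
node-ip: 172.18.14.15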
c
Did the
rke2 server --cluster-reset
complete?
it looks like it’s still using the old cluster config
c
try running
rke2-killall.sh
first
if it still doesn’t reset successfully, post the full log, not just the bit that it’s repeating
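One way to capture the full output, as a sketch (assuming the reset is run in the foreground and the service history is in journald):
Copy code
rke2 server --cluster-reset 2>&1 | tee cluster-reset.log
journalctl -u rke2-server --no-pager > rke2-server.log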
m
sure got it
it failed with the attached error
I am using Longhorn for storage; it may have something to do with https://github.com/kubernetes/kubernetes/issues/105536
c
hmm, indeed. it appears to be stuck trying to clean up the longhorn mount points
would you mind opening an issue on GH, and attach that log?
m
sure, no problem. For now I can delete those dirs manually, right?
c
yeah… honestly the killall should have cleaned those up, but the kubelet tries to also do it, and it seems to be stuck trying to talk to longhorn which obviously isn’t running since none of the pods are up.
you might delete that longhorn socket from the kubelet plugin dir, see if that helps
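A sketch of that cleanup, using the Longhorn CSI socket paths that show up in the kubelet logs further down (the registration socket is included as an assumption; adjust if the install differs):
Copy code
rm /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
rm /var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock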
m
oh, got it. let me try that
no luck. I deleted longhorn from kubelet/plugins and ran rke2-killall.sh, but those pods are still in
/var/lib/kubelet/pods
c
yeah, but does the kubelet still hang trying to talk to that socket when you do the cluster-reset?
or is it hanging on something else
m
still facing the below errors. I have removed all dirs under
/var/lib/kubelet/pods/
manually.
Copy code
E0903 02:27:59.425466  235868 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"6525b999-6ed3-40dd-87e6-674063559a95\" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/6525b999-6ed3-40dd-87e6-674063559a95/volumes/kubernetes.io~configmap/istiod-ca-cert: directory not empty" numErrs=13
c
try doing
rm -rf /var/lib/kubelet/pods/*/volumes/kubernetes.io*
I think it’s still related to the longhorn volumes, this is not normally a problem
m
finally received the below message
Copy code
FATA[0100] starting kubernetes: preparing server: start managed database: cluster-reset was successfully performed, please remove the cluster-reset flag and start rke2 normally, if you need to perform another cluster reset, you must first manually delete the /var/lib/rancher/rke2/server/db/reset-flag file
In between, I faced an error due to the removal of longhorn from kubelet/plugins, so I restored it.
While trying to start rke2-server through systemctl, the following log is observed.
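For reference, a sketch of what that FATA message asks for:
Copy code
# start normally, without the --cluster-reset flag
systemctl start rke2-server
# only needed if another cluster reset has to be performed later:
# rm /var/lib/rancher/rke2/server/db/reset-flag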
c
hmm, so it still has the old IP. Do you have that IP listed anywhere in the config file?
I will also say that
v1.21.5+rke2r2
is quite old at this point. We’ve fixed a lot of stuff in the cluster reset; I think correcting the node IP during the reset was one of the fixes
You should go to the latest 1.21 release at the very least, if not up to 1.23 or newer since 1.21 is EOL and 1.22 is 1 month from EOL
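In an air-gapped setup, moving to a newer release is roughly a matter of staging the newer artifacts and re-running the installer; a sketch along the lines of the RKE2 tarball air-gap method (the staging directory is an assumption):
Copy code
# newer rke2 tarball, images archive, and checksum file copied to /root/rke2-artifacts
INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts sh install.sh
systemctl restart rke2-server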
m
Stuck on 1.21 because of Kubeflow :(
So is it good to install the latest 1.21 on another node and try the restore per https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-new-nodes?
c
yeah, or just upgrade that one, either should work
m
The earlier IP was added in the config file as “advertise-address”; now it’s removed.
c
Kubeflow doesn’t support any non-EOL versions of Kubernetes!?
m
Yeah… that’s the latest Kubeflow :(
c
that’s insane
m
While trying to install 1.21 on a new node and restore, the following error was observed and the restore did not succeed.
FATA[0010] starting kubernetes: preparing server: start managed database: etcd: snapshot path does not exist: on-demand-svmaster-1661871603.zip
c
the file doesn’t exist; are you in the correct directory or did you need to pass the absolute path to the file?
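A sketch of the restore invocation with an absolute path (the directory below is the RKE2 default snapshot location and may not be where the copied snapshot actually lives):
Copy code
rke2 server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/on-demand-svmaster-1661871603.zip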
m
sorry, that’s my bad.. I gave the correct path and it looks like the cluster restored
I was able to bring up the cluster following https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-existing-nodes, and updated the server IPs in
/var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
and
/var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json
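A rough sketch of the shape of those load-balancer state files (the ServerURL/ServerAddresses field names are an assumption and may differ by version; the address is a placeholder for the new server IP):
Copy code
{
  "ServerURL": "https://<new-server-ip>:9345",
  "ServerAddresses": ["<new-server-ip>:9345"]
}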
The cluster with 1 server + 5 agents was up and running, but I’m facing trouble with 1 agent not joining, with the following error.
Copy code
Sep 10 22:36:26 sv-agent systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Sep 10 22:36:26 sv-agent sh[4038]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Sep 10 22:36:26 sv-agent sh[4039]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=warning msg="not running in CIS mode"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Starting rke2 agent v1.21.5+rke2r2 (9e4acdc6018ae74c36523c99af25ab861f3884da)"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Running load balancer 127.0.0.1:6444 -> [192.168.x.x:9345]"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Running load balancer 127.0.0.1:6443 -> [192.168.x.x:6443]"
Sep 10 22:36:36 sv-agent rke2[4042]: time="2022-09-10T22:36:36+09:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Sep 10 22:36:46 sv-agent rke2[4042]: time="2022-09-10T22:36:46+09:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Any recommended solution to resolve this?
Noticed the following error on the server
Copy code
Sep 10 22:49:06 sv-master rke2[605992]: time="2022-09-10T22:49:06+09:00" level=info msg="Cluster-Http-Server 2022/09/10 22:49:06 http: TLS handshake error from 192.168.x.x:54796: remote error: tls: bad certificate"
Sep 10 22:49:07 sv-master rke2[605992]: time="2022-09-10T22:49:07+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": dial tcp 10.43.122.217:443: i/o timeout"
Sep 10 22:49:17 sv-master rke2[605992]: time="2022-09-10T22:49:17+09:00" level=info msg="Cluster-Http-Server 2022/09/10 22:49:17 http: TLS handshake error from 192.168.x.x:54810: remote error: tls: bad certificate"
Sep 10 22:49:17 sv-master rke2[605992]: time="2022-09-10T22:49:17+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": context deadline exceeded"
Sep 10 22:49:27 sv-master rke2[605992]: time="2022-09-10T22:49:27+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": dial tcp 10.43.122.217:443: i/o timeout"
c
Did you change the server address on that agent? It looks like it’s not able to connect to the server…
m
yes, config.yaml is the same as on the other agents. The server IP is updated to the new IP.
Can we ignore the TLS bad certificate errors in the server-side log?
c
yes, you’ll see that every time an agent connects as it checks to see if the cert can be validated with the OS CA bundle.
can you successfully curl the server address from the node?
m
Copy code
curl https://192.168.11.103:9345 -kv
*   Trying 192.168.11.103:9345...
* TCP_NODELAY set
* connect to 192.168.11.103 port 9345 failed: Connection refused
* Failed to connect to 192.168.11.103 port 9345: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 192.168.11.103 port 9345: Connection refused
@creamy-pencil-82913 could you please share if there is any particular curl command I should use to verify?
Tried deleting it from the cluster with
kubectl delete node agent
and added it again.. it worked..
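Roughly the flow used here, as a sketch (the node name is a placeholder; restarting the agent forces it to re-register, assuming its config.yaml already points at the correct server):
Copy code
kubectl delete node <agent-node-name>
# then on that agent:
systemctl restart rke2-agent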
As I mentioned earlier it’s a single-server cluster, and I couldn’t find the kube-apiserver/etcd/scheduler pods in the kube-system namespace. Facing the below error in
kubectl get events
Copy code
'(combined from similar events): Unable to attach or mount volumes: unmounted
  volumes=[jenkins-home], unattached volumes=[kube-api-access-r5rgj sc-config-volume
  admin-secret jenkins-config plugins plugin-dir jenkins-home tmp-volume jenkins-cache]:
  timed out waiting for the condition'
c
not sure what you mean you can’t find the pods?
m
only the following pods are in kube-system
Copy code
kubectl get pods -n kube-system
NAME                                                    READY   STATUS      RESTARTS   AGE
cilium-785bs                                            1/1     Running     0          112m
cilium-8t96j                                            1/1     Running     1          142d
cilium-fnwhw                                            1/1     Running     4          84d
cilium-hqt85                                            1/1     Running     1          94d
cilium-jz5l7                                            1/1     Running     1          84d
cilium-l56kv                                            1/1     Running     3          84d
cilium-node-init-4sdnt                                  1/1     Running     0          84d
cilium-node-init-8kccp                                  1/1     Running     0          94d
cilium-node-init-9dgwz                                  1/1     Running     1          142d
cilium-node-init-bt8w7                                  1/1     Running     0          84d
cilium-node-init-h5zp4                                  1/1     Running     0          112m
cilium-node-init-pzkh5                                  1/1     Running     0          94d
cilium-node-init-r9wr9                                  1/1     Running     0          84d
cilium-operator-85f67b5cb7-w676b                        1/1     Running     8          154m
cilium-operator-85f67b5cb7-wwlcr                        1/1     Running     1          142d
cilium-q5t5d                                            1/1     Running     3          84d
external-dns-dc9dd7d74-h6dqw                            1/1     Running     0          89d
helm-install-rke2-cilium-2rc7l                          0/1     Completed   166        14h
helm-install-rke2-coredns-pkk8n                         0/1     Completed   165        14h
helm-install-rke2-ingress-nginx-cg5g2                   0/1     Completed   164        14h
helm-install-rke2-metrics-server-c2cdz                  0/1     Completed   164        14h
kube-proxy-svagent6                                     1/1     Running     2          14h
kube-proxy-svagent5                                     1/1     Running     2          14h
kube-proxy-svagent4                                     1/1     Running     1          14h
kube-proxy-svagent3                                     1/1     Running     2          14h
kube-proxy-svagent1                                     1/1     Running     2          14h
kube-proxy-svagent2                                     1/1     Running     1          112m
metrics-server-8bbfb4bdb-x78tm                          1/1     Running     4          73d
rke2-coredns-rke2-coredns-78d6d5c574-2wcrb              1/1     Running     5          73d
rke2-coredns-rke2-coredns-78d6d5c574-wnvlm              1/1     Running     0          81d
rke2-coredns-rke2-coredns-autoscaler-7c58bd5b6c-25wxc   1/1     Running     5          73d
rke2-ingress-nginx-controller-5xzq5                     1/1     Running     1          84d
rke2-ingress-nginx-controller-7d5d2                     1/1     Running     0          94d
rke2-ingress-nginx-controller-8j6jl                     1/1     Running     1          94d
rke2-ingress-nginx-controller-9sgqb                     1/1     Running     0          112m
rke2-ingress-nginx-controller-cj75v                     1/1     Running     1          84d
rke2-ingress-nginx-controller-lt64l                     1/1     Running     1          89d
rke2-ingress-nginx-controller-rthjm                     1/1     Running     1          84d
rke2-metrics-server-5df7d77b5b-b4qlw                    1/1     Running     10         73d
c
I can’t say I’ve ever seen that. What do you see from the kubelet logs (
/var/lib/rancher/rke2/agent/logs/kubelet.log
) on the servers?
m
Copy code
I0911 10:08:17.644983 2559675 kubelet.go:1846] "Starting kubelet main sync loop"
E0911 10:08:17.645017 2559675 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
I0911 10:08:17.717783 2559675 kuberuntime_manager.go:1044] "Updating runtime config through cri with podcidr" CIDR="10.42.0.0/24"
I0911 10:08:17.718054 2559675 kubelet_network.go:76] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="10.42.0.0/24"
I0911 10:08:17.719543 2559675 kubelet_node_status.go:71] "Attempting to register node" node="sv-server"
I0911 10:08:17.727884 2559675 kubelet_node_status.go:109] "Node was previously registered" node="sv-server"
I0911 10:08:17.727927 2559675 kubelet_node_status.go:74] "Successfully registered node" node="sv-server"
I0911 10:08:17.730628 2559675 setters.go:577] "Node became not ready" node="sv-server" condition={Type:Ready Status:False LastHeartbeatTime:2022-09-11 10:08:17.730606108 +0900 JST m=+5.229502926 LastTransitionTime:2022-09-11 10:08:17.730606108 +0900 JST m=+5.229502926 Reason:KubeletNotReady Message:container runtime status check may not have completed yet}
E0911 10:08:17.745759 2559675 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
I0911 10:08:17.838507 2559675 cpu_manager.go:199] "Starting CPU manager" policy="none"
I0911 10:08:17.838514 2559675 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
I0911 10:08:17.838522 2559675 state_mem.go:36] "Initialized new in-memory state store"
I0911 10:08:17.838607 2559675 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0911 10:08:17.838613 2559675 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
I0911 10:08:17.838616 2559675 policy_none.go:44] "None policy: Start"
I0911 10:08:17.842140 2559675 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
I0911 10:08:17.842227 2559675 operation_generator.go:181] parsed scheme: ""
I0911 10:08:17.842232 2559675 operation_generator.go:181] scheme "" not registered, fallback to default scheme
I0911 10:08:17.842246 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.842250 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:08:17.846483 2559675 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock versions: 1.0.0
I0911 10:08:17.846504 2559675 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
I0911 10:08:17.846527 2559675 clientconn.go:106] parsed scheme: ""
I0911 10:08:17.846530 2559675 clientconn.go:106] scheme "" not registered, fallback to default scheme
I0911 10:08:17.846556 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.846561 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:08:17.846575 2559675 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
I0911 10:08:17.846998 2559675 manager.go:414] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
I0911 10:08:17.847051 2559675 endpoint.go:196] parsed scheme: ""
I0911 10:08:17.847056 2559675 endpoint.go:196] scheme "" not registered, fallback to default scheme
I0911 10:08:17.847067 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia-gpu.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.847071 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
Copy code
I0911 10:23:46.343334 2613006 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
I0911 10:23:46.343408 2613006 operation_generator.go:181] parsed scheme: ""
I0911 10:23:46.343413 2613006 operation_generator.go:181] scheme "" not registered, fallback to default scheme
I0911 10:23:46.343435 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.343441 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.344368 2613006 manager.go:414] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
I0911 10:23:46.344442 2613006 endpoint.go:196] parsed scheme: ""
I0911 10:23:46.344448 2613006 endpoint.go:196] scheme "" not registered, fallback to default scheme
I0911 10:23:46.344463 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia-gpu.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.344469 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.345122 2613006 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock versions: 1.0.0
I0911 10:23:46.345222 2613006 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
I0911 10:23:46.345274 2613006 clientconn.go:106] parsed scheme: ""
I0911 10:23:46.345280 2613006 clientconn.go:106] scheme "" not registered, fallback to default scheme
I0911 10:23:46.345323 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.345328 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.345352 2613006 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
I0911 10:23:46.450868 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.450972 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451007 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451028 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451046 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451070 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451222 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="634a15c871a3e54d0199a15d05f9c72ff5ba31404ea65c8d912dcf06e333895d"
I0911 10:23:46.451231 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="7fce89225ab4c291f8e1b87d8c658dc3088c996e9d3fc7f57b733e283e298521"
I0911 10:23:46.451337 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="325983142c047ff495a362f6707395225b140b31d84dc6506ade1655a288de64"
I0911 10:23:46.451359 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="ee3b6083a3da1d86a6cff65a343eb706db097a7d72d935335c7d6523800733c7"
I0911 10:23:46.451390 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="3d9b0508b979ba80fc6d8dd4b494bcac6c1b8fe408118b22809e2c63c50a11f6"
I0911 10:23:46.451420 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="42cf05994f2f66910cf03179d65eaf1d075085db9da75ef65cb0fdeb31c5aef8"
c
Can you attach the complete kubelet log, as well as the rke2 journald logs?
m
Copy code
I0911 10:23:48.520467 2613006 request.go:668] Waited for 1.194376708s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%!D(MISSING)hubble-server-certs&limit=500&resourceVersion=0
E0911 10:23:48.571164 2613006 configmap.go:200] Couldn't get configMap cattle-monitoring-system/rancher-monitoring-prometheus-adapter: failed to sync configmap cache: timed out waiting for the condition
E0911 10:23:48.571230 2613006 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config podName:bc41451c-7b4e-4103-aba8-d443a8c1a3ef nodeName:}" failed. No retries permitted until 2022-09-11 10:23:49.071212607 +0900 JST m=+8.058691933 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"config\" (UniqueName: \"kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config\") pod \"rancher-monitoring-prometheus-adapter-8846d4757-xmxc6\" (UID: \"bc41451c-7b4e-4103-aba8-d443a8c1a3ef\") : failed to sync configmap cache: timed out waiting for the condition"
E0911 10:23:50.082189 2613006 configmap.go:200] Couldn't get configMap cattle-monitoring-system/rancher-monitoring-prometheus-adapter: failed to sync configmap cache: timed out waiting for the condition
E0911 10:23:50.082248 2613006 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config podName:bc41451c-7b4e-4103-aba8-d443a8c1a3ef nodeName:}" failed. No retries permitted until 2022-09-11 10:23:51.082231147 +0900 JST m=+10.069710473 (durationBeforeRetry 1s). Error: "MountVolume.SetUp failed for volume \"config\" (UniqueName: \"kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config\") pod \"rancher-monitoring-prometheus-adapter-8846d4757-xmxc6\" (UID: \"bc41451c-7b4e-4103-aba8-d443a8c1a3ef\") : failed to sync configmap cache: timed out waiting for the condition"
I0911 10:23:54.522787 2613006 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"config-volume\" (UniqueName: \"kubernetes.io/projected/f98d064c-913f-4cce-a3c6-c62357ea29ce-config-volume\") pod \"botkube-c9bcbb9df-2hp8f\" (UID: \"f98d064c-913f-4cce-a3c6-c62357ea29ce\") "
I0911 10:23:54.522800 2613006 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"istio-envoy\" (UniqueName: \"kubernetes.io/empty-dir/88b1d52f-f58c-437e-bef6-04216db27b0d-istio-envoy\") pod \"jonxu-0\" (UID: \"88b1d52f-f58c-437e-bef6-04216db27b0d\") "
c
the kubelets are failing to create the mirror pods for the static pods, but I don’t know why. If you can upload the full logs I might be able to tell why.
It might also be interesting to see the output of:
Copy code
for SERVER in $(kubectl get endpoints kubernetes --no-headers | awk '{print $2}' | xargs -n1 -d,); do kubectl -s https://$SERVER get leases -A; done
just saw your DM. The output of the above command would also be interesting.
m
Copy code
NAMESPACE          NAME                                                                                                            HOLDER                                                                         AGE
gpu-operator       53822513.nvidia.com                                                                                              gpu-operator-794b8c8ddc-lg87z_f6edc1cf-8a2c-4a87-ae44-43dd37c8d2b1             142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.apiserversource.reconciler.00-of-01                              eventing-controller-79895f9c56-jc6b9_5bb67dc6-0fcc-4195-a83d-20a70c63e033      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.channel.reconciler.00-of-01                                      eventing-controller-79895f9c56-jc6b9_150deb66-ceff-4505-9ebf-f089e9616305      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.containersource.reconciler.00-of-01                              eventing-controller-79895f9c56-jc6b9_7544a369-6d4e-4b0e-a7dc-ca1230d42e44      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.eventtype.reconciler.00-of-01                                    eventing-controller-79895f9c56-jc6b9_c9093abe-2805-423b-bd1c-eca0ba861dad      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.parallel.reconciler.00-of-01                                     eventing-controller-79895f9c56-jc6b9_053baeb4-18dc-417b-b0ca-636268998057      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01                                   eventing-controller-79895f9c56-jc6b9_f7b335dd-fe91-498c-8253-0d72448469b6      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.sequence.reconciler.00-of-01                                     eventing-controller-79895f9c56-jc6b9_d86ec61e-9fef-4493-9b7e-6eea8839273f      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.source.crd.reconciler.00-of-01                                   eventing-controller-79895f9c56-jc6b9_ca3a2f01-927f-44a7-99ee-02cf2755c767      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.subscription.reconciler.00-of-01                                 eventing-controller-79895f9c56-jc6b9_cfd2ca13-3d95-4dd7-b297-c44812c37b68      142d
knative-eventing   eventing-webhook.configmapwebhook.00-of-01                                                                      eventing-webhook-78f897666-pb6lt_5847b453-0449-40a2-9a88-eccd7caad2e8          142d
knative-eventing   eventing-webhook.conversionwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_e6a01bb0-67ba-439e-9ce5-e65eb75760bd          142d
knative-eventing   eventing-webhook.defaultingwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_a1528982-16b7-481a-8271-889744eb63fa          142d
knative-eventing   eventing-webhook.sinkbindings.00-of-01                                                                          eventing-webhook-78f897666-pb6lt_f18ca410-7fab-4b96-8e97-ad6571b2c429          142d
knative-eventing   eventing-webhook.sinkbindings.webhook.sources.knative.dev.00-of-01                                              eventing-webhook-78f897666-pb6lt_170aabc7-10ed-4086-9599-da821157fae2          142d
knative-eventing   eventing-webhook.validationwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_04a9bdc5-32ed-4748-99f6-c570bd2262d4          142d
knative-eventing   eventing-webhook.webhookcertificates.00-of-01                                                                   eventing-webhook-78f897666-pb6lt_08571725-cebb-433b-8719-c2b74d20d86e          142d
knative-eventing   inmemorychannel-controller.knative.dev.eventing.pkg.reconciler.inmemorychannel.controller.reconciler.00-of-01   imc-controller-688df5bdb4-h8c5g_899bb490-aa39-41df-a83b-44059cf0e6ac           142d
knative-eventing   inmemorychannel-dispatcher.knative.dev.eventing.pkg.reconciler.inmemorychannel.dispatcher.reconciler.00-of-01   imc-dispatcher-646978d797-7fgb7_89549517-dac5-4167-b47c-8662bfe48e17           142d
knative-eventing   mt-broker-controller.knative.dev.eventing.pkg.reconciler.broker.reconciler.00-of-01                             mt-broker-controller-67c977497-959ft_a7ad5cf3-21e7-4522-97ba-1680a15a602e      142d
knative-eventing   mt-broker-controller.knative.dev.eventing.pkg.reconciler.broker.trigger.reconciler.00-of-01                     mt-broker-controller-67c977497-959ft_4c80242a-6d14-4cca-8b35-78333b51e76d      142d
knative-serving    autoscaler-bucket-00-of-01                                                                                      autoscaler-5c648f7465-mb8dj_10.42.10.66                                        142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.configuration.reconciler.00-of-01                                 controller-57c545cbfb-rfts5_a85f2981-49fd-4184-9421-64616116ed23               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.gc.reconciler.00-of-01                                            controller-57c545cbfb-rfts5_6f73ae4e-33e6-4a81-97e9-291f906f3688               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.labeler.reconciler.00-of-01                                       controller-57c545cbfb-rfts5_efc28882-968c-49ea-ba49-5a7eb91f0898               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.revision.reconciler.00-of-01                                      controller-57c545cbfb-rfts5_cd36830e-d6ed-488a-baf4-3aa92bdd6ea3               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.route.reconciler.00-of-01                                         controller-57c545cbfb-rfts5_2deb9296-1663-4798-9acb-b84785fdd0c8               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.serverlessservice.reconciler.00-of-01                             controller-57c545cbfb-rfts5_b935f998-cac7-48fe-bb32-24282c071cc2               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.service.reconciler.00-of-01                                       controller-57c545cbfb-rfts5_bfdacf4a-8c39-4d66-87a8-984a373d7894               142d
knative-serving    istio-webhook.configmapwebhook.00-of-01                                                                         istio-webhook-578b6b7654-lwhq6_f5aa2b57-939d-4382-a85b-91d6bee45758            142d
knative-serving    istio-webhook.defaultingwebhook.00-of-01                                                                        istio-webhook-578b6b7654-lwhq6_28e2d5d1-f4bf-42ff-961a-c1644d2830a4            142d
knative-serving    istio-webhook.webhookcertificates.00-of-01                                                                      istio-webhook-578b6b7654-lwhq6_03a80682-4109-490a-914e-db8a07f87df6            142d
knative-serving    istiocontroller.knative.dev.net-istio.pkg.reconciler.ingress.reconciler.00-of-01                                networking-istio-6b88f745c-kgws5_cbd3c902-eabf-45d0-864b-2ad23567fe3f          142d
knative-serving    istiocontroller.knative.dev.net-istio.pkg.reconciler.serverlessservice.reconciler.00-of-01                      networking-istio-6b88f745c-kgws5_3c8b5a99-81e9-407f-90b0-de78eb68580b          142d
knative-serving    webhook.configmapwebhook.00-of-01                                                                               webhook-6fffdc4d78-ftj79_2e7bc2df-7856-4187-a4e9-ed7542c52e88                  142d
knative-serving    webhook.defaultingwebhook.00-of-01                                                                              webhook-6fffdc4d78-ftj79_0ac3b9c9-11cb-4be0-9d03-3b55c3eb4e17                  142d
knative-serving    webhook.validationwebhook.00-of-01                                                                              webhook-6fffdc4d78-ftj79_2ba978d8-7ba6-42d9-a122-df7581c40aac                  142d
knative-serving    webhook.webhookcertificates.00-of-01                                                                            webhook-6fffdc4d78-ftj79_f834637d-d6c2-460a-bf5c-6c801d201069                  142d
Copy code
kube-node-lease    sv-agent3                                                                                                            sv-agent3                                                                           94d
kube-node-lease    sv-agent4                                                                                                            sv-agent4                                                                           94d
kube-node-lease    sv-agent5                                                                                                            sv-agent5                                                                           94d
kube-node-lease    sv-server                                                                                                            sv-server                                                                           143d
kube-node-lease    sv-agent6                                                                                                            sv-agent6                                                                           94d
kube-node-lease    sv-agent1                                                                                                             sv-agent1                                                                            94d
kube-node-lease    sv-agent2                                                                                                             sv-agent2                                                                            94d
kube-system        cert-manager-cainjector-leader-election                                                                         cert-manager-cainjector-5bdc6f956-kbq4q_16708ed7-8bec-43ea-b3ab-162c852bfe09   142d
kube-system        cert-manager-controller                                                                                         cert-manager-7d8cf77cc9-99nnp-external-cert-manager-controller                 142d
kube-system        cilium-operator-resource-lock                                                                                   sv-server-ioxXPUyruo                                                                143d
kube-system        cloud-controller-manager                                                                                        sv-server_7e20af5e-54f3-4263-b673-40155be06df5                                      143d
kube-system        kube-controller-manager                                                                                         sv-server_47875a89-2eb4-4733-ac7f-e0b2040b5408                                      143d
kube-system        kube-scheduler                                                                                                  sv-server_c74db847-a314-4283-a8c6-c6b1aab00a72                                      143d
kubeflow           workflow-controller                                                                                             workflow-controller-5cb67bb9db-zr5vv                                           142d
longhorn-system    driver-longhorn-io                                                                                              csi-provisioner-77b7fb5549-w57v5                                               102d
longhorn-system    external-attacher-leader-driver-longhorn-io                                                                     csi-attacher-6688cff467-nfw6b                                                  102d
longhorn-system    external-resizer-driver-longhorn-io                                                                             csi-resizer-58f5bb8799-4c9xn                                                   102d
longhorn-system    external-snapshotter-leader-driver-longhorn-io                                                                  csi-snapshotter-6d4f88d689-r9864                                               102d
c
OK.. I take it from that output that you have only a single server named sv-server? so it’s not a problem with etcd… this is weird, I’ve never seen the kubelet fail to create mirror pods for the control-plane static pods.
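For context, a minimal way to check the static-pod side on the server (paths are the RKE2 defaults; crictl is the copy bundled with RKE2):
Copy code
ls /var/lib/rancher/rke2/agent/pod-manifests/
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps \
  | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'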
m
yes, it’s sv-server