# rke2
m
@creamy-pencil-82913, for the above situation, if I bring up a new server with a different IP and node name, can I follow https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-new-nodes to bring back the cluster? Currently, after moving the existing single-node master from 10.21.100.19 to 172.18.14.15, it is not starting services and fails with the following error.
Copy code
msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7ba99b6a=<https://10.21.100.19:2380>], expect: svmaster-7ba99b6a=172.18.14.15"
c
If it’s a single-server cluster, you need to do a --cluster-reset if you change the server’s IP. If you change the IP address of a server in a multi-server cluster, you should delete that node from the cluster and then rejoin it. Node IPs are expected to be static for the life of the node in the cluster.
m
it’s actually a 6-node cluster with 1 master + 5 workers. By single node, I meant single master.
c
RKE2 doesn’t have masters or workers, it has servers and agents. What you’re describing is single-server.
m
yes 1 server and 5 agents
so I need to:
• run “--cluster-reset” on the server node
• set advertise-address: to the new IP in /etc/rancher/rke2/config.yaml on the server node, and start the service
• update the config file of the agent nodes to point at the new “IP address of server node”
Am I right? (rough sketch of those steps below)
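A minimal sketch of those steps, assuming the default rke2-server/rke2-agent systemd units and the standard supervisor port 9345 (the agent-side setting is the server: key in config.yaml):
Copy code
# on the server node
systemctl stop rke2-server
rke2 server --cluster-reset    # re-forms etcd as a single-member cluster on this node
systemctl start rke2-server

# on each agent node: point config.yaml at the server's new IP, then restart
# /etc/rancher/rke2/config.yaml
#   server: https://172.18.14.15:9345
systemctl restart rke2-agent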
c
you shouldn’t need to set the advertise address, unless it’s detecting the wrong IP for some reason
Stop the service, run
rke2 server --cluster-reset
and wait for it to finish, then start the service again.
m
noticing many errors in the process after executing
rke2 server --cluster-reset
Copy code
WARN[0305] Unable to watch for tunnel endpoints: Get "https://127.0.0.1:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=0&watch=true": dial tcp 127.0.0.1:6443: connect: connection refused
INFO[0305] Waiting for kubelet to be ready on node sv13: Get "https://127.0.0.1:6443/api/v1/nodes/svmaster": dial tcp 127.0.0.1:6443: connect: connection refused
INFO[0310] Waiting for API server to become available
INFO[0310] Waiting for etcd server to become available
INFO[0310] Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7d4b12a0=https://10.21.100.19:2380], expect: svmaster-7d4b12a0=172.18.14.15
INFO[0310] Cluster-Http-Server 2022/09/03 01:42:09 http: TLS handshake error from 127.0.0.1:42616: remote error: tls: bad certificate
INFO[0310] Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error
It’s an air-gapped environment, so there is no internet connectivity for now.
Looks like the following message is repeating. As you mentioned, I have removed “advertise-address:” from /etc/rancher/rke2/config.yaml
Copy code
INFO[0310] Failed to test data store connection: this server is a not a member of the etcd cluster. Found [svmaster-7d4b12a0=https://10.21.100.19:2380], expect: svmaster-7d4b12a0=172.18.14.15
c
does the host have multiple interfaces? Did you want to use --node-ip instead of --advertise-address?
m
Copy code
IPv4 address for cilium_host:     10.42.0.111
IPv4 address for eno1:            172.18.14.15
IPv4 address for enxb03af2b6059f: 169.254.3.1
I can see the above IPs on the node
Copy code
eno1:            172.18.14.15
is the only physical network
Do I need to set “advertise-address:” back in /etc/rancher/rke2/config.yaml with “172.18.14.15”? Earlier I had set it to the previous IP.
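For reference, a minimal config.yaml sketch using the node-ip option mentioned above to pin RKE2 to the physical interface (whether node-ip, advertise-address, or neither is actually needed here is still an open question at this point):
Copy code
# /etc/rancher/rke2/config.yaml on the server (sketch only)
node-ip: 172.18.14.15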
c
Did the
rke2 server --cluster-reset
complete?
it looks like it’s still using the old cluster config
c
try running
rke2-killall.sh
first
if it still doesn’t reset successfully, post the full log, not just the bit that it’s repeating
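One way to capture the full output, as a sketch (assuming the reset is run in the foreground and the service history is in journald):
Copy code
rke2 server --cluster-reset 2>&1 | tee cluster-reset.log
journalctl -u rke2-server --no-pager > rke2-server.log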
m
sure got it
it failed with the attached error
I am using Longhorn for storage; it may have something to do with https://github.com/kubernetes/kubernetes/issues/105536
c
hmm, indeed. it appears to be stuck trying to clean up the longhorn mount points
would you mind opening an issue on GH, and attach that log?
m
sure, no problem. For now I can delete those dirs manually, right?
c
yeah… honestly the killall should have cleaned those up, but the kubelet tries to also do it, and it seems to be stuck trying to talk to longhorn which obviously isn’t running since none of the pods are up.
you might delete that longhorn socket from the kubelet plugin dir, see if that helps
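A sketch of that cleanup, using the Longhorn CSI socket paths that show up in the kubelet logs further down (the registration socket is included as an assumption; adjust if the install differs):
Copy code
rm /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
rm /var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock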
m
oh, got it. let me try that
no luck. I deleted longhorn from kubelet/plugins and ran rke2-killall.sh, but those pods are still in
/var/lib/kubelet/pods
c
yeah, but does the kubelet still hang trying to talk to that socket when you do the cluster-reset?
or is it hanging on something else
m
still facing the below errors. I have removed all dirs under
/var/lib/kubelet/pods/
manually.
Copy code
E0903 02:27:59.425466  235868 kubelet_volumes.go:225] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"6525b999-6ed3-40dd-87e6-674063559a95\" found, but failed to rmdir() volume at path /var/lib/kubelet/pods/6525b999-6ed3-40dd-87e6-674063559a95/volumes/kubernetes.io~configmap/istiod-ca-cert: directory not empty" numErrs=13
c
try doing
rm -rf /var/lib/kubelet/pods/*/volumes/kubernetes.io*
I think it’s still related to the longhorn volumes, this is not normally a problem
m
finally received the below message
Copy code
FATA[0100] starting kubernetes: preparing server: start managed database: cluster-reset was successfully performed, please remove the cluster-reset flag and start rke2 normally, if you need to perform another cluster reset, you must first manually delete the /var/lib/rancher/rke2/server/db/reset-flag file
In between, I faced an error due to the removal of longhorn from kubelet/plugins, so I restored it.
While trying to start rke2-server through systemctl, the following log is observed.
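For reference, a sketch of what that FATA message asks for:
Copy code
# start normally, without the --cluster-reset flag
systemctl start rke2-server
# only needed if another cluster reset has to be performed later:
# rm /var/lib/rancher/rke2/server/db/reset-flag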
c
hmm, so it still has the old IP. Do you have that IP listed anywhere in the config file?
I will also say that
v1.21.5+rke2r2
is quite old at this point. We’ve fixed a lot of stuff in the cluster reset; I think correcting the node IP during the reset was one of the fixes
You should go to the latest 1.21 release at the very least, if not up to 1.23 or newer since 1.21 is EOL and 1.22 is 1 month from EOL
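In an air-gapped setup, moving to a newer release is roughly a matter of staging the newer artifacts and re-running the installer; a sketch along the lines of the RKE2 tarball air-gap method (the staging directory is an assumption):
Copy code
# newer rke2 tarball, images archive, and checksum file copied to /root/rke2-artifacts
INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts sh install.sh
systemctl restart rke2-server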
m
Stuck on 1.21 because of Kubeflow :(
So is it good to install the latest 1.21 on another node and try the restore per https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-new-nodes?
c
yeah, or just upgrade that one, either should work
m
The earlier IP was added in the config file as “advertise-address”; now it’s removed.
c
Kubeflow doesn’t support any non-EOL versions of Kubernetes!?
m
Yeah… that’s the latest Kubeflow :(
c
that’s insane
m
While trying to install 1.21 on a new node and restore, the following error was observed and the restore did not succeed.
FATA[0010] starting kubernetes: preparing server: start managed database: etcd: snapshot path does not exist: on-demand-svmaster-1661871603.zip
c
the file doesn’t exist; are you in the correct directory or did you need to pass the absolute path to the file?
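A sketch of the restore invocation with an absolute path (the directory below is the RKE2 default snapshot location and may not be where the copied snapshot actually lives):
Copy code
rke2 server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/on-demand-svmaster-1661871603.zip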
m
sorry, that’s my bad.. I gave the correct path and it looks like the cluster restored
I was able to bring up the cluster following https://docs.rke2.io/backup_restore/#restoring-a-snapshot-to-existing-nodes, and updated the server IPs in
/var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json
and
/var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json
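A rough sketch of the shape of those load-balancer state files (the ServerURL/ServerAddresses field names are an assumption and may differ by version; the address is a placeholder for the new server IP):
Copy code
{
  "ServerURL": "https://<new-server-ip>:9345",
  "ServerAddresses": ["<new-server-ip>:9345"]
}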
The cluster with 1 server + 5 agents was up and running, but I’m facing trouble with 1 agent not joining, with the following error.
Copy code
Sep 10 22:36:26 sv-agent systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Sep 10 22:36:26 sv-agent sh[4038]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Sep 10 22:36:26 sv-agent sh[4039]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=warning msg="not running in CIS mode"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Starting rke2 agent v1.21.5+rke2r2 (9e4acdc6018ae74c36523c99af25ab861f3884da)"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Running load balancer 127.0.0.1:6444 -> [192.168.x.x:9345]"
Sep 10 22:36:26 sv-agent rke2[4042]: time="2022-09-10T22:36:26+09:00" level=info msg="Running load balancer 127.0.0.1:6443 -> [192.168.x.x:6443]"
Sep 10 22:36:36 sv-agent rke2[4042]: time="2022-09-10T22:36:36+09:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Sep 10 22:36:46 sv-agent rke2[4042]: time="2022-09-10T22:36:46+09:00" level=info msg="Waiting to retrieve agent configuration; server is not ready: Get \"https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Any recommended solution to resolve this?
Noticed the following error on the server
Copy code
Sep 10 22:49:06 sv-master rke2[605992]: time="2022-09-10T22:49:06+09:00" level=info msg="Cluster-Http-Server 2022/09/10 22:49:06 http: TLS handshake error from 192.168.x.x:54796: remote error: tls: bad certificate"
Sep 10 22:49:07 sv-master rke2[605992]: time="2022-09-10T22:49:07+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": dial tcp 10.43.122.217:443: i/o timeout"
Sep 10 22:49:17 sv-master rke2[605992]: time="2022-09-10T22:49:17+09:00" level=info msg="Cluster-Http-Server 2022/09/10 22:49:17 http: TLS handshake error from 192.168.x.x:54810: remote error: tls: bad certificate"
Sep 10 22:49:17 sv-master rke2[605992]: time="2022-09-10T22:49:17+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": context deadline exceeded"
Sep 10 22:49:27 sv-master rke2[605992]: time="2022-09-10T22:49:27+09:00" level=error msg="Internal error occurred: failed calling webhook \"rancher.cattle.io\": Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation?timeout=10s\": dial tcp 10.43.122.217:443: i/o timeout"
c
Did you change the server address on that agent? It looks like it’s not able to connect to the server…
m
yes, config.yaml is the same as on the other agents. The server IP is updated to the new IP.
Can we ignore the TLS bad certificate errors in the server-side log?
c
yes, you’ll see that every time an agent connects as it checks to see if the cert can be validated with the OS CA bundle.
can you successfully curl the server address from the node?
m
Copy code
curl https://192.168.11.103:9345 -kv
*   Trying 192.168.11.103:9345...
* TCP_NODELAY set
* connect to 192.168.11.103 port 9345 failed: Connection refused
* Failed to connect to 192.168.11.103 port 9345: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 192.168.11.103 port 9345: Connection refused
@creamy-pencil-82913 could you please share if there is any particular curl command I should use to verify?
Tried deleting it from the cluster with
kubectl delete node agent
and added it again.. it worked..
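Roughly the flow used here, as a sketch (the node name is a placeholder; restarting the agent forces it to re-register, assuming its config.yaml already points at the correct server):
Copy code
kubectl delete node <agent-node-name>
# then on that agent:
systemctl restart rke2-agent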
As I mentioned earlier it’s a single-server cluster, and I couldn’t find the kube-apiserver/etcd/scheduler pods in the kube-system namespace. Facing the below error in
kubectl get events
Copy code
'(combined from similar events): Unable to attach or mount volumes: unmounted
  volumes=[jenkins-home], unattached volumes=[kube-api-access-r5rgj sc-config-volume
  admin-secret jenkins-config plugins plugin-dir jenkins-home tmp-volume jenkins-cache]:
  timed out waiting for the condition'
c
not sure what you mean you can’t find the pods?
m
only the following pods are in kube-system
Copy code
kubectl get pods -n kube-system
NAME                                                    READY   STATUS      RESTARTS   AGE
cilium-785bs                                            1/1     Running     0          112m
cilium-8t96j                                            1/1     Running     1          142d
cilium-fnwhw                                            1/1     Running     4          84d
cilium-hqt85                                            1/1     Running     1          94d
cilium-jz5l7                                            1/1     Running     1          84d
cilium-l56kv                                            1/1     Running     3          84d
cilium-node-init-4sdnt                                  1/1     Running     0          84d
cilium-node-init-8kccp                                  1/1     Running     0          94d
cilium-node-init-9dgwz                                  1/1     Running     1          142d
cilium-node-init-bt8w7                                  1/1     Running     0          84d
cilium-node-init-h5zp4                                  1/1     Running     0          112m
cilium-node-init-pzkh5                                  1/1     Running     0          94d
cilium-node-init-r9wr9                                  1/1     Running     0          84d
cilium-operator-85f67b5cb7-w676b                        1/1     Running     8          154m
cilium-operator-85f67b5cb7-wwlcr                        1/1     Running     1          142d
cilium-q5t5d                                            1/1     Running     3          84d
external-dns-dc9dd7d74-h6dqw                            1/1     Running     0          89d
helm-install-rke2-cilium-2rc7l                          0/1     Completed   166        14h
helm-install-rke2-coredns-pkk8n                         0/1     Completed   165        14h
helm-install-rke2-ingress-nginx-cg5g2                   0/1     Completed   164        14h
helm-install-rke2-metrics-server-c2cdz                  0/1     Completed   164        14h
kube-proxy-svagent6                                     1/1     Running     2          14h
kube-proxy-svagent5                                     1/1     Running     2          14h
kube-proxy-svagent4                                     1/1     Running     1          14h
kube-proxy-svagent3                                     1/1     Running     2          14h
kube-proxy-svagent1                                     1/1     Running     2          14h
kube-proxy-svagent2                                     1/1     Running     1          112m
metrics-server-8bbfb4bdb-x78tm                          1/1     Running     4          73d
rke2-coredns-rke2-coredns-78d6d5c574-2wcrb              1/1     Running     5          73d
rke2-coredns-rke2-coredns-78d6d5c574-wnvlm              1/1     Running     0          81d
rke2-coredns-rke2-coredns-autoscaler-7c58bd5b6c-25wxc   1/1     Running     5          73d
rke2-ingress-nginx-controller-5xzq5                     1/1     Running     1          84d
rke2-ingress-nginx-controller-7d5d2                     1/1     Running     0          94d
rke2-ingress-nginx-controller-8j6jl                     1/1     Running     1          94d
rke2-ingress-nginx-controller-9sgqb                     1/1     Running     0          112m
rke2-ingress-nginx-controller-cj75v                     1/1     Running     1          84d
rke2-ingress-nginx-controller-lt64l                     1/1     Running     1          89d
rke2-ingress-nginx-controller-rthjm                     1/1     Running     1          84d
rke2-metrics-server-5df7d77b5b-b4qlw                    1/1     Running     10         73d
c
I can’t say I’ve ever seen that. What do you see from the kubelet logs (
/var/lib/rancher/rke2/agent/logs/kubelet.log
) on the servers?
m
Copy code
I0911 10:08:17.644983 2559675 kubelet.go:1846] "Starting kubelet main sync loop"
E0911 10:08:17.645017 2559675 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
I0911 10:08:17.717783 2559675 kuberuntime_manager.go:1044] "Updating runtime config through cri with podcidr" CIDR="10.42.0.0/24"
I0911 10:08:17.718054 2559675 kubelet_network.go:76] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="10.42.0.0/24"
I0911 10:08:17.719543 2559675 kubelet_node_status.go:71] "Attempting to register node" node="sv-server"
I0911 10:08:17.727884 2559675 kubelet_node_status.go:109] "Node was previously registered" node="sv-server"
I0911 10:08:17.727927 2559675 kubelet_node_status.go:74] "Successfully registered node" node="sv-server"
I0911 10:08:17.730628 2559675 setters.go:577] "Node became not ready" node="sv-server" condition={Type:Ready Status:False LastHeartbeatTime:2022-09-11 10:08:17.730606108 +0900 JST m=+5.229502926 LastTransitionTime:2022-09-11 10:08:17.730606108 +0900 JST m=+5.229502926 Reason:KubeletNotReady Message:container runtime status check may not have completed yet}
E0911 10:08:17.745759 2559675 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
I0911 10:08:17.838507 2559675 cpu_manager.go:199] "Starting CPU manager" policy="none"
I0911 10:08:17.838514 2559675 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
I0911 10:08:17.838522 2559675 state_mem.go:36] "Initialized new in-memory state store"
I0911 10:08:17.838607 2559675 state_mem.go:88] "Updated default CPUSet" cpuSet=""
I0911 10:08:17.838613 2559675 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
I0911 10:08:17.838616 2559675 policy_none.go:44] "None policy: Start"
I0911 10:08:17.842140 2559675 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
I0911 10:08:17.842227 2559675 operation_generator.go:181] parsed scheme: ""
I0911 10:08:17.842232 2559675 operation_generator.go:181] scheme "" not registered, fallback to default scheme
I0911 10:08:17.842246 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.842250 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:08:17.846483 2559675 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock versions: 1.0.0
I0911 10:08:17.846504 2559675 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
I0911 10:08:17.846527 2559675 clientconn.go:106] parsed scheme: ""
I0911 10:08:17.846530 2559675 clientconn.go:106] scheme "" not registered, fallback to default scheme
I0911 10:08:17.846556 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.846561 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:08:17.846575 2559675 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
I0911 10:08:17.846998 2559675 manager.go:414] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
I0911 10:08:17.847051 2559675 endpoint.go:196] parsed scheme: ""
I0911 10:08:17.847056 2559675 endpoint.go:196] scheme "" not registered, fallback to default scheme
I0911 10:08:17.847067 2559675 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia-gpu.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:08:17.847071 2559675 clientconn.go:948] ClientConn switching balancer to "pick_first"
Copy code
I0911 10:23:46.343334 2613006 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
I0911 10:23:46.343408 2613006 operation_generator.go:181] parsed scheme: ""
I0911 10:23:46.343413 2613006 operation_generator.go:181] scheme "" not registered, fallback to default scheme
I0911 10:23:46.343435 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/driver.longhorn.io-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.343441 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.344368 2613006 manager.go:414] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
I0911 10:23:46.344442 2613006 endpoint.go:196] parsed scheme: ""
I0911 10:23:46.344448 2613006 endpoint.go:196] scheme "" not registered, fallback to default scheme
I0911 10:23:46.344463 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia-gpu.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.344469 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.345122 2613006 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: driver.longhorn.io endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock versions: 1.0.0
I0911 10:23:46.345222 2613006 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: driver.longhorn.io at endpoint: /var/lib/kubelet/plugins/driver.longhorn.io/csi.sock
I0911 10:23:46.345274 2613006 clientconn.go:106] parsed scheme: ""
I0911 10:23:46.345280 2613006 clientconn.go:106] scheme "" not registered, fallback to default scheme
I0911 10:23:46.345323 2613006 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/driver.longhorn.io/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
I0911 10:23:46.345328 2613006 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0911 10:23:46.345352 2613006 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
I0911 10:23:46.450868 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.450972 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451007 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451028 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451046 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451070 2613006 topology_manager.go:187] "Topology Admit Handler"
I0911 10:23:46.451222 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="634a15c871a3e54d0199a15d05f9c72ff5ba31404ea65c8d912dcf06e333895d"
I0911 10:23:46.451231 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="7fce89225ab4c291f8e1b87d8c658dc3088c996e9d3fc7f57b733e283e298521"
I0911 10:23:46.451337 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="325983142c047ff495a362f6707395225b140b31d84dc6506ade1655a288de64"
I0911 10:23:46.451359 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="ee3b6083a3da1d86a6cff65a343eb706db097a7d72d935335c7d6523800733c7"
I0911 10:23:46.451390 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="3d9b0508b979ba80fc6d8dd4b494bcac6c1b8fe408118b22809e2c63c50a11f6"
I0911 10:23:46.451420 2613006 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="42cf05994f2f66910cf03179d65eaf1d075085db9da75ef65cb0fdeb31c5aef8"
c
Can you attach the complete kubelet log, as well as the rke2 journald logs?
m
Copy code
I0911 10:23:48.520467 2613006 request.go:668] Waited for 1.194376708s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%!D(MISSING)hubble-server-certs&limit=500&resourceVersion=0
E0911 10:23:48.571164 2613006 configmap.go:200] Couldn't get configMap cattle-monitoring-system/rancher-monitoring-prometheus-adapter: failed to sync configmap cache: timed out waiting for the condition
E0911 10:23:48.571230 2613006 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config podName:bc41451c-7b4e-4103-aba8-d443a8c1a3ef nodeName:}" failed. No retries permitted until 2022-09-11 10:23:49.071212607 +0900 JST m=+8.058691933 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"config\" (UniqueName: \"kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config\") pod \"rancher-monitoring-prometheus-adapter-8846d4757-xmxc6\" (UID: \"bc41451c-7b4e-4103-aba8-d443a8c1a3ef\") : failed to sync configmap cache: timed out waiting for the condition"
E0911 10:23:50.082189 2613006 configmap.go:200] Couldn't get configMap cattle-monitoring-system/rancher-monitoring-prometheus-adapter: failed to sync configmap cache: timed out waiting for the condition
E0911 10:23:50.082248 2613006 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config podName:bc41451c-7b4e-4103-aba8-d443a8c1a3ef nodeName:}" failed. No retries permitted until 2022-09-11 10:23:51.082231147 +0900 JST m=+10.069710473 (durationBeforeRetry 1s). Error: "MountVolume.SetUp failed for volume \"config\" (UniqueName: \"kubernetes.io/configmap/bc41451c-7b4e-4103-aba8-d443a8c1a3ef-config\") pod \"rancher-monitoring-prometheus-adapter-8846d4757-xmxc6\" (UID: \"bc41451c-7b4e-4103-aba8-d443a8c1a3ef\") : failed to sync configmap cache: timed out waiting for the condition"
I0911 10:23:54.522787 2613006 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"config-volume\" (UniqueName: \"kubernetes.io/projected/f98d064c-913f-4cce-a3c6-c62357ea29ce-config-volume\") pod \"botkube-c9bcbb9df-2hp8f\" (UID: \"f98d064c-913f-4cce-a3c6-c62357ea29ce\") "
I0911 10:23:54.522800 2613006 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"istio-envoy\" (UniqueName: \"kubernetes.io/empty-dir/88b1d52f-f58c-437e-bef6-04216db27b0d-istio-envoy\") pod \"jonxu-0\" (UID: \"88b1d52f-f58c-437e-bef6-04216db27b0d\") "
c
the kubelets are failing to create the mirror pods for the static pods, but I don’t know why. If you can upload the full logs I might be able to tell why.
It might also be interesting to see the output of:
Copy code
for SERVER in $(kubectl get endpoints kubernetes --no-headers | awk '{print $2}' | xargs -n1 -d,); do kubectl -s https://$SERVER get leases -A; done
just saw your DM. The output of the above command would also be interesting.
m
Copy code
NAMESPACE          NAME                                                                                                            HOLDER                                                                         AGE
gpu-operator       53822513.nvidia.com                                                                                              gpu-operator-794b8c8ddc-lg87z_f6edc1cf-8a2c-4a87-ae44-43dd37c8d2b1             142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.apiserversource.reconciler.00-of-01                              eventing-controller-79895f9c56-jc6b9_5bb67dc6-0fcc-4195-a83d-20a70c63e033      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.channel.reconciler.00-of-01                                      eventing-controller-79895f9c56-jc6b9_150deb66-ceff-4505-9ebf-f089e9616305      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.containersource.reconciler.00-of-01                              eventing-controller-79895f9c56-jc6b9_7544a369-6d4e-4b0e-a7dc-ca1230d42e44      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.eventtype.reconciler.00-of-01                                    eventing-controller-79895f9c56-jc6b9_c9093abe-2805-423b-bd1c-eca0ba861dad      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.parallel.reconciler.00-of-01                                     eventing-controller-79895f9c56-jc6b9_053baeb4-18dc-417b-b0ca-636268998057      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01                                   eventing-controller-79895f9c56-jc6b9_f7b335dd-fe91-498c-8253-0d72448469b6      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.sequence.reconciler.00-of-01                                     eventing-controller-79895f9c56-jc6b9_d86ec61e-9fef-4493-9b7e-6eea8839273f      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.source.crd.reconciler.00-of-01                                   eventing-controller-79895f9c56-jc6b9_ca3a2f01-927f-44a7-99ee-02cf2755c767      142d
knative-eventing   controller.knative.dev.eventing.pkg.reconciler.subscription.reconciler.00-of-01                                 eventing-controller-79895f9c56-jc6b9_cfd2ca13-3d95-4dd7-b297-c44812c37b68      142d
knative-eventing   eventing-webhook.configmapwebhook.00-of-01                                                                      eventing-webhook-78f897666-pb6lt_5847b453-0449-40a2-9a88-eccd7caad2e8          142d
knative-eventing   eventing-webhook.conversionwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_e6a01bb0-67ba-439e-9ce5-e65eb75760bd          142d
knative-eventing   eventing-webhook.defaultingwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_a1528982-16b7-481a-8271-889744eb63fa          142d
knative-eventing   eventing-webhook.sinkbindings.00-of-01                                                                          eventing-webhook-78f897666-pb6lt_f18ca410-7fab-4b96-8e97-ad6571b2c429          142d
knative-eventing   eventing-webhook.sinkbindings.webhook.sources.knative.dev.00-of-01                                              eventing-webhook-78f897666-pb6lt_170aabc7-10ed-4086-9599-da821157fae2          142d
knative-eventing   eventing-webhook.validationwebhook.00-of-01                                                                     eventing-webhook-78f897666-pb6lt_04a9bdc5-32ed-4748-99f6-c570bd2262d4          142d
knative-eventing   eventing-webhook.webhookcertificates.00-of-01                                                                   eventing-webhook-78f897666-pb6lt_08571725-cebb-433b-8719-c2b74d20d86e          142d
knative-eventing   inmemorychannel-controller.knative.dev.eventing.pkg.reconciler.inmemorychannel.controller.reconciler.00-of-01   imc-controller-688df5bdb4-h8c5g_899bb490-aa39-41df-a83b-44059cf0e6ac           142d
knative-eventing   inmemorychannel-dispatcher.knative.dev.eventing.pkg.reconciler.inmemorychannel.dispatcher.reconciler.00-of-01   imc-dispatcher-646978d797-7fgb7_89549517-dac5-4167-b47c-8662bfe48e17           142d
knative-eventing   mt-broker-controller.knative.dev.eventing.pkg.reconciler.broker.reconciler.00-of-01                             mt-broker-controller-67c977497-959ft_a7ad5cf3-21e7-4522-97ba-1680a15a602e      142d
knative-eventing   mt-broker-controller.knative.dev.eventing.pkg.reconciler.broker.trigger.reconciler.00-of-01                     mt-broker-controller-67c977497-959ft_4c80242a-6d14-4cca-8b35-78333b51e76d      142d
knative-serving    autoscaler-bucket-00-of-01                                                                                      autoscaler-5c648f7465-mb8dj_10.42.10.66                                        142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.configuration.reconciler.00-of-01                                 controller-57c545cbfb-rfts5_a85f2981-49fd-4184-9421-64616116ed23               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.gc.reconciler.00-of-01                                            controller-57c545cbfb-rfts5_6f73ae4e-33e6-4a81-97e9-291f906f3688               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.labeler.reconciler.00-of-01                                       controller-57c545cbfb-rfts5_efc28882-968c-49ea-ba49-5a7eb91f0898               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.revision.reconciler.00-of-01                                      controller-57c545cbfb-rfts5_cd36830e-d6ed-488a-baf4-3aa92bdd6ea3               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.route.reconciler.00-of-01                                         controller-57c545cbfb-rfts5_2deb9296-1663-4798-9acb-b84785fdd0c8               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.serverlessservice.reconciler.00-of-01                             controller-57c545cbfb-rfts5_b935f998-cac7-48fe-bb32-24282c071cc2               142d
knative-serving    controller.knative.dev.serving.pkg.reconciler.service.reconciler.00-of-01                                       controller-57c545cbfb-rfts5_bfdacf4a-8c39-4d66-87a8-984a373d7894               142d
knative-serving    istio-webhook.configmapwebhook.00-of-01                                                                         istio-webhook-578b6b7654-lwhq6_f5aa2b57-939d-4382-a85b-91d6bee45758            142d
knative-serving    istio-webhook.defaultingwebhook.00-of-01                                                                        istio-webhook-578b6b7654-lwhq6_28e2d5d1-f4bf-42ff-961a-c1644d2830a4            142d
knative-serving    istio-webhook.webhookcertificates.00-of-01                                                                      istio-webhook-578b6b7654-lwhq6_03a80682-4109-490a-914e-db8a07f87df6            142d
knative-serving    istiocontroller.knative.dev.net-istio.pkg.reconciler.ingress.reconciler.00-of-01                                networking-istio-6b88f745c-kgws5_cbd3c902-eabf-45d0-864b-2ad23567fe3f          142d
knative-serving    istiocontroller.knative.dev.net-istio.pkg.reconciler.serverlessservice.reconciler.00-of-01                      networking-istio-6b88f745c-kgws5_3c8b5a99-81e9-407f-90b0-de78eb68580b          142d
knative-serving    webhook.configmapwebhook.00-of-01                                                                               webhook-6fffdc4d78-ftj79_2e7bc2df-7856-4187-a4e9-ed7542c52e88                  142d
knative-serving    webhook.defaultingwebhook.00-of-01                                                                              webhook-6fffdc4d78-ftj79_0ac3b9c9-11cb-4be0-9d03-3b55c3eb4e17                  142d
knative-serving    webhook.validationwebhook.00-of-01                                                                              webhook-6fffdc4d78-ftj79_2ba978d8-7ba6-42d9-a122-df7581c40aac                  142d
knative-serving    webhook.webhookcertificates.00-of-01                                                                            webhook-6fffdc4d78-ftj79_f834637d-d6c2-460a-bf5c-6c801d201069                  142d
Copy code
kube-node-lease    sv-agent3                                                                                                            sv-agent3                                                                           94d
kube-node-lease    sv-agent4                                                                                                            sv-agent4                                                                           94d
kube-node-lease    sv-agent5                                                                                                            sv-agent5                                                                           94d
kube-node-lease    sv-server                                                                                                            sv-server                                                                           143d
kube-node-lease    sv-agent6                                                                                                            sv-agent6                                                                           94d
kube-node-lease    sv-agent1                                                                                                             sv-agent1                                                                            94d
kube-node-lease    sv-agent2                                                                                                             sv-agent2                                                                            94d
kube-system        cert-manager-cainjector-leader-election                                                                         cert-manager-cainjector-5bdc6f956-kbq4q_16708ed7-8bec-43ea-b3ab-162c852bfe09   142d
kube-system        cert-manager-controller                                                                                         cert-manager-7d8cf77cc9-99nnp-external-cert-manager-controller                 142d
kube-system        cilium-operator-resource-lock                                                                                   sv-server-ioxXPUyruo                                                                143d
kube-system        cloud-controller-manager                                                                                        sv-server_7e20af5e-54f3-4263-b673-40155be06df5                                      143d
kube-system        kube-controller-manager                                                                                         sv-server_47875a89-2eb4-4733-ac7f-e0b2040b5408                                      143d
kube-system        kube-scheduler                                                                                                  sv-server_c74db847-a314-4283-a8c6-c6b1aab00a72                                      143d
kubeflow           workflow-controller                                                                                             workflow-controller-5cb67bb9db-zr5vv                                           142d
longhorn-system    driver-longhorn-io                                                                                              csi-provisioner-77b7fb5549-w57v5                                               102d
longhorn-system    external-attacher-leader-driver-longhorn-io                                                                     csi-attacher-6688cff467-nfw6b                                                  102d
longhorn-system    external-resizer-driver-longhorn-io                                                                             csi-resizer-58f5bb8799-4c9xn                                                   102d
longhorn-system    external-snapshotter-leader-driver-longhorn-io                                                                  csi-snapshotter-6d4f88d689-r9864                                               102d
c
OK.. I take it from that output that you have only a single server named sv-server? so it’s not a problem with etcd… this is weird, I’ve never seen the kubelet fail to create mirror pods for the control-plane static pods.
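For context, a minimal way to check the static-pod side on the server (paths are the RKE2 defaults; crictl is the copy bundled with RKE2):
Copy code
ls /var/lib/rancher/rke2/agent/pod-manifests/
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps \
  | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'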
m
yes, it’s sv-server