# rke2
Hi, we are experiencing the following problem with our RKE2 + Cilium cluster (v1.25.11+rke2r1). After creating a cluster, all nodes come online and work fine, but over time individual nodes start being reported as `NotReady`. Looking into such a node, it appears to have lost all network connectivity; even a ping to 127.0.0.1 fails. After a reboot of the host, everything is healthy again. In the kubelet logs we see a lot of errors like:
```
"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
failed to ensure lease exists, will retry in 200ms, error: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/xxx?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
In the Cilium logs I see these errors:
```
2023-07-13T06:22:16.59076557Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_network_policy.go:155: watch of *v2.CiliumNetworkPolicy ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590821235Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_endpoint.go:97: watch of *v2.CiliumEndpoint ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590827854Z stderr F level=error msg="Cannot update CEP" containerID= controller="sync-to-k8s-ciliumendpoint (1066)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=1066 error="Patch \"https://10.40.0.1:443/apis/cilium.io/v2/namespaces/kube-system/ciliumendpoints/rke2-metrics-server-78b84fff48-fdp5l\": http2: client connection lost" identity=27735 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
2023-07-13T06:22:16.590830104Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:183: watch of *v2alpha1.CiliumBGPPeeringPolicy ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590840863Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/service.go:72: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590843107Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/network_policy.go:72: watch of *v1.NetworkPolicy ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590845068Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/pod.go:146: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590846983Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:183: watch of *v1.Node ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590848872Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/node.go:94: watch of *v1.Node ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590850758Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_node.go:133: watch of *v2.CiliumNode ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590853395Z stderr F level=error msg="Cannot update CEP" containerID= controller="sync-to-k8s-ciliumendpoint (1811)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=1811 error="Patch \"https://10.40.0.1:443/apis/cilium.io/v2/namespaces/cattle-system/ciliumendpoints/rancher-b5b87bf46-84w2t\": http2: client connection lost" identity=48370 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
2023-07-13T06:22:16.590855302Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/namespace.go:63: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590857245Z stderr F level=error msg="Cannot update CEP" containerID= controller="sync-to-k8s-ciliumendpoint (1439)" datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=1439 error="Patch \"https://10.40.0.1:443/apis/cilium.io/v2/namespaces/kube-system/ciliumendpoints/rke2-ingress-nginx-controller-5nh4m\": http2: client connection lost" identity=32313 ipv4= ipv6= k8sPodName=/ subsys=endpointsynchronizer
2023-07-13T06:22:16.59085911Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/identitybackend/identity.go:363: watch of *v2.CiliumIdentity ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590868464Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:157: watch of *v1.EndpointSlice ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590870741Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_clusterwide_network_policy.go:97: watch of *v2.CiliumClusterwideNetworkPolicy ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:16.590875346Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/resource/resource.go:183: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding" subsys=klog
2023-07-13T06:22:40.156133358Z stderr F level=info msg="Removed endpoint" containerID= datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=470 identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2023-07-13T06:22:41.242265123Z stderr F level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=275 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2023-07-13T06:22:41.242283233Z stderr F level=info msg="Resolving identity labels (blocking)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=275 identityLabels="reserved:health" ipv4= ipv6= k8sPodName=/ subsys=endpoint
2023-07-13T06:22:41.242290031Z stderr F level=info msg="Identity of endpoint changed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=275 identity=4 identityLabels="reserved:health" ipv4= ipv6= k8sPodName=/ oldIdentity="no identity" subsys=endpoint
2023-07-13T06:22:41.320972719Z stderr F level=info msg="Rewrote endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=275 identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2023-07-13T06:22:42.866548423Z stderr F level=warning msg="hold timer expired" Key=10.20.5.1 State=BGP_FSM_ESTABLISHED Topic=Peer asn=4240020103 component=gobgp.BgpServerInstance subsys=bgp-control-plane
2023-07-13T06:22:42.866570436Z stderr F level=warning msg="sent notification" Code=4 Data="[]" Key=10.20.5.1 State=BGP_FSM_ESTABLISHED Subcode=0 Topic=Peer asn=4240020103 component=gobgp.BgpServerInstance subsys=bgp-control-plane
2023-07-13T06:22:42.866577495Z stderr F level=info msg="Peer Down" Key=10.20.5.1 Reason=hold-timer-expired State=BGP_FSM_ESTABLISHED Topic=Peer asn=4240020103 component=gobgp.BgpServerInstance subsys=bgp-control-plane
2023-07-13T06:22:42.866812526Z stderr F level=info msg="type:STATE  peer:{conf:{local_asn:4240020103  neighbor_address:\"10.20.5.1\"  peer_asn:4220020000}  state:{local_asn:4240020103  neighbor_address:\"10.20.5.1\"  peer_asn:4220020000  session_state:IDLE  router_id:\"31.169.60.134\"}  transport:{local_address:\"10.20.5.13\"  local_port:40933  remote_port:179}}"
2023-07-13T06:22:47.484185094Z stderr F level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/service.go:72: failed to list *v1.Service: Get \"https://10.40.0.1:443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=12111\": dial tcp 10.40.0.1:443: i/o timeout" subsys=klog
2023-07-13T06:22:47.484235925Z stderr F level=info msg="Trace[692371586]: \"Reflector ListAndWatch\" name:github.com/cilium/cilium/pkg/k8s/watchers/service.go:72 (13-Jul-2023 06:22:17.483) (total time: 30000ms):" subsys=klog
2023-07-13T06:22:47.484239761Z stderr F level=info msg="Trace[692371586]: ---\"Objects listed\" error:Get \"https://10.40.0.1:443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=12111\": dial tcp 10.40.0.1:443: i/o timeout 30000ms (06:22:47.484)" subsys=klog
```
But I am not sure whether this is what breaks the network interfaces, or whether Cilium is just suffering from the interfaces already being down. Any tips on how to troubleshoot this?
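For reference, here is a sketch of the checks I could run on an affected node (before rebooting it) to try to tell the two cases apart. The paths assume a default RKE2 install and the commands assume standard Cilium tooling; adjust as needed for a customized setup:

```shell
# 1. The kubelet error "cni plugin not initialized" is reported when the CNI
#    config directory is empty. Check whether the config is still there
#    (default RKE2 path, an assumption if the data-dir was changed):
ls -l /var/lib/rancher/rke2/agent/etc/cni/net.d/

# 2. Ping to 127.0.0.1 failing suggests the interfaces themselves may be down,
#    not just pod networking. Check loopback and the physical NIC directly:
ip link show lo
ip addr show

# 3. Check whether the cilium-agent is still running on the node and what it
#    thinks its own health is (run from a host that can still reach the API):
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
kubectl -n kube-system exec ds/cilium -- cilium status --verbose

# 4. Look for kernel-level causes (OOM kills, NIC driver resets, link flaps)
#    around the time the node went NotReady:
journalctl -k --since "-1 hour" | grep -iE "oom|link is (down|up)|reset"
```

If the CNI config is present and `lo` is up but the API server is unreachable, that would point at a routing/BGP problem (the gobgp "hold timer expired" lines suggest the BGP session dropped) rather than at Cilium itself; if `lo` is actually down, something on the host is tearing down interfaces and Cilium is just a victim.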