https://rancher.com/ logo
#general
Title
# general
a

adamant-kite-43734

02/07/2023, 8:13 AM
This message was deleted.
q

quiet-rocket-69094

02/07/2023, 9:24 PM
Do you have RKE logs from the nodes in the cluster and any errors from the failurer?
e

eager-london-83975

02/07/2023, 9:25 PM
That is the thing, no errors reported in the masternode. While there are a few kubelet errors in the nodes that fail to add
q

quiet-rocket-69094

02/07/2023, 9:26 PM
what about errors on the rke containerr spun up on the nodes you're trying to add?
e

eager-london-83975

02/07/2023, 9:26 PM
The canal pods and csi pods attempt to run, but they fail to run because the node is not found ( literally what i've managed to catch inside the pods before the exit )
q

quiet-rocket-69094

02/07/2023, 9:27 PM
the node they are on is not found?
e

eager-london-83975

02/07/2023, 9:28 PM
so what happens is this: i execute the rke up command It detects the node is not found in the cluster, so it attempts to add it The node appears in the cluster for like 1 second, and the canal and csi pods attempt to run, the node disappears the pods disappear too the only containers in the failed to add node are nginx proxy, kubelet and kube-proxy
kube-proxy runs fine, all services and iptables look good
kubelet throws errors that that specific node is not found in the cluster
but there is no error as to why it failed to get added
q

quiet-rocket-69094

02/07/2023, 9:33 PM
How many nodes are currently in the cluster? Has it ever worked?
e

eager-london-83975

02/07/2023, 9:53 PM
it definitely worked
I have 5 nodes, where 3 are masternodes and 2 are worked nodes
there used to be 4 worker nodes and 3 masternodes
q

quiet-rocket-69094

02/07/2023, 9:54 PM
Are you re-using nodes at all?
e

eager-london-83975

02/07/2023, 9:54 PM
I have, once, removed a node without using RKE.
the nodes are created in openstack, freshly installed
no other purpose than that
i even tried reinstalling them multiple times
i created fresh nodes too, nothing worked
it seems like the masternodes somehow refuse to accept a new node, because these nodes are 1:1 with the nodes that are already running fine there
I have double checked the state file and it also looks good both the one stored in the cluster config, and the one i have locally
I even allowed each and every port in the security groups to eliminate connectivity issues so that cant be it
the fact that the node appears once, kube-proxy detects everything, removes that possibility
q

quiet-rocket-69094

02/07/2023, 10:01 PM
and you're not reusing hostnames or IPs or anything
?
It sounds to me as though the node is stored as deleted or something
e

eager-london-83975

02/07/2023, 10:02 PM
Actually, before when i removed them, i did try to re-add the same ones. Then when it didnt work, i tried creating new nodes with new ips, but they didnt work either
that sounds very plausible
is there anyway i can undo that
q

quiet-rocket-69094

02/07/2023, 10:03 PM
well we need to find the logs for what is being done.
Either kube-apiserver or the rke pod logs. Preferably debug logs to show each action
e

eager-london-83975

02/07/2023, 10:05 PM
I will do exactly that! Sounds like a very good step and debugging this
since there are 3 masternodes, the request will end up in either of them right ?
q

quiet-rocket-69094

02/07/2023, 10:07 PM
so stick api in debug, try adding the node, watch it fail, then gather logs (can use https://www.suse.com/support/kb/doc/?id=000020191) and go thrrough them carefully. If I find the time I can as well if you upload them here
yeah, I'd suggest just getting logs from all the nodes
e

eager-london-83975

02/07/2023, 10:08 PM
check your pm
q

quiet-rocket-69094

02/07/2023, 10:08 PM
any of the apiservers could get the request and from there it'll get forwarded to etcd and enacted
e

eager-london-83975

02/07/2023, 10:11 PM
rancher/mirrored-coreos-etcd:v3.5.4 "/usr/local/bin/etcd…" 3 minutes ago Restarting (2) 12 seconds ago etcd
so one of the etcd servers is down
actually all of them are restarting
wth
q

quiet-rocket-69094

02/07/2023, 10:13 PM
yeah, if they can't talk to each other they will fail. I think the log line in your dm (below jic) indicates a communication issue:
Err: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:2379: connect: connection refused". Reconnecting..
though it could also be a result of reaching out when the service it is trying to reach is down
e

eager-london-83975

02/07/2023, 10:13 PM
no mybad, i added the extra arg debug: 'true'
that seems to have broken it
Copy code
{"log":"I0206 15:56:23.423463  102814 nodeinfomanager.go:401] Failed to publish CSINode: nodes \"172.31.64.124\" not found\n","stream":"stderr","time":"2023-02-06T15:56:23.423877017Z"}
{"log":"E0206 15:56:23.423538  102814 nodelease.go:49] \"Failed to get node when trying to set owner ref to the node lease\" err=\"nodes \\\"172.31.64.124\\\" not found\" node=\"172.31.64.124\"\n","stream":"stderr","time":"2023-02-06T15:56:23.423884649Z"}
q

quiet-rocket-69094

02/07/2023, 10:44 PM
do you see ooms in the system logs?
e

eager-london-83975

02/07/2023, 10:44 PM
Copy code
{"log":"I0206 15:56:23.517061  102814 kubelet_node_status.go:70] \"Attempting to register node\" node=\"172.31.64.124\"\n","stream":"stderr","time":"2023-02-06T15:56:23.517103406Z"}
{"log":"I0206 15:56:23.564323  102814 pod_container_deletor.go:79] \"Container not found in pod's containers\" containerID=\"16c015636c294ec4951c52c812bfb7a749d7478ae01fcd06e87c68c9dbe13f0b\"\n","stream":"stderr","time":"2023-02-06T15:56:23.564377059Z"}
{"log":"I0206 15:56:23.564438  102814 pod_container_deletor.go:79] \"Container not found in pod's containers\" containerID=\"74ac10fb915213571a031ecc368377d5a1fc2b821294f7b89c5d25709212032a\"\n","stream":"stderr","time":"2023-02-06T15:56:23.564509351Z"}
{"log":"I0206 15:56:23.594484  102814 nodeinfomanager.go:401] Failed to publish CSINode: nodes \"172.31.64.124\" not found\n","stream":"stderr","time":"2023-02-06T15:56:23.594537424Z"}
{"log":"E0206 15:56:23.613671  102814 kubelet.go:2424] \"Error getting node\" err=\"node \\\"172.31.64.124\\\" not found\"\n","stream":"stderr","time":"2023-02-06T15:56:23.613718025Z"}
{"log":"E0206 15:56:23.714006  102814 kubelet.go:2424] \"Error getting node\" err=\"node \\\"172.31.64.124\\\" not found\"\n","stream":"stderr","time":"2023-02-06T15:56:23.714138815Z"}
{"log":"I0206 15:56:23.796723  102814 kubelet_node_status.go:73] \"Successfully registered node\" node=\"172.31.64.124\"\n","stream":"stderr","time":"2023-02-06T15:56:23.796805534Z"}
{"log":"I0206 15:56:23.814490  102814 kuberuntime_manager.go:1095] \"Updating runtime config through cri with podcidr\" CIDR=\"10.42.3.0/24\"\n","stream":"stderr","time":"2023-02-06T15:56:23.8145537Z"}
{"log":"time=\"2023-02-06T15:56:23Z\" level=info msg=\"Docker cri received runtime config \u0026RuntimeConfig{NetworkConfig:\u0026NetworkConfig{PodCidr:10.42.3.0/24,},}\"\n","stream":"stderr","time":"2023-02-06T15:56:23.814887962Z"}
{"log":"I0206 15:56:23.814954  102814 kubelet_network.go:60] \"Updating Pod CIDR\" originalPodCIDR=\"\" newPodCIDR=\"10.42.3.0/24\"\n","stream":"stderr","time":"2023-02-06T15:56:23.814989768Z"}
{"log":"E0206 15:56:23.820507  102814 kubelet.go:2349] \"Container runtime network not ready\" networkReady=\"NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized\"\n","stream":"stderr","time":"2023-02-06T15:56:23.820555587Z"}
{"log":"I0206 15:56:24.390034  102814 apiserver.go:52] \"Watching apiserver\"\n","stream":"stderr","time":"2023-02-06T15:56:24.390174552Z"}
{"log":"W0206 15:56:24.492020  102814 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/cinder.csi.openstack.org-reg.sock /var/lib/kubelet/plugins_registry/cinder.csi.openstack.org-reg.sock \u003cnil\u003e 0 \u003cnil\u003e}. Err: connection error: desc = \"transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/cinder.csi.openstack.org-reg.sock: connect: connection refused\". Reconnecting...\n","stream":"stderr","time":"2023-02-06T15:56:24.492124017Z"}
{"log":"I0206 15:56:24.593985  102814 kubelet.go:2072] \"SyncLoop ADD\" source=\"api\" pods=[kube-system/canal-jlj2w kube-system/csi-cinder-nodeplugin-wpbl6 lens-metrics/node-exporter-q5bv7]\n","stream":"stderr","time":"2023-02-06T15:56:24.594114492Z"}
Not sure what that is tbh
q

quiet-rocket-69094

02/07/2023, 10:46 PM
oom is out of memory. If you used that log collector to gather the logs we should be able to disect the bundles. There should be system logs in the tarballs that you can run a grep against to grab any instance of oom (
grep -i oom
)
e

eager-london-83975

02/07/2023, 10:47 PM
Copy code
{"log":"WARNING: No oom kill disable support\n","stream":"stderr","time":"2023-02-06T15:56:13.315394441Z"}
{"log":"time=\"2023-02-06T15:56:13Z\" level=info msg=\"Docker Info: \u0026{ID:CK74:4GP6:6PS6:4GLD:U5XJ:K2GI:AXCQ:NOYQ:L44E:3F3S:N7Z5:5CLG Containers:6 ContainersRunning:2 ContainersPaused:0 ContainersStopped:4 Images:13 Driver:overlay2 DriverStatus:[[Backing Filesystem extfs] [Supports d_type true] [Native Overlay Diff true] [userxattr false]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:false KernelMemoryTCP:false CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:true IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:36 OomKillDisable:false NGoroutines:43 SystemTime:2023-02-06T16:56:13.378779511+01:00 LoggingDriver:json-file CgroupDriver:systemd NEventsListener:0 KernelVersion:5.14.0-202.el9.x86_64 OperatingSystem:CentOS Stream 9 OSType:linux Architecture:x86_64 IndexServerAddress:<https://index.docker.io/v1/> RegistryConfig:0xc000460000 NCPU:4 MemTotal:8061296640 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:elx-rke-prod-w-1.openstacklocal Labels:[] ExperimentalBuild:false ServerVersion:20.10.21 ClusterStore: ClusterAdvertise: Runtimes:map[io.containerd.runc.v2:{Path:runc Args:[]} io.containerd.runtime.v1.linux:{Path:runc Args:[]} runc:{Path:runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:\u003cnil\u003e Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID:31aa4358a36870b21a992d3ad2bef29e1d693bec Expected:31aa4358a36870b21a992d3ad2bef29e1d693bec} RuncCommit:{ID:v1.1.4-0-g5fd4c4d Expected:v1.1.4-0-g5fd4c4d} InitCommit:{ID:de40ad0 Expected:de40ad0} SecurityOptions:[name=seccomp,profile=default name=cgroupns] ProductLicense: Warnings:[]}\"\n","stream":"stderr","time":"2023-02-06T15:56:13.383885388Z"}
{"log":"I0206 15:56:23.376468  102814 flags.go:64] FLAG: --oom-score-adj=\"-999\"\n","stream":"stderr","time":"2023-02-06T15:56:23.376519779Z"}
{"log":"I0206 15:56:23.383649  102814 container_manager_linux.go:267] \"Creating Container Manager object based on Node Config\" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:\u003cnil\u003e} {Signal:nodefs.available Operator:LessThan Value:{Quantity:\u003cnil\u003e Percentage:0.1} GracePeriod:0s MinReclaim:\u003cnil\u003e} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:\u003cnil\u003e Percentage:0.05} GracePeriod:0s MinReclaim:\u003cnil\u003e} {Signal:imagefs.available Operator:LessThan Value:{Quantity:\u003cnil\u003e Percentage:0.15} GracePeriod:0s MinReclaim:\u003cnil\u003e}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}\n","stream":"stderr","time":"2023-02-06T15:56:23.383674407Z"}
{"log":"I0206 15:56:23.422031  102814 manager.go:1203] Started watching for new ooms in manager\n","stream":"stderr","time":"2023-02-06T15:56:23.422068371Z"}
This is everything that matches OOM
q

quiet-rocket-69094

02/07/2023, 10:48 PM
ok, doesn't look like any oom kills
e

eager-london-83975

02/07/2023, 10:48 PM
Copy code
6f0ddc243c56   rancher/hyperkube:v1.24.4-rancher1   "/opt/rke-tools/entr…"   31 hours ago   Up 31 hours             kube-proxy
c8912e33d14d   rancher/hyperkube:v1.24.4-rancher1   "/opt/rke-tools/entr…"   31 hours ago   Up 31 hours             kubelet
c3354d945114   rancher/rke-tools:v0.1.87            "nginx-proxy CP_HOST…"   31 hours ago   Up 31 hours             nginx-proxy
q

quiet-rocket-69094

02/07/2023, 10:48 PM
Here, I don't hve much time but I'll do a quick screen share if you can. Give me a slack call
e

eager-london-83975

02/07/2023, 10:48 PM
sure
@quiet-rocket-69094 I have managed to solve it thanks to you. After debugging the issue for a while with TJ, we were able to find out that the openstack-cloud-controller-manager was deleting the node as it was being added. Disabling it fixed the issue temporarily, but the real reason still remains unknown. Will have to check on that now!
🎉 1
q

quiet-rocket-69094

02/08/2023, 11:50 PM
Just in case anyone runs into this or any other issue in the futre where some action appears to be taken in a cluster and you need to track down why it's happening, my suggestion is to enable debug logging for the k8s distribution (this may help: https://www.suse.com/support/kb/doc/?id=000020199), log the time (in UTC as the default and best practice is servers logging in UTC to avoid time zone confusion), reproduce the issue, then gather logs. You can use the Rancher log collector (https://www.suse.com/support/kb/doc/?id=000020230) as the tar bundles produced are pretty nice for root cause analysis. When going through the apiserver logs (I like to load all logs from all kube-apiserver containers in
lnav
) filter for something to isolate the resource being changed and cross reference with the timestamp of change. In this case we found a
DELETE
being called against the node IP by the OCCM so Ergjan tested disabling it
168 Views