# k3s
c
I mean, maybe I could update it to 1.24.14 (the latest version), but I want to learn 😉 and troubleshoot what the issue is
b
if you ssh into the VM and check the journalctl logs, is there anything interesting?
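For reference, on a k3s server node the relevant systemd unit is usually k3s (k3s-agent on agent-only nodes), so a check could look something like this:
```
# follow the k3s service logs on a server node
sudo journalctl -u k3s -f

# or show only warnings and errors since the last boot
sudo journalctl -u k3s -b -p warning
```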
c
not really, I didn't find anything
i saw an info message that for one component (i think the scheduler) the node took over leadership when another control node was rebooted
so it seemed to work perfectly fine
no error messages, a few warnings every hour or so
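As an aside, one way to confirm which node currently holds the scheduler/controller-manager leadership is to look at the Lease objects in kube-system, roughly:
```
# leader election state lives in coordination.k8s.io Lease objects
kubectl -n kube-system get lease kube-scheduler kube-controller-manager
```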
b
if you check the pods or nodes directly with kubectl, does it look healthy?
c
on the workload page in Rancher itself everything looks fine for that cluster, let me check on the hosts via the kubectl that is bundled with k3s
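For reference, the listing below came from something along these lines (k3s bundles kubectl, so it can be run directly on a server node):
```
sudo k3s kubectl get pods -A
```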
```
NAMESPACE                  NAME                                                     READY   STATUS      RESTARTS          AGE
cattle-fleet-system        fleet-agent-7d6d98b485-bjwbk                             1/1     Running     17 (4h32m ago)    259d
cattle-monitoring-system   alertmanager-rancher-monitoring-alertmanager-0           2/2     Running     8 (4h32m ago)     371d
cattle-monitoring-system   prometheus-rancher-monitoring-prometheus-0               3/3     Running     9 (7d3h ago)      371d
cattle-monitoring-system   pushprox-k3s-server-client-5zr79                         1/1     Running     4 (4h32m ago)     371d
cattle-monitoring-system   pushprox-k3s-server-client-frj4c                         1/1     Running     5 (4h36m ago)     371d
cattle-monitoring-system   pushprox-k3s-server-client-gwgvd                         1/1     Running     3 (7d3h ago)      371d
cattle-monitoring-system   pushprox-k3s-server-client-js4v4                         1/1     Running     5 (4h42m ago)     371d
cattle-monitoring-system   pushprox-k3s-server-proxy-f4f5d4874-pwtmn                1/1     Running     3 (7d3h ago)      371d
cattle-monitoring-system   rancher-monitoring-grafana-57777cc795-z7xb2              3/3     Running     12 (4h36m ago)    371d
cattle-monitoring-system   rancher-monitoring-kube-state-metrics-5bc8bb48bd-lcxxd   1/1     Running     4 (4h32m ago)     371d
cattle-monitoring-system   rancher-monitoring-operator-f79dc4944-ln2bl              1/1     Running     5 (4h36m ago)     371d
cattle-monitoring-system   rancher-monitoring-prometheus-adapter-8846d4757-mkbgw    1/1     Running     3 (7d3h ago)      371d
cattle-monitoring-system   rancher-monitoring-prometheus-node-exporter-7df65        1/1     Running     4 (4h36m ago)     371d
cattle-monitoring-system   rancher-monitoring-prometheus-node-exporter-fcmhz        1/1     Running     3 (7d3h ago)      371d
cattle-monitoring-system   rancher-monitoring-prometheus-node-exporter-ksl5g        1/1     Running     4 (4h32m ago)     371d
cattle-monitoring-system   rancher-monitoring-prometheus-node-exporter-vf847        1/1     Running     5 (4h42m ago)     371d
cattle-system              cattle-cluster-agent-854bbc47f5-2lqw5                    1/1     Running     15 (4h36m ago)    237d
cattle-system              cattle-cluster-agent-854bbc47f5-fb9ml                    1/1     Running     23 (4h42m ago)    237d
cattle-system              system-upgrade-controller-7b8d94c7f5-kcj4k               1/1     Running     4 (4h32m ago)     259d
kube-system                coredns-b96499967-w8gjk                                  1/1     Running     4 (4h42m ago)     194d
kube-system                helm-install-cinder-csi-2v5tq                            0/1     Completed   0                 194d
kube-system                helm-install-openstack-controller-manager-p25s6          0/1     Completed   0                 194d
kube-system                helm-install-traefik-crd-msww2                           0/1     Completed   0                 194d
kube-system                helm-install-traefik-d5z75                               0/1     Completed   0                 7d3h
kube-system                local-path-provisioner-84bb864455-57hh4                  1/1     Running     6 (4h42m ago)     371d
kube-system                metrics-server-ff9dbcb6c-jkwjn                           1/1     Running     4 (4h42m ago)     194d
kube-system                openstack-cinder-csi-controllerplugin-547bc58794-x8n7m   6/6     Running     289 (4h42m ago)   371d
kube-system                openstack-cinder-csi-nodeplugin-2jqmw                    3/3     Running     61 (4h32m ago)    371d
kube-system                openstack-cinder-csi-nodeplugin-69nf5                    3/3     Running     34 (4h42m ago)    371d
kube-system                openstack-cinder-csi-nodeplugin-gdglv                    3/3     Running     29 (5d1h ago)     371d
kube-system                openstack-cinder-csi-nodeplugin-sthn8                    3/3     Running     72 (4h36m ago)    371d
kube-system                openstack-cloud-controller-manager-mzkql                 1/1     Running     24 (4h36m ago)    371d
kube-system                openstack-cloud-controller-manager-n4q77                 1/1     Running     22 (4h32m ago)    371d
kube-system                openstack-cloud-controller-manager-tn6ft                 1/1     Running     24 (4h42m ago)    371d
kube-system                traefik-85fcf6b649-fq6ck                                 1/1     Running     2 (4h36m ago)     25d
```
looks good to me i'd say
```
# kubectl get nodes
NAME                             STATUS   ROLES                              AGE    VERSION
infra-kube-ctrl-e5a402eb-8tc2n   Ready    control-plane,etcd,master,worker   371d   v1.24.4+k3s1
infra-kube-ctrl-e5a402eb-bw9q4   Ready    control-plane,etcd,master,worker   371d   v1.24.4+k3s1
infra-kube-ctrl-e5a402eb-smwfs   Ready    control-plane,etcd,master,worker   371d   v1.24.4+k3s1
infra-kube-wrk-e67b3b9e-tqpvz    Ready    worker                             371d   v1.24.4+k3s1
```
g
Check the network stack, there was a module (vxlan) dropped from the default Ubuntu kernel packages in 22.04. https://www.seanfoley.blog/microk8s-ubuntu-22-04-lts-jammy-jellyfish-broken-needs-vxlan-support/ This can cause what you're seeing, especially on new nodes (not usually upgraded ones)
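If it is the issue described in that post, checking and fixing it would look roughly like this (on the affected Ubuntu images the module is packaged in linux-modules-extra):
```
# check whether the vxlan module is loaded / available
lsmod | grep vxlan
modinfo vxlan

# install the extra modules for the running kernel and load vxlan
sudo apt install linux-modules-extra-$(uname -r)
sudo modprobe vxlan
```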
c
That's the thing, I did not upgrade anything: neither Ubuntu nor Rancher nor k3s. Only after it was stuck in reconciling did I decide to upgrade the outdated Ubuntu packages and reboot the nodes
b
pods look good, although with lots of restarts. Anyway, what does the log of cattle-cluster-agent-854bbc47f5 say?
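Pulling those logs could be done with something like this (assuming the pods carry the usual app=cattle-cluster-agent label):
```
# tail the cluster agent logs across both replicas
kubectl -n cattle-system logs -l app=cattle-cluster-agent --tail=100

# or a single pod directly
kubectl -n cattle-system logs cattle-cluster-agent-854bbc47f5-2lqw5
```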
c
logs look good so far, a bunch of deprecation warnings for policy/v1beta1 (which will be removed in 1.25) and an error during subscribe: websocket: close sent
btw, the one cattle-cluster-agent running on the "reconciling" node has no errors, only the other one on the "active" node does
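To see which node each agent replica is scheduled on, -o wide helps (same label assumption as above):
```
# show the node each cattle-cluster-agent pod is running on
kubectl -n cattle-system get pods -o wide -l app=cattle-cluster-agent
```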
@gray-pillow-95920 - checked via lsmod on the node - vxlan is available
the many restarts on the pods might be from our openstack upgrade 1.5 weeks ago. there we shut down all VMs due to a network migration. afterwards the cluster was started and was in a kind of broken state for a few hours or a day or so. then i rebooted the nodes one by one and afterwards all was good (cluster manager in the rancher dashboard showed active) until yesterday (or so, didn't monitor too closely) when this one node showed up as reconciling without anything else being done or executed on that cluster
g
Ok, I haven't seen that. In my case I missed the vxlan step and rebuilt, then found vxlan was still missing. Fine if it didn't help, it just cost me a lot of time so I thought I'd share
c
problem "solved" itself 😉 played around and tried to update to 1.24.14, after that rerun our terraform script (which also for that cluster tried to disable the local-storage provider which in turn leads to new VMs being booted. That killed the cluster in the end 😉 So I ended up deleting it and build it fresh. Now everything is fine.
b
great! Sorry I could not help you more, we have a busy week with July releases