# k3s
a
This message was deleted.
b
Recently updated all nodes to v1.31.4
The nodes are running as Proxmox VMs. The hosts are all on SSDs and I don't think I have a high load, but I did move a couple of them to NVMe disks to reduce any IOPS issues for etcd?
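If etcd disk latency is the suspicion, a quick fio fdatasync test on each server VM gives a concrete number. This is only a sketch; the data directory below is the default k3s embedded-etcd location, so adjust if yours differs:
```
# Rough etcd-style write-latency check: small sequential writes, fdatasync after each.
# /var/lib/rancher/k3s/server/db is assumed to be where the embedded etcd data lives.
fio --name=etcd-fsync-test \
    --directory=/var/lib/rancher/k3s/server/db \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
```
The usual guidance is that the 99th percentile fdatasync latency reported by fio should stay in the single-digit milliseconds for etcd to be comfortable.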
c
What specifically is it checking to determine that things are up or down?
b
For Portainer, I believe it's trying to access the main service:
kubernetes.default.svc
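A quick way to check that same endpoint from inside the cluster (the pod name and curl image here are illustrative choices, and any HTTP status at all proves the service is reachable):
```
# One-shot pod that curls the in-cluster API service Portainer appears to probe.
# -k skips cert verification; 200 or 401/403 both mean the endpoint answered.
kubectl run svc-check --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/healthz
```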
The two nodes that seem to consistently report back
*Error scraping target:* Get "<https://10.42.43.72:6443/metrics>": context deadline exceeded
are the two VMs running on the same host, although on different disks. I tried migrating the third back onto the same host as the other two and it seemed to be in a worse position, so I migrated it back. My end state/goal is three hosts with a single VM each, but I'm keen to narrow down what's happening before committing to setting up the third host.
For clarity, the IPs are different for each endpoint:
*Error scraping target:* Get "<https://10.42.43.72:6443/metrics>": context deadline exceeded
*Error scraping target:* Get "<https://10.42.43.70:6443/metrics>": context deadline exceeded
c
are you dropping traffic between nodes when they’re on different hosts?
b
Not intentionally. I've tried pinging between the different nodes and checking that I can curl port 6443 from each of them, and it all seems to work as expected.
Happy to take any direction on what to test next.
c
can you curl 6443 on every node from each of the other nodes?
b
Yes, at least enough to get a 401 response, so I think it's a performance issue? The VMs have 2 vCPUs and 4 GB/8 GB of RAM.
And that's from each server and agent node (all five nodes could curl each of the three servers on 6443).
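For anyone following along, the check was roughly this; the loop and IPs below just mirror the thread, and the 401 from an anonymous request is expected:
```
# From each node, hit every server's apiserver port. -k skips cert verification;
# an unauthenticated request coming back 401 is enough to show the apiserver answered.
for ip in 10.42.43.70 10.42.43.71 10.42.43.72; do
  curl -sk -o /dev/null -w "$ip -> %{http_code}\n" "https://$ip:6443/"
done
```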
Ok, I am now even more confused. I pulled the client cert into Postman on my desktop and I can hit all three endpoints for /metrics perfectly: sub-200ms everywhere, and sub-100ms for the
10.42.43.70
node, which is the one I doubled the vCPU on as part of testing.
10.42.43.71
which is the only node that consistently reports as healthy in Prometheus, responds in 150-300ms, so I can only assume these values are totally fine. So could it just be that my Prometheus deployment is broken in some form? But that wouldn't explain why I'm also having issues with Portainer.
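The Postman test translates to a one-liner with curl; the client cert/key paths below are the usual k3s server locations, so treat them as an assumption:
```
# Timed /metrics request using the admin client certificate, roughly what Postman did.
# Cert/key paths are the default k3s server locations - adjust if your install differs.
curl -sk -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  --cert /var/lib/rancher/k3s/server/tls/client-admin.crt \
  --key  /var/lib/rancher/k3s/server/tls/client-admin.key \
  https://10.42.43.70:6443/metrics
```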
I might have found something useful:
kubectl logs -n kube-system metrics-server-7cbbc464f4-wpvpq
is filled with:
"Failed to scrape node, timeout to access kubelet" err="Get \"<https://10.42.43.71:10250/metrics/resource>\": context deadline exceeded" node="k3s02" timeout="10s"
hmm, maybe not. I killed the pod, it restarted fine, and those log entries haven't returned at all (they had been showing up every 15 seconds or so)
hmm ok, maybe this is giving me a path to investigate. I killed all of the Prometheus pods as well and now I have just one node/endpoint showing as an error:
*Error scraping target:* Get "<https://10.42.43.71:6443/metrics>": context deadline exceeded
So this has to be something network-related: k3s02/10.42.43.71 is the node that I have running on the second host. But from the VM shell I can curl the endpoint completely fine?
I think I'm seeing a bit of a pattern: if the metrics/Prometheus server is deployed to pve01 (the Proxmox host machine itself, which I've deployed an agent role to), packets are dropped/time out to k3s01/k3s03, which are the two VMs on that host. And vice-versa if it's deployed to pve02 and k3s02, but I don't fully understand why yet.
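A quick way to line pod placement up with which targets time out (nothing assumed here beyond the component names matching):
```
# Show which node each scraping component is actually scheduled on.
kubectl get pods -A -o wide | grep -Ei 'prometheus|metrics-server'
```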
c
Ports not open between the two hosts? Or packets being dropped when leaving the CNI?
b
I'm thinking the latter, as:
• It's able to query the nodes on the other host; it's failing when trying to query the nodes on the same host (as VMs).
• I have zero issues when I query from the VM/host shell directly, so it must be that the traffic is coming from the CNI?
I've been trying to work out how to exec into the prometheus-server pod to ping/curl back out, but I'm struggling as it's busybox/alpine and I get a non-root user
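One way around the non-root/busybox limitation is an ephemeral debug container attached to the existing pod, so the checks come from the pod's own network namespace; the namespace and pod name below are placeholders:
```
# Attach a throwaway netshoot container to the Prometheus pod (shares its network namespace).
# <namespace> and the pod name are placeholders - substitute the real ones.
kubectl debug -it -n <namespace> prometheus-server-xxxxxxxx-xxxxx \
  --image=nicolaka/netshoot -- bash
```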
yeah ok, I think I've proved that now:
```
kubectl run debug-pod --rm -it --image=nicolaka/netshoot --overrides='{"spec": { "nodeSelector": {"kubernetes.io/hostname": "pve01"}}}' -- /bin/bash
If you don't see a command prompt, try pressing enter.
debug-pod:~# ping 10.42.43.70
PING 10.42.43.70 (10.42.43.70) 56(84) bytes of data.
^C
--- 10.42.43.70 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2039ms
debug-pod:~# ping 10.42.43.71
PING 10.42.43.71 (10.42.43.71) 56(84) bytes of data.
64 bytes from 10.42.43.71: icmp_seq=1 ttl=63 time=0.528 ms
64 bytes from 10.42.43.71: icmp_seq=2 ttl=63 time=0.163 ms
^C
--- 10.42.43.71 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 0.163/0.345/0.528/0.182 ms
debug-pod:~# ping 10.42.43.72
PING 10.42.43.72 (10.42.43.72) 56(84) bytes of data.
^C
--- 10.42.43.72 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4133ms
```
hmmm, my next thought is that I did reconfigure the Proxmox interfaces so that I can have multiple VLAN tags for a single VM, which means the main interface is now
vmbr0.43@vmbr0
where it was
vmbr0
when they were originally registered to the cluster
I'm guessing my next best step is to manually uninstall each agent node completely and then re-add it? This would take a bit of coordination to manage the Longhorn volumes, so I'm keen to know if there are steps I could try before that. I guess I could add a temporary node on what will be pve03, set up in the same way, and see if that works. I'm confused as to why networking is completely fine to the VM on the other host but not to the VMs on the same host; I would have thought the routes would be identical.
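If it does come to removing and re-adding an agent, the usual sequence is roughly the one below; the node name and token are placeholders, and the uninstall script path assumes the node was installed with the standard k3s install script:
```
# Drain and remove one agent at a time so Longhorn can rebuild replicas elsewhere.
kubectl drain <agent-node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <agent-node>

# On the agent itself (script is laid down by the standard k3s install script):
/usr/local/bin/k3s-agent-uninstall.sh

# Re-join, pointing at one of the servers; <token> is the cluster join token.
curl -sfL https://get.k3s.io | K3S_URL=https://10.42.43.70:6443 K3S_TOKEN=<token> sh -
```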
It was my Proxmox interfaces being misconfigured. I redid the VLANs at the interface level rather than at the bridge, and now bridging works between the CNI and the host VMs. I have green across the board in Prometheus and I haven't seen Portainer report
kubernetes.default.svc
as down since. Thanks for the rubber ducking @creamy-pencil-82913, particularly as the issue ended up not being related to or caused by k3s at all!
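For anyone who hits the same symptom, the working layout was along these lines. This is only a sketch of /etc/network/interfaces showing the cluster VLAN, with the NIC name, addresses, and VLAN 43 filled in as assumptions based on this thread:
```
# /etc/network/interfaces sketch: tag the VLAN on the physical NIC and bridge the
# tagged sub-interface, rather than creating vmbr0.43 on top of the bridge.
# "eno1" and the addresses are assumptions - adjust to the actual hardware.
auto eno1
iface eno1 inet manual

auto eno1.43
iface eno1.43 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.42.43.10/24
    gateway 10.42.43.1
    bridge-ports eno1.43
    bridge-stp off
    bridge-fd 0
```
With a layout like this the VMs attach to the bridge untagged and land in VLAN 43 alongside the host, which seems to be what let traffic flow between the host-level agent and the co-located VMs.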