# k3s
a
This message was deleted.
b
Recently updated all nodes to v1.31.4
The nodes are running as Proxmox VMs. The hosts are all on SSDs and I don't think I have a high load, but I did move a couple of them to NVMe disks to reduce any IOPS issues for etcd?
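If etcd disk latency is the suspicion, a quick fio fdatasync test on each server VM gives a concrete number. This is only a sketch; the data directory below is the default k3s embedded-etcd location, so adjust if yours differs:
```
# Rough etcd-style write-latency check: small sequential writes, fdatasync after each.
# /var/lib/rancher/k3s/server/db is assumed to be where the embedded etcd data lives.
fio --name=etcd-fsync-test \
    --directory=/var/lib/rancher/k3s/server/db \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
```
The usual guidance is that the 99th percentile fdatasync latency reported by fio should stay in the single-digit milliseconds for etcd to be comfortable.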
c
What specifically is it checking to determine that things are up or down?
b
For Portainer, I believe it's trying to access the main service:
kubernetes.default.svc
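A quick way to check that same endpoint from inside the cluster (the pod name and curl image here are illustrative choices, and any HTTP status at all proves the service is reachable):
```
# One-shot pod that curls the in-cluster API service Portainer appears to probe.
# -k skips cert verification; 200 or 401/403 both mean the endpoint answered.
kubectl run svc-check --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/healthz
```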
The two nodes that seem to consistently report back
*Error scraping target:* Get "<https://10.42.43.72:6443/metrics>": context deadline exceeded
are the two VMs running on the same host, although on different disks. I tried migrating the third back onto the same host as the other two and it seemed to be in a worse position, so I migrated it back. My end state/goal is three hosts with a single VM each, but I'm keen to narrow down what's happening before committing to setting up the third host.
For clarity, the IPs are different for each endpoint:
*Error scraping target:* Get "<https://10.42.43.72:6443/metrics>": context deadline exceeded
*Error scraping target:* Get "<https://10.42.43.70:6443/metrics>": context deadline exceeded
c
are you dropping traffic between nodes when they’re on different hosts?
b
Not intentionally. I've tried pinging between the different nodes and checking that I can curl port 6443 from each of them, and it all seems to work as expected.
Happy to take any direction on what to test next.
c
can you curl 6443 on every node from each of the other nodes?
b
Yes, at least enough to get a 401 response, so I think it's a performance issue? The VMs have 2 vCPUs and 4 GB/8 GB of RAM.
And that's from each server and agent node (all five nodes could curl each of the three servers on 6443).
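For anyone following along, the check was roughly this; the loop and IPs below just mirror the thread, and the 401 from an anonymous request is expected:
```
# From each node, hit every server's apiserver port. -k skips cert verification;
# an unauthenticated request coming back 401 is enough to show the apiserver answered.
for ip in 10.42.43.70 10.42.43.71 10.42.43.72; do
  curl -sk -o /dev/null -w "$ip -> %{http_code}\n" "https://$ip:6443/"
done
```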
Ok, I am now even more confused. I pulled the client cert into Postman on my desktop and I can hit all three endpoints for /metrics perfectly: sub-200ms everywhere, and sub-100ms for the
10.42.43.70
node, which is the one I doubled the vCPU on as part of testing.
10.42.43.71
which is the only node that consistently reports as healthy in Prometheus, responds in 150-300ms, so I can only assume these values are totally fine. So could it just be that my Prometheus deployment is broken in some form? But that wouldn't explain why I'm also having issues with Portainer.
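The Postman test translates to a one-liner with curl; the client cert/key paths below are the usual k3s server locations, so treat them as an assumption:
```
# Timed /metrics request using the admin client certificate, roughly what Postman did.
# Cert/key paths are the default k3s server locations - adjust if your install differs.
curl -sk -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  --cert /var/lib/rancher/k3s/server/tls/client-admin.crt \
  --key  /var/lib/rancher/k3s/server/tls/client-admin.key \
  https://10.42.43.70:6443/metrics
```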
I might have found something useful:
kubectl logs -n kube-system metrics-server-7cbbc464f4-wpvpq
is filled with:
"Failed to scrape node, timeout to access kubelet" err="Get \"<https://10.42.43.71:10250/metrics/resource>\": context deadline exceeded" node="k3s02" timeout="10s"
hmm, maybe not. I killed the pod, it restarted fine, and those log entries haven't returned at all (they had been showing up every 15 seconds or so)
hmm ok, maybe this is giving me a path to investigate. I killed all of the Prometheus pods as well and now I have just one node/endpoint showing as an error:
*Error scraping target:* Get "<https://10.42.43.71:6443/metrics>": context deadline exceeded
So this has to be something network-related: k3s02/10.42.43.71 is the node that I have running on the second host. But from the VM shell I can curl the endpoint completely fine?
I think I'm seeing a bit of a pattern: if the metrics/Prometheus server is deployed to pve01 (the Proxmox host machine itself, which I've deployed an agent role to), packets are dropped/time out to k3s01/k3s03, which are the two VMs on that host. And vice-versa if it's deployed to pve02 and k3s02, but I don't fully understand why yet.
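A quick way to line pod placement up with which targets time out (nothing assumed here beyond the component names matching):
```
# Show which node each scraping component is actually scheduled on.
kubectl get pods -A -o wide | grep -Ei 'prometheus|metrics-server'
```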
c
Ports not open between the two hosts? Or packets being dropped when leaving the CNI?
b
I'm thinking the latter, as:
• It's able to query the nodes on the other host; it's failing when trying to query the nodes on the same host (as VMs).
• I have zero issues when I query from the VM/host shell directly, so it must be that the traffic is coming from the CNI?
I've been trying to work out how to exec into the prometheus-server pod to ping/curl back out, but I'm struggling as it's busybox/alpine and I get a non-root user
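One way around the non-root/busybox limitation is an ephemeral debug container attached to the existing pod, so the checks come from the pod's own network namespace; the namespace and pod name below are placeholders:
```
# Attach a throwaway netshoot container to the Prometheus pod (shares its network namespace).
# <namespace> and the pod name are placeholders - substitute the real ones.
kubectl debug -it -n <namespace> prometheus-server-xxxxxxxx-xxxxx \
  --image=nicolaka/netshoot -- bash
```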
yeah ok, I think I've proved that now:
```
kubectl run debug-pod --rm -it --image=nicolaka/netshoot --overrides='{"spec": { "nodeSelector": {"kubernetes.io/hostname": "pve01"}}}' -- /bin/bash
If you don't see a command prompt, try pressing enter.
debug-pod:~# ping 10.42.43.70
PING 10.42.43.70 (10.42.43.70) 56(84) bytes of data.
^C
--- 10.42.43.70 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2039ms
debug-pod:~# ping 10.42.43.71
PING 10.42.43.71 (10.42.43.71) 56(84) bytes of data.
64 bytes from 10.42.43.71: icmp_seq=1 ttl=63 time=0.528 ms
64 bytes from 10.42.43.71: icmp_seq=2 ttl=63 time=0.163 ms
^C
--- 10.42.43.71 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 0.163/0.345/0.528/0.182 ms
debug-pod:~# ping 10.42.43.72
PING 10.42.43.72 (10.42.43.72) 56(84) bytes of data.
^C
--- 10.42.43.72 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4133ms
```
hmmm, my next thought is that I did reconfigure the Proxmox interfaces so that I can have multiple VLAN tags for a single VM, which means the main interface is now
vmbr0.43@vmbr0
where it was
vmbr0
when they were originally registered to the cluster
I'm guessing my next best step is to manually uninstall each agent node completely and then re-add it? This would take a bit of coordination to manage the Longhorn volumes, so I'm keen to know if there are steps I could try before that. I guess I could add a temporary node on what will be pve03, set up in the same way, and see if that works. I'm confused as to why networking is completely fine to the VM on the other host but not to the VMs on the same host; I would have thought the routes would be identical.
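If it does come to removing and re-adding an agent, the usual sequence is roughly the one below; the node name and token are placeholders, and the uninstall script path assumes the node was installed with the standard k3s install script:
```
# Drain and remove one agent at a time so Longhorn can rebuild replicas elsewhere.
kubectl drain <agent-node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <agent-node>

# On the agent itself (script is laid down by the standard k3s install script):
/usr/local/bin/k3s-agent-uninstall.sh

# Re-join, pointing at one of the servers; <token> is the cluster join token.
curl -sfL https://get.k3s.io | K3S_URL=https://10.42.43.70:6443 K3S_TOKEN=<token> sh -
```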
It was my Proxmox interfaces being misconfigured. I redid the VLANs at the interface level rather than at the bridge, and now bridging works between the CNI and the host VMs. I have green across the board in Prometheus and I haven't seen Portainer report
kubernetes.default.svc
as down since. Thanks for the rubber ducking @creamy-pencil-82913, particularly as the issue ended up not being related to or caused by k3s at all!
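For anyone who hits the same symptom, the working layout was along these lines. This is only a sketch of /etc/network/interfaces showing the cluster VLAN, with the NIC name, addresses, and VLAN 43 filled in as assumptions based on this thread:
```
# /etc/network/interfaces sketch: tag the VLAN on the physical NIC and bridge the
# tagged sub-interface, rather than creating vmbr0.43 on top of the bridge.
# "eno1" and the addresses are assumptions - adjust to the actual hardware.
auto eno1
iface eno1 inet manual

auto eno1.43
iface eno1.43 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.42.43.10/24
    gateway 10.42.43.1
    bridge-ports eno1.43
    bridge-stp off
    bridge-fd 0
```
With a layout like this the VMs attach to the bridge untagged and land in VLAN 43 alongside the host, which seems to be what let traffic flow between the host-level agent and the co-located VMs.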