# k3s
c
https://github.com/k3s-io/k3s/issues/6664#issuecomment-1360878062 maybe? Did you do this at some point in the past and forget to make it persistent?
l
Hmm, no I've never run that command before
I've never had this issue before
c
did you reboot to apply a kernel upgrade or something?
give it a try
l
Yes that's exactly what happened
c
likely then
since it’s a kernel bug in the network interface drivers that makes that necessary
l
6.1.0-18-amd64 to 6.1.0-20-amd64
c
not super likely to change on a minor patch like that though :/
but try it anyway
l
It sounds like I need to run this only on the node where coredns is? That's not the node I rebooted
Just going off of the comment you linked, there
Also, would I do that to both ipv4 and ipv6 flannel interfaces?
(coredns is only using ipv6)
Also, I'm not actually seeing errors like "UDP: bad checksum"
c
only using ipv6, huh
you won’t see errors anywhere unless you’re tcpdumping the traffic, are you?
try it for both the interfaces, yes
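something along these lines, assuming it's the usual vxlan tx-checksum-offload workaround (double-check the exact flag against the linked comment):
$ sudo ethtool -K flannel.1 tx-checksum-ip-generic off
$ sudo ethtool -K flannel-v6.1 tx-checksum-ip-generic off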
l
No I'm not, sorry, the flannel issue linked in there said that was in the dmesg
Okay, giving this a shot on both nodes (the one with problems and the one hosting coredns)
No change, as far as I can tell. Do I need to restart anything?
Here's a traceroute to the kube-dns endpoint IP on a container running on the node that works:
test2:/# traceroute fda5:f00d:c:0:1::3ec
traceroute to fda5:f00d:c:0:1::3ec (fda5:f00d:c:0:1::3ec), 30 hops max, 72 byte packets
 1  fda5:f00d:c:0:2::1 (fda5:f00d:c:0:2::1)  0.023 ms  0.012 ms  0.010 ms
 2  fda5:f00d:c:0:1:: (fda5:f00d:c:0:1::)  0.303 ms  0.217 ms  0.128 ms
 3  fda5-f00d-c-0-1--3ec.kube-dns.kube-system.svc.cluster.local (fda5:f00d:c:0:1::3ec)  0.226 ms  0.132 ms  0.215 ms
Here's one from the one that's busted:
test:/# traceroute fda5:f00d:c:0:1::3ec
traceroute to fda5:f00d:c:0:1::3ec (fda5:f00d:c:0:1::3ec), 30 hops max, 72 byte packets
 1  fda5:f00d:c::1 (fda5:f00d:c::1)  0.014 ms  0.016 ms  0.010 ms
 2  *  *  *
 3  *  *  *
 4  *  *  *
 5  *  *  *
 6  *  *  *
 7  *  *  *
 8  *  *^C
It's not just DNS, either. The busted node can't reach any pod
With the same traceroute
It feels like something is broken at the flannel level
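Probably worth dumping the vxlan details on both ends too, something like:
$ ip -d link show flannel-v6.1    # -d prints the vxlan specifics (VNI, local addr, dstport) plus the iface MAC
$ ip -d link show flannel.1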
Just digging through the k3s log looking for something weird, I found this "SHOULD NOT HAPPEN" near the startup:
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.821299    1985 event.go:307] "Event occurred" object="kube-system/coredns" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ApplyingManifest" message="Applying manifest at \"/var/lib/rancher/k3s/server/manifests/coredns.yaml\""
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.842578    1985 kubelet_node_status.go:70] "Attempting to register node" node="s1"
Apr 15 14:49:44 s1 k3s[1985]: E0415 14:49:44.844576    1985 fieldmanager.go:155] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (/s1; /v1, Kind=Node) to smd typed: .status.addresses: duplicate entries for key [type=\"InternalIP\"]" versionKind="/, Kind=" namespace="" name="s1"
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.844759    1985 server.go:154] "Starting Kubernetes Scheduler" version="v1.28.8+k3s1"
Actually, looking at the s1 node, it looks okay though: one IPv4 InternalIP, one IPv6 InternalIP
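A quick way to double-check that, something like:
$ kubectl get node s1 -o jsonpath='{.status.addresses}'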
Okay fun new datapoint: ipv4 works fine
ipv6 is the issue
@creamy-pencil-82913 I tried reverting the kernel update, by the way. It didn't change anything either
I see another interesting error on this node:
Apr 16 09:39:08 s1 k3s[2045]: E0416 09:39:08.470831    2045 service_health.go:145] "Failed to start healthcheck" err="listen tcp 0.0.0.0:32316: bind: address already in use" node="s1" service="nginx-ingress/nginx-ingress-controller" port=32316
And it's true, that port appears taken by k3s server itself
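Something like this should show who owns it (needs root to see the process name):
$ sudo ss -tlnp | grep 32316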
Also: this node is drained and cordoned. It has a few daemonsets on it is all. Why is it trying to do stuff relating to the ingress controller? It's not even running on it
c
I suspect that’s the service node port? node ports are available on all nodes, by default.
l
It's the healthcheck node port, yeah
But... what is clashing here?
The other nodes aren't having that error, as far as I can tell
Maybe that's just a red herring. The current state is that this node cannot ping the address of any other node's flannel-v6.1 iface, and no other node can ping this one's flannel-v6.1 iface. They can ping each other outside of k3s, though
And they can ping each other's flannel.1 iface
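The kind of check I mean (addresses below are placeholders):
$ ip -6 addr show dev flannel-v6.1    # on the target node, to find its iface address
$ ping -6 -c 3 fda5:f00d:c:0:1::      # then from the other node, ping that address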
Okay check this out: On s1 (the bad node), we can see the other nodes (s2 and s3) in the forwarding table on the flannel iface:
$ sudo bridge fdb show dev flannel-v6.1
f6:46:0d:3c:21:48 dst fda5:868e:e104:310::20 self permanent
ea:47:ac:42:44:2f dst fda5:868e:e104:310::30 self permanent
However, on the other nodes, the entry for s1 has the wrong MAC
I'm rebooting s1 again to test, but any chance that MAC is changing on me?
Yes it is. The flannel-v6.1 iface keeps changing its MAC
Something isn't communicating that to the rest of the cluster
This makes me think maybe etcd is having a problem, but I don't know how to debug etcd within k3s
c
@bland-account-99790 any ideas?
etcd is honestly the last thing I would go poking at.
l
Oh good 😛
I just assumed that was the piece that was supposed to tell the rest of the cluster about the flannel iface's mac address
kubectl describe node s1 shows the proper MAC in the flannel.alpha.coreos.com/backend-v6-data annotation
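The raw value can be pulled with something like this (same idea with the backend-data annotation for the v4 iface):
$ kubectl get node s1 -o jsonpath="{.metadata.annotations['flannel\.alpha\.coreos\.com/backend-v6-data']}"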
c
flannel updates the annotation on the node to record that IIRC. I don’t think it’s normal that it would change all the time though.
l
Interesting. I wasn't super surprised to see it changing given that it's a virtual iface; to keep a static MAC it must be based on something or recorded somewhere, yes?
c
yeah I’m not that far in the weeds with flannel. Manuel who I tagged up above is one of the maintainers and would know better.
l
Okay very good. Is he on an overlapping tz, or should I log an issue at this point?
This is starting to sound a bit like https://github.com/k3s-io/k3s/issues/4188
I'm on v1.28.8+k3s1 though, so it can't be exactly that
@bland-account-99790/@creamy-pencil-82913 making the forwarding tables on s2 and s3 use the proper mac is making everything work again
Not ideal that we need to do that manually though, obviously
Just rebooted another node, and its mac address changed as well, with the same effects
So now I'm just taking care of these tables manually 😬
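For anyone hitting the same thing, the manual fix is roughly this on each of the other nodes (MACs and address are placeholders for s1's old/new flannel-v6.1 MAC and its node IPv6; see bridge-fdb(8) for the exact flags):
$ sudo bridge fdb del 02:00:00:00:00:01 dev flannel-v6.1 self
$ sudo bridge fdb add 02:00:00:00:00:02 dev flannel-v6.1 dst fda5:868e:e104:310::10 self permanent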
I'll log a bug
Note, by the way, that flannel.1's mac does NOT change
c
flannel-io/flannel may be a better place for that but we’ll see
l
Yeah let's see what @bland-account-99790 thinks, happy to log it over there. My flannel debugging capabilities are pretty limited by k3s
So I won't be able to respond to their questions very well 😛
c
what flannel backend are you using?
l
vxlan
c
k
l
I'll add that to the issue
b
Good morning! Unfortunately, we live in different time zones 😞. I replied to the issue; I think the problem is that the PR fixing it forgot the v6 interface
l
Morning @bland-account-99790! Afternoon for you, eh? No problem, I'm very used to globally distributed teams 🙂 . Github issues are definitely the way to go, especially once I get it reduced to the actual problem
@creamy-pencil-82913 is there a way in k3s to be able to see the flannel info messages?
c
yes they’re just in the k3s log with everything else