# k3s
c
https://github.com/k3s-io/k3s/issues/6664#issuecomment-1360878062 maybe? Did you do this at some point in the past and forget to make it persistent?
l
Hmm, no I've never run that command before
I've never had this issue before
c
did you reboot to apply a kernel upgrade or something?
give it a try
l
Yes that's exactly what happened
c
likely then
since it’s a kernel bug in the network interface drivers that makes that necessary
l
6.1.0-18-amd64 to 6.1.0-20-amd64
c
not super likely to change on a minor patch like that though :/
but try it anyway
l
It sounds like I need to run this only on the node where coredns is? That's not the node I rebooted
Just going off of the comment you linked, there
Also, would I do that to both ipv4 and ipv6 flannel interfaces?
(coredns is only using ipv6)
Also, I'm not actually seeing errors like "UDP: bad checksum"
c
only using ipv6, huh
you won’t see errors anywhere unless you’re tcpdumping the traffic, are you?
try it for both the interfaces, yes
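something along these lines, assuming it's the usual vxlan tx-checksum-offload workaround (double-check the exact flag against the linked comment):
$ sudo ethtool -K flannel.1 tx-checksum-ip-generic off
$ sudo ethtool -K flannel-v6.1 tx-checksum-ip-generic off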
l
No I'm not, sorry, the flannel issue linked in there said that was in the dmesg
Okay, giving this a shot on both nodes (the one with problems and the one hosting coredns)
No change, as far as I can tell. Do I need to restart anything?
Here's a traceroute to the kube-dns endpoint IP on a container running on the node that works:
test2:/# traceroute fda5:f00d:c:0:1::3ec
traceroute to fda5:f00d:c:0:1::3ec (fda5:f00d:c:0:1::3ec), 30 hops max, 72 byte packets
 1  fda5:f00d:c:0:2::1 (fda5:f00d:c:0:2::1)  0.023 ms  0.012 ms  0.010 ms
 2  fda5:f00d:c:0:1:: (fda5:f00d:c:0:1::)  0.303 ms  0.217 ms  0.128 ms
 3  fda5-f00d-c-0-1--3ec.kube-dns.kube-system.svc.cluster.local (fda5:f00d:c:0:1::3ec)  0.226 ms  0.132 ms  0.215 ms
Here's one from the one that's busted:
test:/# traceroute fda5:f00d:c:0:1::3ec
traceroute to fda5:f00d:c:0:1::3ec (fda5:f00d:c:0:1::3ec), 30 hops max, 72 byte packets
 1  fda5:f00d:c::1 (fda5:f00d:c::1)  0.014 ms  0.016 ms  0.010 ms
 2  *  *  *
 3  *  *  *
 4  *  *  *
 5  *  *  *
 6  *  *  *
 7  *  *  *
 8  *  *^C
It's not just DNS, either. The busted node can't reach any pod
With the same traceroute
It feels like something is broken at the flannel level
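Probably worth dumping the vxlan details on both ends too, something like:
$ ip -d link show flannel-v6.1    # -d prints the vxlan specifics (VNI, local addr, dstport) plus the iface MAC
$ ip -d link show flannel.1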
Just digging through the k3s log looking for something weird, I found this "SHOULD NOT HAPPEN" near the startup:
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.821299    1985 event.go:307] "Event occurred" object="kube-system/coredns" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ApplyingManifest" message="Applying manifest at \"/var/lib/rancher/k3s/server/manifests/coredns.yaml\""
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.842578    1985 kubelet_node_status.go:70] "Attempting to register node" node="s1"
Apr 15 14:49:44 s1 k3s[1985]: E0415 14:49:44.844576    1985 fieldmanager.go:155] "[SHOULD NOT HAPPEN] failed to update managedFields" err="failed to convert new object (/s1; /v1, Kind=Node) to smd typed: .status.addresses: duplicate entries for key [type=\"InternalIP\"]" versionKind="/, Kind=" namespace="" name="s1"
Apr 15 14:49:44 s1 k3s[1985]: I0415 14:49:44.844759    1985 server.go:154] "Starting Kubernetes Scheduler" version="v1.28.8+k3s1"
Actually, looking at the s1 node, it looks okay though: one IPv4 InternalIP, one IPv6 InternalIP
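A quick way to double-check that, something like:
$ kubectl get node s1 -o jsonpath='{.status.addresses}'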
Okay fun new datapoint: ipv4 works fine
ipv6 is the issue
@creamy-pencil-82913 I tried reverting the kernel update, by the way. It didn't change anything either
I see another interesting error on this node:
Apr 16 09:39:08 s1 k3s[2045]: E0416 09:39:08.470831    2045 service_health.go:145] "Failed to start healthcheck" err="listen tcp 0.0.0.0:32316: bind: address already in use" node="s1" service="nginx-ingress/nginx-ingress-controller" port=32316
And it's true, that port appears taken by k3s server itself
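Something like this should show who owns it (needs root to see the process name):
$ sudo ss -tlnp | grep 32316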
Also: this node is drained and cordoned. It has a few daemonsets on it is all. Why is it trying to do stuff relating to the ingress controller? It's not even running on it
c
I suspect that’s the service node port? node ports are available on all nodes, by default.
l
It's the healthcheck node port, yeah
But... what is clashing here?
The other nodes aren't having that error, as far as I can tell
Maybe that's just a red herring. The current state is that this node cannot ping the address of any other node's flannel-v6.1 iface, and no other node can ping this one's flannel-v6.1 iface. They can ping each other outside of k3s, though
And they can ping each other's flannel.1 iface
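The kind of check I mean (addresses below are placeholders):
$ ip -6 addr show dev flannel-v6.1    # on the target node, to find its iface address
$ ping -6 -c 3 fda5:f00d:c:0:1::      # then from the other node, ping that address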
Okay check this out: On s1 (the bad node), we can see the other nodes (s2 and s3) in the forwarding table on the flannel iface:
$ sudo bridge fdb show dev flannel-v6.1
f6:46:0d:3c:21:48 dst fda5:868e:e104:310::20 self permanent
ea:47:ac:42:44:2f dst fda5:868e:e104:310::30 self permanent
However, on the other nodes, the entry for s1 has the wrong MAC
I'm rebooting s1 again to test, but any chance that MAC is changing on me?
Yes it is. The flannel-v6.1 iface keeps changing its MAC
Something isn't communicating that to the rest of the cluster
This makes me think maybe etcd is having a problem, but I don't know how to debug etcd within k3s
c
@bland-account-99790 any ideas?
etcd is honestly the last thing I would go poking at.
l
Oh good 😛
I just assumed that was the piece that was supposed to tell the rest of the cluster about the flannel iface's mac address
kubectl describe node s1 shows the proper MAC in the flannel.alpha.coreos.com/backend-v6-data annotation
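The raw value can be pulled with something like this (same idea with the backend-data annotation for the v4 iface):
$ kubectl get node s1 -o jsonpath="{.metadata.annotations['flannel\.alpha\.coreos\.com/backend-v6-data']}"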
c
flannel updates the annotation on the node to record that IIRC. I don’t think it’s normal that it would change all the time though.
l
Interesting. I wasn't super surprised to see it changing given that it's a virtual iface; to keep a static MAC it must be based on something or recorded somewhere, yes?
c
yeah I’m not that far in the weeds with flannel. Manuel who I tagged up above is one of the maintainers and would know better.
l
Okay very good. Is he on an overlapping tz, or should I log an issue at this point?
This is starting to sound a bit like https://github.com/k3s-io/k3s/issues/4188
I'm on v1.28.8+k3s1 though, so it can't be exactly that
@bland-account-99790/@creamy-pencil-82913 making the forwarding tables on s2 and s3 use the proper mac is making everything work again
Not ideal that we need to do that manually though, obviously
Just rebooted another node, and its mac address changed as well, with the same effects
So now I'm just taking care of these tables manually 😬
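For anyone hitting the same thing, the manual fix is roughly this on each of the other nodes (MACs and address are placeholders for s1's old/new flannel-v6.1 MAC and its node IPv6; see bridge-fdb(8) for the exact flags):
$ sudo bridge fdb del 02:00:00:00:00:01 dev flannel-v6.1 self
$ sudo bridge fdb add 02:00:00:00:00:02 dev flannel-v6.1 dst fda5:868e:e104:310::10 self permanent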
I'll log a bug
Note, by the way, that flannel.1's mac does NOT change
c
flannel-io/flannel may be a better place for that but we’ll see
l
Yeah let's see what @bland-account-99790 thinks, happy to log it over there. My flannel debugging capabilities are pretty limited by k3s
So I won't be able to respond to their questions very well 😛
c
what flannel backend are you using?
l
vxlan
c
k
l
I'll add that to the issue
b
Good morning! Unfortunately, we live in different time zones 😞. I replied to the issue; I think the problem is that the PR fixing it forgot the v6 interface
l
Morning @bland-account-99790! Afternoon for you, eh? No problem, I'm very used to globally distributed teams 🙂 . Github issues are definitely the way to go, especially once I get it reduced to the actual problem
@creamy-pencil-82913 is there a way in k3s to be able to see the flannel info messages?
c
yes they’re just in the k3s log with everything else