# rke2
c
Where is DNS failing? Can the pods reach the in-cluster DNS service (coredns)? Can coredns reach its upstream DNS servers?
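(A rough way to check both halves of that question, as a sketch only; the busybox image tag and the k8s-app=kube-dns label are assumptions about a stock rke2 coredns install:)
# Can a pod reach the in-cluster DNS service at all?
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
# Any upstream/forwarding errors from coredns itself?
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50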
p
I enabled logs in the configmap for coredns and then ran
kubectl logs -f …
and tried to do some DNS requests and I don’t see any traffic. At this point, it doesn’t even look like the request hits the nodes.
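(For reference, a sketch of that logging setup; the configmap name below is the usual rke2 one but may differ on your cluster:)
# Add the "log" plugin to the Corefile so coredns prints every query it receives
kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns
# Then tail the coredns pods while making test lookups
kubectl -n kube-system logs -f -l k8s-app=kube-dns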
c
can you resolve kubernetes records, such as kubernetes.default.svc? that one should always exist.
Or is it just external records that you can’t hit?
p
host updates.suse.com
fails but
host updates.suse.com 1.1.1.1
works.
So try
nslookup kubernetes.default.svc
?
c
I like dig better, but sure
or
kubernetes.default.svc.cluster.local.
if you want to try a FQDN, without depending on search behavior
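(As a sketch, assuming the default rke2 service CIDR, you can also aim the query straight at the cluster DNS service IP:)
# 10.43.0.10 is the rke2 default cluster DNS service IP -- verify with:
kubectl -n kube-system get svc -l k8s-app=kube-dns
# then query it directly with the FQDN:
dig @10.43.0.10 kubernetes.default.svc.cluster.local. +short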
p
# dig kubernetes.default.svc

; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> kubernetes.default.svc
;; global options: +cmd
;; connection timed out; no servers could be reached
c
OK, so you can’t reach the coredns service. Do you have the vxlan ports open between nodes?
most common cause of that is something is causing inter-node cluster traffic to be dropped
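(One way to see whether the encapsulated traffic is making it across, as a rough sketch; flannel's VXLAN runs over UDP 8472, and eth0 below stands in for your node's uplink interface:)
# On the node running the client pod: is VXLAN traffic leaving?
sudo tcpdump -ni eth0 udp port 8472
# Run the same capture on the node hosting coredns to see whether the packets ever arrive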
p
The firewall on the box is off, and there shouldn’t be any restrictions between VMs.
Does vxlan require jumbo frames on the physical network, i.e. are there any MTU requirements for the physical switches?
It seems we have a Dell switch that is giving us issues with MTUs over 1500.
c
the default MTU for the vxlan interfaces under the Canal CNI is 1450, which should result in an on-the-wire frame size of 1500 when it leaves the node. If you’re running in VMs, it’s possible that VMware is adding further encapsulation onto your traffic, and as a result the traffic is exceeding the MTU or getting fragmented and dropped.
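(A rough way to test that theory from the nodes themselves; <other-node-ip> is a placeholder:)
# Check the vxlan interface MTU on each node
ip -d link show flannel.1
# Path MTU probe: 1472 bytes of payload + 28 bytes of IP/ICMP headers = a 1500-byte frame,
# with fragmentation forbidden. If this fails but smaller sizes work, the path can't carry 1500.
ping -M do -s 1472 -c 3 <other-node-ip>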
p
Possible. We’re looking into that. Is there anything else I can test/tweak in case it isn’t the MTU issue?
Or can I modify that config in my cluster to push the MTU down to 1400?
c
yeah, you can try using a HelmChartConfig to modify the chart values. I’m not sure what will happen if you try to change it on a running cluster; you might need to do it from the get-go or at least kill and recreate all the pods to get it set properly.
p
Okay, anything else I can try?
c
https://docs.rke2.io/helm/#customizing-packaged-components-with-helmchartconfig in this case you’d want to set the rke2-canal chart’s calico.vethuMTU value
other than raising the MTU on the switch? no
that’s where I would start if you’re seeing traffic leave one node but not show up on the other
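(For anyone who needs it later, a sketch of that HelmChartConfig as a manifest dropped on a server node; double-check the value name against the rke2-canal chart before relying on it:)
cat <<'EOF' | sudo tee /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-canal
  namespace: kube-system
spec:
  valuesContent: |-
    calico:
      vethuMTU: 1400
EOF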
p
Okay. Thanks!
l
if MTU doesn't help, might try disabling tx checksum offloading.
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
I've personally only ever seen this issue with Ubuntu, but it's a stab in the dark if nothing else helps
p
Thanks for that, Zach. At this point I’m doing a lot of blind stabbing. 😉
@little-actor-95014 That fixed my problem!!!!
Now I need to understand what I just did, and whether this change will persist across reboots.
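(One way to make it stick, as a sketch: flannel.1 is recreated by the CNI, so a udev rule that reapplies the setting whenever the interface appears is a common workaround. The rule file name here is arbitrary and the ethtool path may differ on your distro.)
cat <<'EOF' | sudo tee /etc/udev/rules.d/90-flannel-tx-off.rules
ACTION=="add", SUBSYSTEM=="net", KERNEL=="flannel.1", RUN+="/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off"
EOF
sudo udevadm control --reload-rules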
l
From my understanding of the issues, there’s a kernel bug that causes the checksums on VXLAN’s UDP traffic to be computed incorrectly, so the traffic gets dropped. I’ve had mixed results reproducing the issue. RKE2 issue: https://github.com/rancher/rke2/issues/1541 Flannel issue: https://github.com/flannel-io/flannel/issues/1279
The weirdest part for me has been that I've reinstalled things from scratch to repro and haven't been able to. And that most issues I find that reference the bug and potential workarounds say it was fixed in K8s/RKE2 a year-ish ago
p
Huh. Maybe I’m just using an older kernel where it’s still broken?
These server nodes are running a 4.18 kernel.
l
It’s possible Oracle hasn’t backported the patch; it seems it was fixed in 5.7, but the bug was introduced back in 2006 🙂
p
Possible. Thank you so much for that help. That just saved me so much time in troubleshooting.