# k3s
a
a
I am installing it with the following options: shell: "curl -sfL https://get.k3s.io | K3S_TOKEN={{ k3s_token }} INSTALL_K3S_EXEC='server --cluster-init -i {{ internal_ip }} --flannel-iface {{ nic_name }}.4000 --node-external-ip {{ internal_ip }} --disable=servicelb --disable-network-policy' sh -"
where internal_ip is the internal IP address on VLAN 4000 (192.168.x.x). That is the same VLAN in which the LoadBalancer IPs are presented, in the 159.x.x.x range.
Is there anyone in this slack providing support?
m
You might try wireguard instead of vxlan for flannel.
a
Hi Scott. Yes, I tried that just out of desperation, and even host-gw mode; all of them do exactly the same thing.
m
I wonder if it's a routing issue. Can you curl from the same layer2 vlan?
"The pods are also accessible from inside the cluster private IP range via CURL from another pod." suggests that you can
In this case, it seems like those IPs are missing a route from the outside world
a
Yes, I can run curl on one of the hosts (192.168.x.x) and it reaches the pod just fine. Same from another pod in the cluster.
m
Sounds like you have layer2 connectivity but don't have the necessary routing for outside requests to find their way in
a
That's what it feels like to me. But where would that route come from? I can boot one of those hosts in recovery mode (light OS) and configure Netplan to directly assign the LB IP to the host, and it is then reachable across the Internet.
m
I don't have any experience with Hetzner at all, or with its BGP policy for bare metal.
a
so it feels like for some reason the IP-to-MAC mapping is never propagated beyond the local vSwitch
m
Right - that makes sense if you don't have BGP routing to it
So you're able to connect over layer2, but not from outside of it
a
I haven't actually tried BGP because they indicated it wouldn't work, I am only using L2, but I have a mind to try BGP
m
If you don't do BGP, MetalLB will only work over layer2
So that explains your problem
a
that makes sense
I was following this guide which implied it worked: https://mlohr.com/kubernetes-cluster-on-hetzner-bare-metal-servers/
I'll try doing a BGP advertisement and see what happens now.
m
You need to have BGP setup with your network ops team or provider and then do this: https://metallb.universe.tf/configuration/#bgp-configuration
Layer2 can work, by the way, if you expose everything through an external balancer that can talk on that subnet and is exposed somehow to the outside world so it acts as a proxy. That doesn't sound like what you're trying to architect, though
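For reference, a minimal sketch of what the MetalLB BGP setup linked above might look like with the CRD-based API (MetalLB v0.13+). The ASNs and peer address are placeholders, not values from this conversation; the pool reuses the public /29 that comes up later in the thread:
```
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512          # ASN assigned to the cluster (placeholder)
  peerASN: 64513        # ASN of the provider/router side (placeholder)
  peerAddress: 10.0.0.1 # BGP-speaking router to peer with (placeholder)
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public-pool
  namespace: metallb-system
spec:
  addresses:
    - 159.69.172.24/29
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: public-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - public-pool
```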
a
Hmm, it feels like if Hetzner tells me it won't work, I am going to struggle to get any AS information from them
m
But you can do what you're doing and proxy requests through something like haproxy, nginx, or a cloud provider load balancer
a
Yes, I was trying to avoid using their cloud LB because it is pretty limited in what it can do
m
Yup, they tend to be pretty limited. Perhaps you can install haproxy on that subnet and expose services through it.
a
I'm very familiar with the Azure LB, but it is really designed for fronting Azure VNets and not external networks
I am using HA Proxy in the cluster for some things, but the risk is I end up with a single point of failure still
m
Or connect a VPN to your internal network and add routing that way, but then it'll pass through your network to Hetzner's which adds extra bandwidth and latency and dependencies.
You can do haproxy HA with keepalived
You route traffic through the VIP shared between the haproxy nodes.
That's what we're doing with our Rancher stuff
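As a rough illustration of that haproxy + keepalived pattern, a minimal keepalived config on the two haproxy boxes might look something like this; the interface name, VIP, and password are placeholders, and haproxy itself runs identically on both nodes while keepalived just moves the VIP:
```
vrrp_instance haproxy_vip {
    state MASTER            # BACKUP on the second haproxy node
    interface eth0          # interface carrying the VIP (placeholder)
    virtual_router_id 51
    priority 101            # lower (e.g. 100) on the BACKUP node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme  # placeholder
    }
    virtual_ipaddress {
        203.0.113.10/24     # the shared VIP that DNS points at (placeholder)
    }
}
```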
a
I might have to do something like that VPN in the interim. I've been bashing my head for days on how the L2 IP-to-MAC mapping got propagated to the routers, and the answer is it isn't, without BGP ๐Ÿ™‚
m
I've done it all of the above ways. It's job security having to learn about all this stuff ๐Ÿ™‚
a
In your HA proxy config, are those running in the cluster exposed via a Daemonset and node ports or something?
m
If you're not a DevOps Engineer yet you will be by the time you get it all setup ๐Ÿ˜‚
a
My head is hurting from everything I've had to learn in the last week alone to move my fully functioning dApp off Azure k8s, Mongo Atlas and Vercel onto dedicated hosts
m
I generally use ClusterIP and then expose through Ingress. DNS is pointed at haproxy which routes the request back to the cluster. cert-manager to handle the automatic letsencrypt for everything.
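A hedged sketch of that ClusterIP + Ingress + cert-manager pattern, assuming the Traefik ingress controller that k3s ships by default and a ClusterIssuer named letsencrypt-prod; the hostname and service name are placeholders:
```
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumes a ClusterIssuer with this name exists
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls          # cert-manager creates/renews this secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app          # a plain ClusterIP service
                port:
                  number: 80
```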
a
I started at Ingress, but moved to Services just to rule out Traefik. Will move back now that I know the issue
m
Since you are using MetalLB, you can use LoadBalancer-type services, and you'll need to add config in haproxy to route traffic to the MetalLB IP however you want to do that. But to do host-based routing you'll need to do layer7, so you'll want to have certbot or something like that on haproxy.
It's less work to not use MetalLB and use ClusterIP instead this way, since the ingress controller already handles the name based routing for you.
MetalLB is helpful if you have BGP routing and/or if you want to do something like hyperconverged or kubevirt with Multus and do full on VMs inside your k8s.
MetalLB is used in OpenStack+Kubernetes, for example
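For the MetalLB option above, a minimal haproxy fragment doing a plain layer-4 (TCP) pass-through to the LoadBalancer IP might look like this; the MetalLB IP here is a placeholder, and a real config also needs global/defaults sections:
```
frontend https_in
    bind *:443
    mode tcp
    option tcplog
    default_backend metallb_https

backend metallb_https
    mode tcp
    # forward straight to the MetalLB-assigned LoadBalancer IP;
    # TLS and host-based routing stay with the in-cluster ingress
    server lb1 159.69.172.27:443 check
```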
a
I started at ClusterIP before discovering MetalLB, but my issue there was I still needed an external load balancer to direct the traffic at the different instances on the nodes
m
If you don't want to do an external load balancer you can also set DNS A records to the nodes running your ingress handlers with a low TTL, but you risk disruptions when nodes go down since there's no active health checking that way
a
I started with that DNS round-robin solution too, but DNS is not fault-tolerance aware. If a node goes down, it will still keep sending the bad IP to clients
m
Absolutely correct. That's the problem that load balancers solve.
a
Yeah, exactly... so in your scenario, where do you normally sit your haproxy? On the same cluster, or an additional box?
m
Outside of the cluster somewhere that can talk to both the cluster and can listen publicly. The cluster doesn't need to be able to listen publicly - it just needs to be able to talk to haproxy and other cluster nodes.
a
Gotcha. Just had a thought: why does the route from an external client to my cluster work if I boot a host into recovery mode and create an 02-netplan.yaml with an IP address from any of my L2 IP ranges? What is advertising the route there? I am not setting up BGP
m
One thing to note is that if you don't use something like MetalLB, you'll need to set service type to ClusterIP if they're set to LoadBalancer in helm charts, or else they'll be stuck in pending forever.
โœ… 1
a
That would suggest the vSwitch feature in Hetzner is already doing the advertisement and presenting those interfaces to my hosts: https://docs.hetzner.com/robot/dedicated-server/network/vswitch
m
I'm guessing the node external IP has a route but your BGP address pool doesn't
The k3s default loadbalancer uses the node externalIP by default, which is why the DNS trick works
a
No, I can get to any of the 159.x.x.x addresses if I assign them as IPs in Netplan on a NIC
it is only when I let MetalLB assign the IP to a MAC dynamically that it doesn't work
m
Yeah, unless you do BGP, MetalLB will only work over Layer2.
a
at least I can ping that IP or traceroute to it, and see it reaches my server's NIC
m
Layer2 is link-local
You can do Layer2 to an external load balancer and expose things that way or setup BGP to get external routes in
a
I wonder if Ubuntu is then doing something different when you stick it into Netplan
m
BGP and layer2 are network-level stuff
Are you familiar with OSI model?
haproxy can run at layer 4 or 7. MetalLB without BGP runs at layer2.
layer4 = tcp/udp mode. layer7 = http/https mode. (in haproxy speak)
a
Yes, learned the OSI model about 20 years ago ๐Ÿ™‚
m
Yup - these are old tricks and they're still relevant ๐Ÿ™‚
Containers are also similar to bsd jail chroots
a
Definitely. The mystery for me is why it works in Netplan without K8s. Which suggests the route is advertised already by the vSwitch
m
Yeah, when you get into Hetzner, you've left my wheelhouse
a
their prices can't be beaten, really... so worth the pain
a
64GB RAM, 8 cores, 512GB NVMe dedicated servers which are very fast, for about 40 euros
m
If Hetzner is like AWS, you probably have a route section where you can define routes for your cloud networking to a gateway
a
No, you just order a subnet range for your existing VNET (vSwitch), and it is mapped via the primary interface of the server. So no routing config is accessible to the user, but it does work
m
"You can use any private IP addresses for free within the VLAN. Plus, you can order additional public subnets (IPv4 and IPv6) by going to the
IPs
menu tab."
Is this IP pool that you used public or private?
(Even if it is public, you still need a route)
a
Yep... so my 192.168.x.x range is in VLAN 4000, which I have just used. The IPs menu is where I have ordered a routable subnet in the 159.x.x.x range, which is mapped to all servers in the vSwitch
m
192.168.x.x is not public
a
no.. that is my private range... just for host to host comms for k3s
159.x.x.x is the LB public range
m
"*Public subnet* You need to create an additional routing table for the public subnet so you can configure another default gateway." Ref. https://docs.hetzner.com/robot/dedicated-server/network/vswitch/#traffic
That's interesting that it has you set the route table yourself and isn't providing it over DHCP.
a
I believe I am doing that in Netplan like this:
network:
  version: 2
  ethernets:
    enp41s0:
      dhcp4: no
  vlans:
    enp41s0.4000:
      id: 4000
      link: enp41s0
      mtu: 1400
      addresses:
        - 192.168.100.2/24
      routes:
        - to: 0.0.0.0/0
          via: 159.69.172.25
          table: 1
          on-link: true
      routing-policy:
        - from: 159.69.172.24/29
          to: 10.43.0.0/16
          table: 254
          priority: 0
        - from: 159.69.172.24/29
          to: 10.42.0.0/16
          table: 254
          priority: 0
        - from: 159.69.172.24/29
          table: 1
          priority: 10
        - to: 159.69.172.24/29
          table: 1
          priority: 10
m
Should be able to check with
route -n
and
ip r
a
I am at the limits of my Linux knowledge here. Glad I have you for help. This is route -n:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         65.109.113.129  0.0.0.0         UG    0      0        0 enp41s0
10.42.0.0       10.42.0.0       255.255.255.0   UG    0      0        0 flannel.1
10.42.0.0       0.0.0.0         255.255.0.0     U     0      0        0 flannel-wg
10.42.1.0       10.42.1.0       255.255.255.0   UG    0      0        0 flannel.1
10.42.2.0       0.0.0.0         255.255.255.0   U     0      0        0 cni0
159.69.172.0    0.0.0.0         255.255.255.0   U     0      0        0 enp41s0.4000
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.100.0   0.0.0.0         255.255.255.0   U     0      0        0 enp41s0.4000
m
I'm more used to RHEL and SUSE, so not very familiar with netplan
a
so shouldn't I be seeing a 0.0.0.0 via 159.69.172.25 here (gw)?
m
159.69.172.0   0.0.0.0        255.255.255.0  U    0     0       0 enp41s0.4000
What's your
ip r
look like
a
Odd. I am still seeing a flannel-wg route here, even though I disabled that again and went back to vxlan
default via 65.109.113.129 dev enp41s0 proto static onlink
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.0.0/16 dev flannel-wg scope link
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 dev cni0 proto kernel scope link src 10.42.2.1
159.69.172.0/24 dev enp41s0.4000 proto kernel scope link src 159.69.172.26
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.100.0/24 dev enp41s0.4000 proto kernel scope link src 192.168.100.2
m
no metrics?
a
that is all ip r comes back with for me
m
scope link on 159.69.172.0/24 confirms layer2 routing
a
that's very interesting. so that means it's an L2 link-local address only
m
So, that's as far as the host is concerned
a
going to reboot this server into recovery mode and see if it still has the working config, and try and run those commands there
m
Looks like you need to define the gateway for it if you want it to be publicly routable - https://docs.hetzner.com/robot/dedicated-server/network/vswitch/#server-configuration-linux
It looks like you followed the netplan docs but the netplan docs don't setup the gateway in the example
โœ… 1
a
Thanks for all your help, Scott. That has been incredibly helpful for me. I'm going to put it into recovery mode and see if the working config is there. If it isn't, I'll follow the Hetzner guide to set up routing to it, then run the route and ip r commands to check the output. From there I might be able to work out how to set up Netplan, though my problem may be that if I want the IPs to float, I can't assign them to any single host
๐Ÿ‘ 1
m
Good luck!
a
That sounds like my problem then. I wasn't fully familiar enough with the ip command to know I was missing anything.
Really appreciate it. I am so much further along now
๐Ÿ‘ 1
Odd. In recovery mode I followed the Hetzner docs, using the ip commands to set up the VLAN and the extra public IP. On the first ping from my remote laptop it responded, then immediately went to request timeout. Perhaps one of the other nodes got upset, or the network detected a spanning-tree loop and blocked it
m
Interesting. Sounds like progress, though. Does it still work over layer2?
It's possible ICMP isn't supported - you might try curl
Or you might try
traceroute -T
a
Hi, sorry, needed sleep ๐Ÿ™‚
It is still working via L2 from the other hosts, but not via L3. And it is working properly now remotely; it must have taken time to propagate
Just got to work out how to get that routing config into Netplan now, without hard coding the IP to a single host
Hey @miniature-salesclerk-33951 So I have had to add the IP to the NICs in Netplan, plus the routes, and now I have identical output from ip route and route -n on both a standalone host in recovery mode that is pingable, and a k3s node that is also pingable now and that I can traceroute to.
๐ŸŽ‰ 1
However, the traffic still does not get to the pod when called externally. I just get a connection timed out unless I access it from one of the nodes within the local network. This is pointing more and more at a CNI issue, as suspected by the MetalLB guys I spoke to.
m
Is it possible the mtu isn't set correctly? It might explain why ICMP gets through but not larger packets.
a
I have it set to 1400, but there is a note in that Hetzner article about Netplan not setting it. Though that article is talking about a much older version of Netplan
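One quick way to sanity-check whether 1400 really is the effective MTU on that path (assumes Linux iputils ping; 1372 bytes of payload plus 28 bytes of IP/ICMP headers equals 1400):
```
# Do-not-fragment pings against the public IP from earlier in the thread
ping -M do -s 1372 159.69.172.26   # should succeed if the path MTU is 1400
ping -M do -s 1373 159.69.172.26   # should fail with "message too long" if 1400 is the limit
```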
When I statically assign the IP to the hosts using Netplan or the ip commands, it does respond to pings, and if you do a traceroute you can see it goes to the Internet interface on VLAN 4000 as it should:
 1  192.168.1.1 (192.168.1.1)  11.420 ms  4.466 ms  4.265 ms
 2  185.232.119.252 (185.232.119.252)  19.061 ms  16.024 ms  15.818 ms
 3  185.232.119.144 (185.232.119.144)  19.660 ms  15.404 ms  185.232.119.146 (185.232.119.146)  15.642 ms
 4  185.232.119.128 (185.232.119.128)  17.313 ms  185.232.119.130 (185.232.119.130)  17.298 ms  185.232.119.128 (185.232.119.128)  16.695 ms
 5  195.66.227.209 (195.66.227.209)  18.449 ms  15.226 ms  26.395 ms
 6  core6.par.hetzner.com (213.239.252.169)  24.291 ms  25.731 ms  25.546 ms
 7  213-239-245-217.clients.your-server.de (213.239.245.217)  41.778 ms  core11.nbg1.hetzner.com (213.239.252.173)  36.272 ms  36.181 ms
 8  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.186)  37.504 ms  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.62)  33.721 ms  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.186)  36.834 ms
 9  static.26.172.69.159.clients.your-server.de (159.69.172.26)  61.126 ms  57.870 ms  56.397 ms
If, however, I do a traceroute to an IP managed by MetalLB, then it hits the frontend interface on 65.x (but doesn't respond to pings, which I think is right for MetalLB):
 1  192.168.1.1 (192.168.1.1)  6.053 ms  4.201 ms  5.441 ms
 2  185.232.119.252 (185.232.119.252)  17.600 ms  15.897 ms  16.190 ms
 3  185.232.119.144 (185.232.119.144)  17.489 ms  15.798 ms  185.232.119.146 (185.232.119.146)  14.850 ms
 4  185.232.119.130 (185.232.119.130)  59.486 ms  14.223 ms  12.832 ms
 5  195.66.227.209 (195.66.227.209)  14.902 ms  14.471 ms  16.598 ms
 6  core6.par.hetzner.com (213.239.252.169)  23.860 ms  23.795 ms  22.142 ms
 7  core11.nbg1.hetzner.com (213.239.252.173)  38.614 ms  core12.nbg1.hetzner.com (213.239.252.253)  36.700 ms  35.750 ms
 8  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.186)  38.128 ms  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.62)  39.776 ms  vswitchgw.juniper2.dc1.nbg1.hetzner.com (213.239.245.186)  38.918 ms
 9  static.48.13.108.65.clients.your-server.de (65.108.13.48)  62.170 ms  56.627 ms  55.951 ms
It also looks like statically assigning the IP to the NIC, which is required to populate the default route, prevents that IP from being allocated by MetalLB. It skipped it and started at .27 instead
Plus, all of my networking is running on VLAN 4000 on the 192.168.x.x addresses. That includes complex apps like Mongo, Kafka, ArgoCD, etc., all working just fine without errors. It is only this ingress that has issues
m
That makes sense - you'd have an IP conflict.
It might be something where you need to set the gateway in your MetalLB config for the ip address pool
a
There isn't a gateway option in MetalLB, is there? I can't see anything like that in the IPAddressPool or L2Advertisement CRDs
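For reference, the L2 side of MetalLB is configured with just these two objects, and neither CRD exposes a gateway field; the pool name is a placeholder and the pool itself has the same IPAddressPool shape shown earlier in the thread:
```
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: public-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - public-pool       # references an IPAddressPool holding the 159.x.x.x range
```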
m
Looking at the docs, if there's an externalTrafficPolicy on the service, that could be a factor, too
I ran into this recently on JupyterHub helm deployments
a
I have externalTrafficPolicy set to Cluster. My understanding is all that controls is how the CNI (flannel) routes the traffic, either to the same node or to any node
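For context, a hedged sketch of the Service shape being discussed (name, selector, and ports are placeholders). With Local, only nodes that host a ready endpoint answer for the LB IP and the client source IP is preserved; with Cluster, any node can accept the traffic and forward it on:
```
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # or Local
  selector:
    app: my-app
  ports:
    - port: 443
      targetPort: 8443
```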
m
Yeah, I was confusing that with egress policy, which is different
a
Trying it set to Local, though I don't think it will help
Yep, that makes no difference. It does get to the node though, so it isn't an external routing problem, which also suggests MetalLB is fine. I can see the packets being received with tcpdump, but they're just not forwarded to the pods
Something very odd: using mtr with --tcp from my laptop to any service port I am exposing via the LB shows the packets stop at the vSwitch network device (Juniper) just before my server. If I run the same test against a port that the LB is not listening on, then it also shows the host's reverse DNS name and IP as the last hop. I can change the service port while it is running, and as soon as I do, the packet will then get through. Doesn't matter what the port is, same result. It is like the host just swallows the packet if it is being listened for, without responding back to mtr
m
that's weird
Oh!
"Since wireguard is a Layer3 vpn, almost all load-balancers will not work, this includes kube-vip and metalLB." - https://www.reddit.com/r/selfhosted/comments/mu6et4/has_anyone_setup_k3s_over_wireguard_is_it_possible/ So you might try going back to vxlan
a
Thanks. I am using VXLAN already. I only switched to wireguard briefly a few days ago to rule out vxlan issues ๐Ÿ˜ž
Dumb question time (I'm resorting to those now). What is the difference between agents and servers? I only have 3 nodes and want HA for things like etcd, so I need 3x servers, right, and no agents?
m
Generally in Kubernetes of any flavor, you have three control-plane nodes that also run etcd. You have worker nodes outside of that that don't run etcd, which you schedule workloads on. k3s is a little special in that it can run on just one node. It is possible to have your control-plane nodes schedule workloads if all you have is three machines to work with.
So there is a minimum of 3 nodes necessary to do HA over etcd
If you keep those 3 servers schedulable, then they are also functioning as worker nodes. Not ideal for larger production, but probably OK for a quick prototype or home-lab situation.
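To make the server/agent distinction concrete, hedged install sketches in the same style as the installer command earlier in the thread; the token and server address are placeholders:
```
# First server: starts the cluster with embedded etcd
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --cluster-init

# Additional servers: join the etcd cluster (3 servers total gives etcd HA)
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://<first-server>:6443

# A worker-only agent, if you ever add one: no etcd, no control plane
curl -sfL https://get.k3s.io | K3S_URL=https://<first-server>:6443 K3S_TOKEN=<token> sh -
```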
a
Thanks for the confirmation, that is what I expected. These are monster nodes with a lot of hardware at their disposal to run a single app, so it sounds ideal for my scenario ๐Ÿ˜‰
๐Ÿ™‚ 1
I'm still puzzled by the fact I get back more responses than I am sending. This could be corrupting the transmission for more complex protocols like HTTP.
ARPING 159.69.172.28 from 192.168.100.2 enp41s0.4000
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  2.768ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  0.878ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  4.089ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  0.888ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  13.128ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  0.930ms
Unicast reply from 159.69.172.28 [A8:A1:59:94:10:28]  5.925ms
Sent 4 probes (1 broadcast(s))
Received 7 response(s)
I just deleted the advertisement, rebooted my nodes, and recreated the L2 advertisement. Now I can see double the responses on two of the three nodes, and on the third node, which actually has the IP, it isn't resolving
m
I think that makes sense. The assigned IP should listen on any nodes which have a MetalLB controller. It's possible that the metallb controllers were currently in a deployment rollout and the old pods hadn't been pruned off yet on two of them?
a
I completely rebuilt the OS and redeployed, this time using Calico as the networking stack, to rule out Flannel, which the networking guys were convinced was the problem. Almost the same issue, but a little more interesting. I added a new host into the mix and repurposed one of the others. Before, each of the three hosts was in the same location, but each in a different data centre. The new three hosts are spread across only 2 data centres. The two hosts in the same DC can arping the LoadBalancer IP. The other host, in a different DC, cannot. This vSwitch is designed to allow you to connect all your servers together in the same VLAN, even in different countries. But MetalLB is not propagating any MAC address changes beyond the local DC switch, which is probably right. I've no idea how this can get resolved. I think I can request co-location, which would introduce a single point of failure, and may still not advertise the IP beyond the switch.
Oh, this is actually a little more complex. The two hosts that can arping the address are in the same DC, but they are NOT in the same DC as where the address resolves to. The host that cannot arping the address is the one hosting it. Also, the other host that is not in the cluster, but is in the same VLAN and a 3rd DC, also cannot arping them
m
I'm curious why you replaced flannel with Calico and not Cilium? With Cilium, you can remove kube-proxy from the equation, which might help.
There's a guide for k3s + cilium + metallb https://cilium.io/blog/2020/04/29/cilium-with-rancher-labs-k3s/
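A hedged sketch of that k3s + Cilium (kube-proxy-free) combination, based on current k3s and cilium CLI flags rather than the exact commands in the linked guide, so worth verifying against the versions you deploy:
```
# Start k3s without flannel, kube-proxy, or the built-in service LB
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server \
  --cluster-init \
  --flannel-backend=none \
  --disable-network-policy \
  --disable-kube-proxy \
  --disable=servicelb

# Then install Cilium with kube-proxy replacement enabled
cilium install \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=<first-server-ip> \
  --set k8sServicePort=6443
```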
a
I'm up for anything, so I will try that. I only did it to rule out Flannel. Is Flannel also using kube-proxy? Thanks for the guide. This is pretty painless now as it is all scripted via Ansible; it takes about 15 mins to have all servers built and all apps and infra deployed. If it actually worked, that would be a bonus ๐Ÿ™‚
Oh, it uses Vagrant. Lol, something else to learn, why not ๐Ÿ™‚
Vagrant is only used to install k3s by the looks of it, so I can probably ignore that and use Ansible. Should be pretty painless to switch over
I've just noticed something in the mtr from my laptop: static.48.13.108.65.clients.your-server.de. That last hop it reaches is the correct host that has that MAC. Is the MAC address for the frontend IP (65.x.x.x) going to be the same as for the LoadBalancer IP being advertised? They are on the same NIC
Holy sh*t. It works using Cilium as the network stack. Maybe kube-proxy was the issue after all.
๐ŸŽ‰ 1
Thanks for that suggestion!