# k3s
a
This message was deleted.
c
because with ipv6 a /48 would be huuuuuuuuuge
and the controller-manager has to store a bitmask of all the allocated IPs in that range. With something larger than a /108 the bitmask is like, several megs of memory.
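For scale, here is a back-of-the-envelope sketch in Python of that bitmap cost (one bit per allocatable IP in the service range; this is illustrative only, not the actual controller-manager implementation):

```python
# Back-of-the-envelope for a bitmap allocator: one bit per allocatable
# address in the service range (illustrative only, not the real k8s code).
def bitmap_bytes(prefix_len: int, addr_bits: int = 128) -> int:
    addresses = 2 ** (addr_bits - prefix_len)
    return addresses // 8

print(bitmap_bytes(108))  # 131072 bytes  (~128 KiB for 2**20 addresses)
print(bitmap_bytes(104))  # 2097152 bytes (~2 MiB, "several megs" territory)
print(bitmap_bytes(48))   # ~1.5e23 bytes for 2**80 addresses -- not feasible
```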
c
It's really not all that huge if you follow the actual specifications... It's also the smallest range you can reliably announce over BGP...
And the issue with several megs of memory... in 2022... is what?
c
You know how big IPv6 is right
c
Yes...
c
a /16 with IPv4 is 2^16 = 65,536 four-byte addresses. A /48 with IPv6 is 2^80 = 1,208,925,819,614,629,174,706,176 128-bit addresses
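Those figures are easy to double-check with Python's ipaddress module (the prefixes below are arbitrary examples):

```python
import ipaddress

print(ipaddress.ip_network("10.0.0.0/16").num_addresses)    # 65536 = 2**16
print(ipaddress.ip_network("2001:db8::/48").num_addresses)  # 1208925819614629174706176 = 2**80
```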
c
That's completely ignoring everything about the ipv6 specification on subnet allocations...
c
Why would you need 2^80 services in your cluster
This isn’t even something that gets routed outside the cluster, it is literally just for the addresses assigned for ClusterIP services
and the IPAM needs to keep track of every single one of those allocations
c
Each L2 segment, as an example, should have a unique /64. Because of how Kubernetes works, that for the most part means each pod should have a /64 to itself...
c
You’re talking about pod cidrs, but you asked about the service-cluster-ip-range which is something else entirely
service-cluster-ip-range controls the CIDR block allocated for ClusterIP services.
c
And yes, that's for ClusterIP services... which every other service type builds on... So even if you do a LoadBalancer type, it will still have a ClusterIP
c
Yep
c
So it's NOT just ClusterIP services... It's ALL services...
c
Are you really going to have 1.2 million billion billion ClusterIPs?
And I’m not sure what BGP announcements have to do with it, like I said these aren’t routed outside the cluster
they just exist as the target of KubeProxy rules
And either way arguing with me won’t help, the Kubernetes authors decided it would be ludicrous to ever need more than 2^20 ClusterIPs
c
Again... If you follow the specifications, each of them would have a /64 of their own because they're different L2s. So it's not billions. A full /48 is 64k /64s total...
c
No, they don’t
Each node does not allocate out of a range of ClusterIPs
You are confusing that with pod IPs
c
And they are routed outside if you use things like Calico as the CNI
c
ClusterIP services are centrally allocated out of the controller manager
You are again confusing that with pod IPs
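To illustrate what "centrally allocated" means here, a toy sketch (nothing like the real allocator's code; the range below is just k3s's default IPv4 service CIDR): every Service, whatever its type, gets one ClusterIP out of a single shared range, and the allocator has to remember each one.

```python
import ipaddress

class ToyServiceIPAllocator:
    """Toy model of central ClusterIP allocation; not the real implementation."""

    def __init__(self, service_cidr: str):
        self.network = ipaddress.ip_network(service_cidr)
        self.allocated = set()  # has to track every ClusterIP ever handed out

    def allocate(self) -> str:
        for ip in self.network.hosts():
            if ip not in self.allocated:
                self.allocated.add(ip)
                return str(ip)
        raise RuntimeError("service-cluster-ip-range exhausted")

alloc = ToyServiceIPAllocator("10.43.0.0/16")  # k3s's default service range
print(alloc.allocate())  # 10.43.0.1 -- e.g. for a ClusterIP service
print(alloc.allocate())  # 10.43.0.2 -- e.g. for a LoadBalancer service (it still gets one)
```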
c
Again, you're ignoring that ClusterIP is used for ALL services, regardless of whether they're ClusterIP or LoadBalancer or whatever.
c
How am I?
They are not allocated per node, they are not routed outside the cluster
Only pod IP CIDRs are sub-allocated for nodes
c
You keep pointing to ClusterIP services... It's all services, not just ClusterIP services, because all services have a ClusterIP and this limitation affects all of them.
c
If you want to go argue with the Kubernetes maintainers why you need to have more than 1048576 services in a cluster you are welcome to do so
c
What I want is for the specs to be followed and for the assigned range to be announceable.
But so what you're saying is that it's an upstream limitation that would break compatibility?
Or upstream is the wrong word here since it's not a fork, but still
c
It is a limitation enforced by the Kubernetes controller manager itself. The service IPAM will refuse to start if you try to get it to track an ipv6 CIDR larger than /108
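Roughly the shape of the check being described, sketched in Python (the real validation lives in the upstream Kubernetes code, but the effect is: no more than 2^20 addresses, which for IPv6 means nothing wider than a /108):

```python
import ipaddress

MAX_HOST_BITS = 20  # upstream tracks at most 2**20 service IPs

def validate_service_cidr(cidr: str) -> None:
    """Sketch of the size limit described above; not the actual upstream code."""
    net = ipaddress.ip_network(cidr)
    host_bits = net.max_prefixlen - net.prefixlen
    if host_bits > MAX_HOST_BITS:
        raise ValueError(
            f"{cidr} spans 2**{host_bits} addresses; "
            f"the service IPAM only handles up to 2**{MAX_HOST_BITS}"
        )

validate_service_cidr("fd00:abcd::/108")  # OK: exactly 2**20 addresses
validate_service_cidr("fd00:abcd::/48")   # raises: 2**80 addresses
```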
c
Right, but doesn't k3s have its own controller manager?
c
No, we just run the Kubernetes apiserver, controller-manager, scheduler, and so on
We run other things in addition to that, but we also use all of the core upstream stuff
c
Ah right. So then yea it's them that I need to ask I guess.
c
As I said earlier, announcing the service CIDR is not a logical thing to do anyway, the service IPs don’t actually exist anywhere, they are just the target of iptables rules managed by kube-proxy. You will never find anything that actually listens on or handles traffic to those IPs, and you are not expected to pass traffic to those IPs except from within the cluster.
That’s why they’re called cluster IPs, you’re not SUPPOSED to get at them from outside the cluster. I highly doubt upstream will want to change that.
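In other words, a ClusterIP is just a lookup key that gets rewritten to a real pod endpoint on the way through; a deliberately oversimplified Python sketch of the idea (made-up addresses, and nothing like kube-proxy's actual internals):

```python
import random

# ClusterIP -> real pod endpoints; traffic to the ClusterIP is DNAT'd to one
# of these. Nothing ever listens on the ClusterIP itself. (Made-up addresses.)
service_endpoints = {
    ("10.43.17.5", 80): [("10.42.0.12", 8080), ("10.42.1.7", 8080)],
}

def rewrite_destination(dst_ip: str, dst_port: int):
    """Pick a real pod endpoint for a packet addressed to a ClusterIP."""
    return random.choice(service_endpoints[(dst_ip, dst_port)])

print(rewrite_destination("10.43.17.5", 80))  # e.g. ('10.42.0.12', 8080)
```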
c
That is true for ClusterIP... It's not true for, as an example, LoadBalancer, which is explicitly designed to be exposed.
c
kube-proxy fakes those IPs with iptables or IPVS, some other CNIs will do the same with eBPF… but they don’t really exist anywhere
c
And Calico does it with... BGP...
c
Calico routes traffic between pods with BGP, not cluster IPs. As far as I know.
c
You have seriously only just scratched the surface of Calico then... It does NOT just do traffic between pods... It also handles services and external peering, both with the underlying network on the nodes, and you can also do BGP peering with external service providers.
And of course, while I could do a /108 as the service range and announce the full /48... The only thing that accomplishes is that now there are 60 bits that are completely dead space, completely negating one of the goals of IPv6, which was obfuscation through big ranges so that you can't just guess the IP of a service... That's why the smallest subnet size is a /64
c
If you’re using Calico then you should be using their pod and service IPAMs and not the core Kubernetes one anyways? So the limitation doesn’t even affect you.
They have their own thing for managing blocks that doesn’t use the kubernetes built-in IPAM at all
c
Hm? Don't those settings have to match? I've had issues in the past when I used Weave when it didn't match 😕
c
I guess I’ve not tried to use it as the service IPAM instead, just as the node IPAM
I see that Tigera does have a blog post up about advertising the service ip range via BGP but I note that they don’t talk about doing it for ipv6, just ipv4. https://www.tigera.io/blog/advertising-kubernetes-service-ips-with-calico-and-bgp/
c
Yea because you can't announce a /108... There's no provider that would peer with you with that size and most would probably outright block further communication if you even tried it 🙂
c
This seems really dumb though, it just ECMPs the cluster cidr across all the nodes, it doesn’t even try to steer traffic to the nodes actually hosting the endpoints for the service.
so it still has to go through iptables or whatnot and then get bounced over to the pod
c
It doesn't need to. Calico builds a VXLAN mesh that it routes all traffic over, so regardless of which node the pod is running on, it'll still reach it...
c
I am still not sure why you would want to let things directly connect to clusterip services from outside the cluster anyway
yes, but what even is the point of pushing this routing outside the cluster if it still has to bounce around between the nodes.
exposing clusterips outside the cluster is considered an antipattern anyway
you’re intended to use loadbalancers and/or ingress to get into the cluster, and then clusterips within it.
c
Well, if I expose the pod network directly, then there's no load balancing. That's exactly why a load balancer is used. But that still needs an IP and uses the service range.
But it seems Calico indeed does not care at all about the settings defined during cluster setup. So I guess the problem is moot, even if I consider it a weird limitation 🙂
c
you’re just going to send all your traffic from outside the cluster directly to pods somehow and not use loadbalancers at all?
c
MetalLB
c
Yeah you don't need to advertise the ClusterIP for that, you can just get MetalLB to do the BGP peering. It will even do better than Calico and specifically send traffic to nodes with endpoints for the service.
That would be a way better approach than trying to advertise your ClusterIP range with equal weight to all the nodes.
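For reference, the "send traffic to nodes with endpoints" behaviour corresponds to setting externalTrafficPolicy: Local on the LoadBalancer Service; a hedged sketch with the official Kubernetes Python client (made-up names and ports, and it assumes MetalLB is already configured for BGP):

```python
from kubernetes import client, config

config.load_kube_config()

# LoadBalancer Service whose external traffic is only delivered via nodes
# that actually host endpoints for it (externalTrafficPolicy: Local).
# The app label, ports, and namespace are made up for the example.
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="demo-lb"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        external_traffic_policy="Local",
        selector={"app": "demo"},
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
```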