creamy-hospital-75658

12/08/2022, 11:05 PM
Can anyone explain to me why, when using IPv6, --service-cluster-ip-range has to be a /108 or smaller? It doesn't make sense to me, especially when we can specify a /16 for IPv4; we should at least be able to specify a /48 😕... I wanted to use one /48 for pods and another /48 for services 😕

creamy-pencil-82913

12/09/2022, 12:04 AM
because with ipv6 a /48 would be huuuuuuuuuge
and the controller-manager has to store a bitmask of all the allocated IPs in that range. With something larger than a /108 the bitmask is like, several megs of memory.
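The bitmap arithmetic here can be sanity-checked with a quick sketch (one bit per allocatable address; this mirrors the idea, not the controller-manager's actual data structure, and the prefix lengths are just illustrative):

```python
# Rough memory footprint of an allocation bitmap: one bit per address
# in the service CIDR. Not the real Kubernetes structure, just the math.
def bitmap_bytes(prefix_len: int, total_bits: int = 128) -> int:
    host_bits = total_bits - prefix_len
    return 2 ** host_bits // 8  # one bit per allocatable address

print(bitmap_bytes(108))  # 131072 bytes  -> 128 KiB
print(bitmap_bytes(100))  # 33554432 bytes -> 32 MiB
print(bitmap_bytes(48))   # ~1.5e23 bytes -> not remotely feasible
```

Even a modest step past /108 (e.g. /100) already puts the bitmap in the tens of megabytes.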

creamy-hospital-75658

12/09/2022, 12:06 AM
It's really not all that huge if you follow the actual specifications... It's also the smallest range you can reliably announce over BGP...
And the issue with several megs of memory... in 2022... is what?

creamy-pencil-82913

12/09/2022, 12:08 AM
You know how big IPv6 is right

creamy-hospital-75658

12/09/2022, 12:08 AM
Yes...

creamy-pencil-82913

12/09/2022, 12:09 AM
a /16 with IPv4 leaves 16 host bits: 2^16 = 65536 addresses. A /48 with IPv6 leaves 80 host bits: 2^80 = 1208925819614629174706176 addresses
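Those counts check out against Python's standard ipaddress module (the example prefixes below are arbitrary documentation ranges, not anything from the conversation):

```python
import ipaddress

# Arbitrary example networks, just to verify the address counts.
v4 = ipaddress.ip_network("10.0.0.0/16")
v6 = ipaddress.ip_network("2001:db8::/48")

print(v4.num_addresses)  # 65536 (2**16)
print(v6.num_addresses)  # 1208925819614629174706176 (2**80)
```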

creamy-hospital-75658

12/09/2022, 12:10 AM
That's completely ignoring everything about the ipv6 specification on subnet allocations...

creamy-pencil-82913

12/09/2022, 12:10 AM
Why would you need 2^80 services in your cluster
This isn’t even something that gets routed outside the cluster, it is literally just for the addresses assigned for ClusterIP services
and the IPAM needs to keep track of every single one of those allocations

creamy-hospital-75658

12/09/2022, 12:11 AM
Each L2 segment, as an example, should have a unique /64. Because of how Kubernetes works, that for the most part means each pod should have a /64 to itself...

creamy-pencil-82913

12/09/2022, 12:12 AM
You’re talking about pod cidrs, but you asked about the service-cluster-ip-range which is something else entirely
service-cluster-ip-range controls the CIDR block allocated for ClusterIP services.

creamy-hospital-75658

12/09/2022, 12:12 AM
And yes, that's for ClusterIP services... which every other service type builds on... So even if you do a LoadBalancer type, it will still have a ClusterIP

creamy-pencil-82913

12/09/2022, 12:12 AM
Yep

creamy-hospital-75658

12/09/2022, 12:13 AM
So it's NOT just ClusterIP services... It's ALL services...

creamy-pencil-82913

12/09/2022, 12:13 AM
Are you really going to have 1.2 million billion billion ClusterIPs?
And I’m not sure what BGP announcements have to do with it, like I said these aren’t routed outside the cluster
they just exist as the target of KubeProxy rules
And either way arguing with me won’t help, the Kubernetes authors decided it would be ludicrous to ever need more than 2^20 ClusterIPs

creamy-hospital-75658

12/09/2022, 12:15 AM
Again... If you follow the specifications, each of them would have a /64 to themselves because they're different L2s. So it's not billions. A full /48 is 64k /64 subnets total...
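That 64k figure is just the subnet count, which can be confirmed with a short sketch (again using an arbitrary documentation prefix):

```python
import ipaddress

site = ipaddress.ip_network("2001:db8::/48")  # arbitrary example prefix

# A /48 splits into 2**(64-48) distinct /64 subnets.
print(2 ** (64 - 48))  # 65536

# Same answer from the stdlib, counting without materialising a list:
print(sum(1 for _ in site.subnets(new_prefix=64)))  # 65536
```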

creamy-pencil-82913

12/09/2022, 12:15 AM
No, they don’t
Each node does not allocate out of a range of ClusterIPs
You are confusing that with pod IPs

creamy-hospital-75658

12/09/2022, 12:15 AM
And they are routed outside if you use things like Calico as the CNI

creamy-pencil-82913

12/09/2022, 12:15 AM
ClusterIP services are centrally allocated out of the controller manager
You are again confusing that with pod IPs

creamy-hospital-75658

12/09/2022, 12:16 AM
Again, you're ignoring that ClusterIP is used for ALL services, regardless if they're ClusterIP or LoadBalancer or whatever.

creamy-pencil-82913

12/09/2022, 12:16 AM
How am I?
They are not allocated per node, they are not routed outside the cluster
Only pod IP CIDRs are sub-allocated for nodes

creamy-hospital-75658

12/09/2022, 12:17 AM
You keep pointing to ClusterIP services... It's all services, not just ClusterIP services, because all services have a ClusterIP and this limitation limits all of it.

creamy-pencil-82913

12/09/2022, 12:18 AM
If you want to go argue with the Kubernetes maintainers why you need to have more than 1048576 services in a cluster you are welcome to do so

creamy-hospital-75658

12/09/2022, 12:19 AM
What I want is for the specs to be followed and for the assigned range to be announceable.
But so what you're saying is that it's an upstream limitation that would break compatibility?
Or maybe upstream is the wrong word here since it's not a fork, but still

creamy-pencil-82913

12/09/2022, 12:21 AM
It is a limitation enforced by the Kubernetes controller manager itself. The service IPAM will refuse to start if you try to get it to track an ipv6 CIDR larger than /108
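The rule being enforced amounts to "no more than 2^20 allocatable addresses", which can be sketched like this (a simplified illustration only, not the actual Kubernetes code; the constant and function name are made up):

```python
import ipaddress

MAX_HOST_BITS = 20  # upstream caps the service range at 2**20 addresses


def validate_service_cidr(cidr: str) -> None:
    """Hypothetical sketch of the upstream limit: reject ranges with
    more than 2**20 allocatable addresses (i.e. IPv6 larger than /108,
    IPv4 larger than /12)."""
    net = ipaddress.ip_network(cidr)
    host_bits = net.max_prefixlen - net.prefixlen
    if host_bits > MAX_HOST_BITS:
        raise ValueError(
            f"{cidr}: service range too large, needs "
            f"/{net.max_prefixlen - MAX_HOST_BITS} or a smaller range"
        )


validate_service_cidr("2001:db8::/108")  # ok: exactly 2**20 addresses
validate_service_cidr("10.43.0.0/16")    # ok: IPv4 /16 leaves 16 bits
# validate_service_cidr("2001:db8::/48") # would raise ValueError
```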

creamy-hospital-75658

12/09/2022, 12:22 AM
Right, but doesn't k3s have its own controller manager?

creamy-pencil-82913

12/09/2022, 12:22 AM
No, we just run the Kubernetes apiserver, controller-manager, scheduler, and so on
We run other things in addition to that, but we also use all of the core upstream stuff

creamy-hospital-75658

12/09/2022, 12:23 AM
Ah right. So then yea it's them that I need to ask I guess.

creamy-pencil-82913

12/09/2022, 12:24 AM
As I said earlier, announcing the service CIDR is not a logical thing to do anyway, the service IPs don’t actually exist anywhere, they are just the target of iptables rules managed by kube-proxy. You will never find anything that actually listens on or handles traffic to those IPs, and you are not expected to pass traffic to those IPs except from within the cluster.
That’s why they’re called cluster IPs, you’re not SUPPOSED to get at them from outside the cluster. I highly doubt upstream will want to change that.

creamy-hospital-75658

12/09/2022, 12:26 AM
That is true for ClusterIP... It's not true for, as an example, LoadBalancer, which is explicitly designed to be exposed.

creamy-pencil-82913

12/09/2022, 12:26 AM
kube-proxy fakes those IPs with iptables or lvs, some other CNIs will do the same with ebpf… but they don’t really exist anywhere

creamy-hospital-75658

12/09/2022, 12:26 AM
And Calico does it with... BGP...

creamy-pencil-82913

12/09/2022, 12:27 AM
Calico routes traffic between pods with BGP, not cluster IPs. As far as I know.

creamy-hospital-75658

12/09/2022, 12:28 AM
You have seriously only just scratched the surface of Calico then... It does NOT just do traffic between pods... It also handles services and external peering, both with the underlying network on the nodes, and you can also do BGP peering with external service providers.
And of course, while I could do a /108 as the service range and announce the full /48... the only thing that accomplishes is that now there are 60 bits of completely dead space, negating one of the goals of IPv6, which was obfuscation: using big ranges so that you can't just guess the IP of a service... That's why the smallest subnet size is /64

creamy-pencil-82913

12/09/2022, 12:33 AM
If you’re using Calico then you should be using their pod and service IPAMs and not the core Kubernetes one anyways? So the limitation doesn’t even affect you.
They have their own thing for managing blocks that doesn’t use the kubernetes built-in IPAM at all

creamy-hospital-75658

12/09/2022, 12:34 AM
Hm? Don't those settings have to match? I've had issues in the past when I used Weave when it didn't match 😕
I guess I’ve not tried to use it as the service IPAM instead, just as the node IPAM
I see that Tigera does have a blog post up about advertising the service ip range via BGP but I note that they don’t talk about doing it for ipv6, just ipv4. https://www.tigera.io/blog/advertising-kubernetes-service-ips-with-calico-and-bgp/

creamy-hospital-75658

12/09/2022, 12:38 AM
Yea because you can't announce a /108... There's no provider that would peer with you with that size and most would probably outright block further communication if you even tried it 🙂

creamy-pencil-82913

12/09/2022, 12:39 AM
This seems really dumb though, it just ECMPs the cluster cidr across all the nodes, it doesn’t even try to steer traffic to the nodes actually hosting the endpoints for the service.
so it still has to go through iptables or whatnot and then get bounced over to the pod

creamy-hospital-75658

12/09/2022, 12:41 AM
It doesn't need to. Calico does a VXLAN mesh that it routes all traffic over, so regardless of which node the pod is running on, it'll still reach it...

creamy-pencil-82913

12/09/2022, 12:41 AM
I am still not sure why you would want to let things directly connect to clusterip services from outside the cluster anyway
yes, but what even is the point of pushing this routing outside the cluster if it still has to bounce around between the nodes.
exposing clusterips outside the cluster is considered an antipattern anyway
you’re intended to use loadbalancers and/or ingress to get into the cluster, and then clusterips within it.

creamy-hospital-75658

12/09/2022, 12:45 AM
Well, if I expose the pod network directly, then there's no load balancing. That's exactly why a load balancer is used. But that still needs an IP, and it uses the service range.
But it seems calico indeed does not care at all about the settings defined during cluster setup. So I guess the problem is moot, even if I consider it a weird limitation 🙂

creamy-pencil-82913

12/09/2022, 12:51 AM
you’re just going to send all your traffic from outside the cluster directly to pods somehow and not use loadbalancers at all?

creamy-hospital-75658

12/09/2022, 12:55 AM
MetalLB

creamy-pencil-82913

12/09/2022, 1:56 AM
Yeah you don't need to advertise the ClusterIP for that, you can just get metallb to do the bgp peering. It will even do better than calico and specifically send traffic to nodes with endpoints for the service.
That would be a way better approach than trying to advertise your ClusterIP range with equal weight to all the nodes.