brief-rainbow-54144

11/25/2022, 9:12 PM
Hi all, as is seemingly standard when joining channels like this... I think I've got an issue, but it could also be RHEL and its horrible tx-checksum-ip-generic issue manifesting itself in a different way:
failed calling webhook \"validator.longhorn.io\": failed to call webhook: Post \"https://longhorn-admission-webhook.longhorn-system.svc:9443/v1/webhook/validaton?timeout=10s
Any pointers welcome - I'm about to fire up a dnsutils pod to look at the DNS now.

creamy-pencil-82913

11/25/2022, 9:17 PM
You seem to have cut off the actual error. Did it time out? Did it fail to resolve? Did it return an HTTP error response?
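The three cases asked about here (timeout, resolution failure, HTTP error response) can be told apart programmatically. A minimal sketch, assuming Python's standard library is available in a debug pod; `classify` and `probe` are hypothetical helper names, not part of Longhorn or Kubernetes. Certificate verification is deliberately skipped because only reachability is being tested, and `probe` sends a plain GET rather than the admission POST, so an HTTP error status would still count as "reachable":

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify(exc: BaseException) -> str:
    """Map an exception raised by urlopen() to a coarse failure category."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_error:{exc.code}"
    # URLError wraps the underlying socket error in .reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, socket.gaierror):
        return "dns_failure"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return f"connection_error:{type(reason).__name__}"


def probe(url: str, timeout: float = 10.0) -> str:
    """Try to reach url and report which kind of failure (if any) occurred."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # assumption: cluster-internal cert, not verified here
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return f"ok:{resp.status}"
    except Exception as exc:
        return classify(exc)
```

Run from a pod inside the cluster, something like `probe("https://longhorn-admission-webhook.longhorn-system.svc:9443/")` would return `timeout` for the "context deadline exceeded" case, and `dns_failure` if the name did not resolve at all.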

brief-rainbow-54144

11/25/2022, 9:18 PM
You're correct... ": context deadline exceeded"
I was definitely hit by the checksum issue for flannel - nothing would actually resolve DNS initially (thanks RHEL), but implementing a service to auto-disable the checksumming seems to have remedied that, and now I'm here. Lol
(Bear in mind this springs into life perfectly fine on a Rocky setup, I'm just forced into using RHEL here)
That said, this is 32 nodes and not 7 (not sure if that makes a difference or not).
root@dnsutils:/# host longhorn-admission-webhook.longhorn-system.svc
longhorn-admission-webhook.longhorn-system.svc.cluster.local has address 10.44.172.173
DNS looks good...
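A successful lookup only rules out name resolution; whether the webhook port accepts connections is a separate question. A rough sketch of checking both steps in one go, using only Python's `socket` module (`resolve_and_connect` is a hypothetical helper name, and the 5 second default timeout is an arbitrary choice):

```python
import socket


def resolve_and_connect(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve host, then attempt a TCP connect to the first address returned."""
    result = {"host": host, "address": None, "connected": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result
    family, socktype, proto, _canon, sockaddr = infos[0]
    result["address"] = sockaddr[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            result["connected"] = True
    except OSError as exc:  # covers timeouts and refused connections
        result["error"] = f"connect: {exc}"
    return result
```

Called as `resolve_and_connect("longhorn-admission-webhook.longhorn-system.svc", 9443)` from a pod, an `address` that is filled in together with a `connect:` timeout error would point at the network path rather than at DNS.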

creamy-pencil-82913

11/25/2022, 9:29 PM
The tx checksum thing is likely if you’re using VXLAN.
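The workaround mentioned in this thread, turning off `tx-checksum-ip-generic` on the VXLAN interface, might look roughly like the sketch below. It assumes the interface names start with `flannel.` (flannel's VXLAN device is commonly `flannel.1`, but that is not confirmed here) and that `ethtool` is installed; the function names are hypothetical, and the default is a dry run that only returns the commands:

```python
import subprocess
from pathlib import Path


def vxlan_interfaces(prefix: str = "flannel.", sys_net: str = "/sys/class/net") -> list[str]:
    """List interfaces whose name starts with prefix (assumed to be flannel VXLAN devices)."""
    root = Path(sys_net)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith(prefix))


def checksum_off_command(iface: str) -> list[str]:
    """Build the ethtool invocation that turns off tx-checksum-ip-generic on iface."""
    return ["ethtool", "-K", iface, "tx-checksum-ip-generic", "off"]


def disable_checksum_offload(prefix: str = "flannel.",
                             sys_net: str = "/sys/class/net",
                             dry_run: bool = True) -> list[list[str]]:
    """Disable the offload on every matching interface; returns the commands used."""
    commands = [checksum_off_command(i) for i in vxlan_interfaces(prefix, sys_net)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # needs root and ethtool on PATH
    return commands
```

Calling `disable_checksum_offload(dry_run=False)` from a systemd unit or timer would approximate the "service" approach; the setting does not persist when the interface is recreated, which is why a hook on interface creation is attractive.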

brief-rainbow-54144

11/25/2022, 9:30 PM
Yeah... I might just create some systemd/udev stuff to get round it, so the moment an interface comes up, the checksum offload gets disabled automatically.
Should I have to do this? Absolutely not. Lol
It's weird because it's clearly communicating... I get leader election messages from the managers...
longhorn-manager time="2022-11-25T21:34:06Z" level=info msg="New upgrade leader elected: node-k98l-020"
longhorn-manager time="2022-11-25T21:34:13Z" level=info msg="New upgrade leader elected: node-k98l-033"
longhorn-manager time="2022-11-25T21:34:23Z" level=info msg="New upgrade leader elected: node-k98l-021"
longhorn-manager time="2022-11-25T21:34:35Z" level=info msg="New upgrade leader elected: node-k98l-022"
Ok so the plot thickens
I can fire up a smaller cluster and... all fine. Manager picks up, Longhorn starts. This looks to be some sort of scalability issue.
How many nodes does Longhorn support scaling-wise?