
dry-dawn-97788

05/11/2023, 5:46 PM
I am trying to get an RKE2 cluster up and running, but I run into trouble when I try to install cert-manager, or more specifically when I define a ClusterIssuer, which in turn triggers the cert-manager-webhook webhook. As it is the API server that wants to connect to the webhook, it needs two things:
• to be able to resolve the DNS name in the URL https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s, for which it needs configuration for and reachability to the CoreDNS service, and
• to be able to reach the service endpoint as well.
Neither of these works. What puzzles me is that the API server runs as a host-networked pod and therefore lacks the DNS config needed; it just has config that points to the external name server (as if the ClusterFirstWithHostNet setting were not set on the pod). Secondly, I do not know how to debug the issue of not being able to reach the service (must be some kind of iptables misconfiguration?). All help / ideas are welcome! Thanks!
I might add that I have tried calico and canal as CNIs with their default settings. Things are running on Proxmox VMs and Ubuntu 22.04. Tried v1.26.4+rke2r1 and v1.25.9+rke2r1.

creamy-pencil-82913

05/11/2023, 6:00 PM
Yes, the apiserver has options to work with those requirements, anticipating that the webhook will be deployed to the cluster but that the apiserver might not be part of the cluster:
1. If you target a Service, it looks up the service endpoints directly.
2. If the apiserver is not part of the cluster, you can set up an egress proxy (the Konnectivity service) that the apiserver can use to reach the endpoints.
You shouldn’t need to use either of these on RKE2, as the server nodes are also part of the cluster, and should have the ability to reach pod and service addresses, even from the host network namespace
cert-manager is required for Rancher; we install it all the time without any issues. I suspect there is something blocking inter-node CNI traffic in your environment.
Confirm that you have the correct CNI ports open, and that you don't have any other iptables rules on your nodes blocking traffic to/from the cluster CIDR ranges.
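A quick way to check those ports from another node is a small TCP probe. This is only a sketch: the port list follows RKE2's documented defaults, and 10.101.0.11 stands in for one of the server IPs from this thread. Note that Canal's VXLAN traffic uses UDP 8472, which a TCP probe cannot verify:

```shell
#!/bin/sh
# Probe TCP reachability of RKE2 control-plane ports between nodes.
# Port list from RKE2's documented requirements; 10.101.0.11 is one of
# the example server IPs from this thread -- substitute your own.
RKE2_TCP_PORTS="9345 6443 10250 2379 2380"

check_tcp() {
  # Prints "open" or "closed" for host $1, port $2 (2 s timeout).
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

for port in $RKE2_TCP_PORTS; do
  echo "10.101.0.11:$port -> $(check_tcp 10.101.0.11 "$port")"
done
# Canal also needs UDP 8472 (VXLAN) open between all nodes; /dev/tcp
# only covers TCP, so verify that separately (e.g. with tcpdump).
```

Any port reported closed between two nodes points at a firewall or iptables rule worth inspecting.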

dry-dawn-97788

05/11/2023, 6:04 PM
Yes, I guess I am doing something wrong, but I really have tried all I can think of and I'm out of options. It's an HA cluster. I think the CNI traffic is working fine: pod-to-pod traffic is OK, and normal pods can reach services.

creamy-pencil-82913

05/11/2023, 6:04 PM
what version of rke2 are you on?

dry-dawn-97788

05/11/2023, 6:04 PM
Firewall is inactive - so it should not be the issue.
$ /usr/local/bin/rke2 --version
rke2 version v1.25.9+rke2r1 (842d05e64bcbf78552f1db0b32700b8faea403a0)
go version go1.19.8 X:boringcrypto
These are the errors I am getting:
Error from server (InternalError): error when creating "cert-manager/k8s/letsencrypt-prod-issuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "cert-manager/k8s/letsencrypt-staging-issuer.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": context deadline exceeded

creamy-pencil-82913

05/11/2023, 6:16 PM
is the webhook pod healthy?

dry-dawn-97788

05/11/2023, 6:16 PM
Yes, I can invoke it with curl from other pods, both directly via its pod IP and via the service cluster IP.
But you say the API server looks up the endpoint(s) directly? Does that imply it does not use the cluster DNS at all, but rather connects directly to the pods instead?
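Since the RKE2 apiserver runs in the host network namespace, the same curl check is worth repeating from a server node itself rather than from a pod. A minimal sketch; CLUSTER_IP and POD_IP:PORT are placeholders to fill in from the actual cluster:

```shell
# Probe an HTTPS endpoint and print just the HTTP status code.
# "000" means the TCP/TLS connection itself failed (refused, timed out, ...).
probe() {
  curl -sk -m 5 -o /dev/null -w '%{http_code}\n' "$1" || true
}

# On a server node, substitute the real addresses from
# "kubectl get svc,pod -n cert-manager -o wide":
#   probe https://CLUSTER_IP:443/mutate   # webhook service cluster IP
#   probe https://POD_IP:PORT/mutate      # webhook pod IP and container port
```

If the pod IP responds from the host but the cluster IP does not, that points at kube-proxy/iptables service translation rather than the CNI itself.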

creamy-pencil-82913

05/11/2023, 6:36 PM
yep
if it is defined as a Service at least
can you show the output of kubectl get node -o yaml | grep node-args?
you can also see that it targets the service directly:
brandond@dev01:~$ kubectl get validatingwebhookconfigurations cert-manager-webhook -o yaml | grep -A10 webhooks:
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: XXX
    service:
      name: cert-manager-webhook
      namespace: cert-manager
      path: /validate
      port: 443
  failurePolicy: Fail

dry-dawn-97788

05/11/2023, 6:49 PM
Output of the node-args command:
kubectl get node -o yaml | grep node-args
      rke2.io/node-args: '["server","--token","********","--data-dir","/var/lib/rancher/rke2","--cni","canal","--tls-san","cluster.local","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","k8s-api.dcs.tickup.net","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","10.101.0.11","--tls-san","v1112-dcs-master-2.dcs.tickup.net","--tls-san","10.101.0.12","--tls-san","v1113-dcs-master-3.dcs.tickup.net","--tls-san","10.101.0.13","--tls-san","v1114-dcs-master-4.dcs.tickup.net","--tls-san","10.101.0.14","--tls-san","v1115-dcs-master-5.dcs.tickup.net","--tls-san","10.101.0.15","--snapshotter","overlayfs","--node-name","v1111-dcs-master-1.dcs.tickup.net"]'
      rke2.io/node-args: '["server","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--cni","canal","--tls-san","cluster.local","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","k8s-api.dcs.tickup.net","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","10.101.0.11","--tls-san","v1112-dcs-master-2.dcs.tickup.net","--tls-san","10.101.0.12","--tls-san","v1113-dcs-master-3.dcs.tickup.net","--tls-san","10.101.0.13","--tls-san","v1114-dcs-master-4.dcs.tickup.net","--tls-san","10.101.0.14","--tls-san","v1115-dcs-master-5.dcs.tickup.net","--tls-san","10.101.0.15","--snapshotter","overlayfs","--node-name","v1112-dcs-master-2.dcs.tickup.net"]'
      rke2.io/node-args: '["server","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--cni","canal","--tls-san","cluster.local","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","k8s-api.dcs.tickup.net","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","10.101.0.11","--tls-san","v1112-dcs-master-2.dcs.tickup.net","--tls-san","10.101.0.12","--tls-san","v1113-dcs-master-3.dcs.tickup.net","--tls-san","10.101.0.13","--tls-san","v1114-dcs-master-4.dcs.tickup.net","--tls-san","10.101.0.14","--tls-san","v1115-dcs-master-5.dcs.tickup.net","--tls-san","10.101.0.15","--snapshotter","overlayfs","--node-name","v1113-dcs-master-3.dcs.tickup.net"]'
      rke2.io/node-args: '["server","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--cni","canal","--tls-san","cluster.local","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","k8s-api.dcs.tickup.net","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","10.101.0.11","--tls-san","v1112-dcs-master-2.dcs.tickup.net","--tls-san","10.101.0.12","--tls-san","v1113-dcs-master-3.dcs.tickup.net","--tls-san","10.101.0.13","--tls-san","v1114-dcs-master-4.dcs.tickup.net","--tls-san","10.101.0.14","--tls-san","v1115-dcs-master-5.dcs.tickup.net","--tls-san","10.101.0.15","--snapshotter","overlayfs","--node-name","v1114-dcs-master-4.dcs.tickup.net"]'
      rke2.io/node-args: '["server","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--cni","canal","--tls-san","cluster.local","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","k8s-api.dcs.tickup.net","--tls-san","v1111-dcs-master-1.dcs.tickup.net","--tls-san","10.101.0.11","--tls-san","v1112-dcs-master-2.dcs.tickup.net","--tls-san","10.101.0.12","--tls-san","v1113-dcs-master-3.dcs.tickup.net","--tls-san","10.101.0.13","--tls-san","v1114-dcs-master-4.dcs.tickup.net","--tls-san","10.101.0.14","--tls-san","v1115-dcs-master-5.dcs.tickup.net","--tls-san","10.101.0.15","--snapshotter","overlayfs","--node-name","v1115-dcs-master-5.dcs.tickup.net"]'
      rke2.io/node-args: '["agent","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--snapshotter","overlayfs","--node-name","v1121-dcs-worker-1.dcs.tickup.net"]'
      rke2.io/node-args: '["agent","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--snapshotter","overlayfs","--node-name","v1122-dcs-worker-2.dcs.tickup.net"]'
      rke2.io/node-args: '["agent","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--snapshotter","overlayfs","--node-name","v1123-dcs-worker-3.dcs.tickup.net"]'
      rke2.io/node-args: '["agent","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--snapshotter","overlayfs","--node-name","v1124-dcs-worker-4.dcs.tickup.net"]'
      rke2.io/node-args: '["agent","--server","https://v1111-dcs-master-1.dcs.tickup.net:9345","--token","********","--data-dir","/var/lib/rancher/rke2","--snapshotter","overlayfs","--node-name","v1125-dcs-worker-5.dcs.tickup.net"]'

creamy-pencil-82913

05/11/2023, 6:50 PM
I don’t see anything in there that would influence the apiserver’s behavior with regards to how it connects to the webhook service

dry-dawn-97788

05/11/2023, 6:51 PM
OK, thanks!

creamy-pencil-82913

05/11/2023, 6:51 PM
there are a lot of unnecessary args in there, things set to their defaults and such, but other than that looks ok
are you by any chance using a proxy? do you have HTTP_PROXY/HTTPS_PROXY env vars set?

dry-dawn-97788

05/11/2023, 6:52 PM
Yeah, I got carried away with the san stuff. I should clean that up.
No, there are no proxies involved.

creamy-pencil-82913

05/11/2023, 6:53 PM
you might try setting --egress-selector-mode=cluster on the servers, just to see if that helps?
set that on the servers, then restart the services on the servers and agents
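For reference, on RKE2 server flags like this are usually set via the config file rather than on the command line; a sketch of the change, assuming the standard config location:

```yaml
# /etc/rancher/rke2/config.yaml on each server node
egress-selector-mode: cluster
```

After editing, restart rke2-server on the servers and rke2-agent on the workers (e.g. systemctl restart rke2-server).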

dry-dawn-97788

05/11/2023, 6:54 PM
OK, I'll try that!

creamy-pencil-82913

05/11/2023, 6:54 PM
you’re on 1.25.9 on all of the servers and agents?

dry-dawn-97788

05/11/2023, 6:55 PM
Yes, I just rebuilt the cluster from scratch (deploying new VMs) and everything is installed using Ansible, so it should be identical on all nodes.
The --egress-selector-mode=cluster setting did not seem to help.

creamy-pencil-82913

05/11/2023, 7:16 PM
hmm. if you add --debug to the server flags, do you get anything useful in the logs?

dry-dawn-97788

05/11/2023, 7:41 PM
I don't find anything special. It's late now - I'm giving up for today. Thanks for all your help!