# harvester
m
Hey team, I deployed a 3-node Harvester cluster on our local servers. Since yesterday, the Harvester UI has been inaccessible, and now the entire cluster seems to be down: all server nodes are showing as NotReady. I also attempted a cluster reset with rke2 server --cluster-reset. I've collected the related logs and am sharing them here. Looking for help to troubleshoot and recover the cluster.
kubectl get nodes -A
E0903 12:06:31.023219 25163 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0903 12:06:31.024828 25163 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0903 12:06:31.027031 25163 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0903 12:06:31.028734 25163 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0903 12:06:31.030133 25163 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://127.0.0.1:6443/api?timeout=32s\": dial tcp 127.0.0.1:6443: connect: connection refused"
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
sudo /opt/rke2/bin/rke2 server --cluster-reset --config /etc/rancher/rke2/config.yaml.d/90-harvester-server.yaml
WARN[0000] not running in CIS mode
INFO[0000] Applying Pod Security Admission Configuration
INFO[0000] Static pod cleanup in progress
INFO[0000] Logging temporary containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log
INFO[0000] Running temporary containerd /var/lib/rancher/rke2/bin/containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
INFO[0010] Static pod cleanup completed successfully
INFO[0010] Starting rke2 v1.32.4+rke2r1 (4e465c0f03edba9a2af3b3c77d09840d3f7681ef)
INFO[0010] Managed etcd cluster initializing
INFO[0010] Updated load balancer rke2-agent-load-balancer default server: 127.0.0.1:9345
INFO[0010] Running load balancer rke2-agent-load-balancer 127.0.0.1:6444 -> [] [default: 127.0.0.1:9345]
WARN[0010] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
INFO[0010] Running load balancer rke2-api-server-agent-load-balancer 127.0.0.1:6443 -> [] [default: ]
INFO[0011] Password verified locally for node orion1
INFO[0011] certificate CN=orion1 signed by CN=rke2-server-ca@1754633700: notBefore=2025-08-08 06:15:00 +0000 UTC notAfter=2026-09-03 10:53:20 +0000 UTC
INFO[0011] certificate CN=system:node:orion1,O=system:nodes signed by CN=rke2-client-ca@1754633700: notBefore=2025-08-08 06:15:00 +0000 UTC notAfter=2026-09-03 10:53:20 +0000 UTC
INFO[0011] certificate CN=system:kube-proxy signed by CN=rke2-client-ca@1754633700: notBefore=2025-08-08 06:15:00 +0000 UTC notAfter=2026-09-03 10:53:21 +0000 UTC
INFO[0011] certificate CN=system:rke2-controller signed by CN=rke2-client-ca@1754633700: notBefore=2025-08-08 06:15:00 +0000 UTC notAfter=2026-09-03 10:53:21 +0000 UTC
INFO[0011] Using private registry config file at /etc/rancher/rke2/registries.yaml
INFO[0011] Module overlay was already loaded
INFO[0011] Module nf_conntrack was already loaded
INFO[0011] Module br_netfilter was already loaded
INFO[0011] Module iptable_nat was already loaded
INFO[0011] Module iptable_filter was already loaded
INFO[0011] Runtime image index.docker.io/rancher/rke2-runtime:v1.32.4-rke2r1 bin and charts directories already exist; skipping extract
INFO[0011] Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-csi.yaml to set cluster configuration values
INFO[0011] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml to set cluster configuration values
INFO[0011] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-flannel.yaml to set cluster configuration values
INFO[0011] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml to set cluster configuration values
INFO[0011] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-validation-webhook.yaml
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-traefik.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-csi-driver.yaml to set cluster configuration values
INFO[0012] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml
INFO[0012] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rancher/cluster-agent.yaml
INFO[0012] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rancher/managed-chart-config.yaml
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-runtimeclasses.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-cloud-provider.yaml to set cluster configuration values
INFO[0012] No cluster configuration value changes necessary for manifest /var/lib/rancher/rke2/server/manifests/rancher/addons.yaml
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-canal.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-multus.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico-crd.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-traefik-crd.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-cpi.yaml to set cluster configuration values
INFO[0012] Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller-crd.yaml to set cluster configuration values
INFO[0012] Logging containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log
INFO[0012] Running containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml
INFO[0013] containerd is now running
INFO[0013] Pulling images from /var/lib/rancher/rke2/agent/images/cloud-controller-manager-image.txt
INFO[0013] Pulling image index.docker.io/rancher/rke2-cloud-provider:v1.32.0-rc3.0.20241220224140-68fbd1a6b543-build20250101
WARN[0015] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
WARN[0020] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
INFO[0021] Polling for API server readiness: GET /readyz failed: Get "https://127.0.0.1:6443/readyz?timeout=15s&verbose=": EOF
WARN[0025] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
ERRO[0029] Error encountered while importing /var/lib/rancher/rke2/agent/images/cloud-controller-manager-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/cloud-controller-manager-image.txt: image "index.docker.io/rancher/rke2-cloud-provider:v1.32.0-rc3.0.20241220224140-68fbd1a6b543-build20250101": not found
INFO[0029] Pulling images from /var/lib/rancher/rke2/agent/images/etcd-image.txt
INFO[0029] Pulling image index.docker.io/rancher/hardened-etcd:v3.5.21-k3s1-build20250411
WARN[0030] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
WARN[0035] Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
ERRO[0036] Error encountered while importing /var/lib/rancher/rke2/agent/images/etcd-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/etcd-image.txt: image "index.docker.io/rancher/hardened-etcd:v3.5.21-k3s1-build20250411": not found
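With this much output, it can help to collapse it down to the distinct warnings and errors before reading it line by line. Here is a rough Python sketch that does that, assuming the output has been saved with one entry per line in the LEVEL[seconds] message shape seen above. The function name summarize and the grouping rule (everything before the first ": ") are my own choices for illustration, not part of RKE2 or Harvester.

```python
import re
from collections import Counter

# Matches entries shaped like "WARN[0010] message text", as in the output above.
LINE_RE = re.compile(r"^(INFO|WARN|ERRO|FATA)\[(\d+)\]\s+(.*)$")


def summarize(log_text):
    """Count non-INFO entries, grouped by level and leading message text.

    Hypothetical helper: the grouping key is the part of the message
    before the first ": ", so repeated failures collapse into one row.
    """
    counts = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not a log entry (for example the command line itself)
        level, _elapsed, message = match.groups()
        if level == "INFO":
            continue
        key = message.split(": ", 1)[0]
        counts[(level, key)] += 1
    return counts


def print_summary(log_text):
    """Print the grouped entries, most frequent first."""
    for (level, key), n in summarize(log_text).most_common():
        print(f"{n:3d}  {level}  {key}")
```

Run against the output above, this should reduce it to three kinds of entries: the CIS mode warning, the repeated "Failed to get apiserver address from etcd" warning, and one "Error encountered while importing" error per image list file.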
h
That’s weird - usually that would mean a network break between the nodes. Is there a firewall in the way, or a VLAN mismatch on the switch ports? Maybe one of your bond interface ports doesn’t have the same config? Regardless, I’ve never seen this and have never needed a reset either. Can you make sure the nodes' time server is reachable and the network between them is correctly configured, then power the entire cluster off and back on?
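One way to test the connectivity part from each node is a plain TCP connect against the ports that show up in the log (6443, 9345, 2379), plus 2380, the standard etcd peer port. This is only a minimal Python sketch: port_open and check_nodes are hypothetical helper names, and any node names other than orion1 would be placeholders to swap for your own hostnames or IPs.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def check_nodes(nodes, ports, timeout=2.0):
    """Probe every (node, port) pair and return a dict of reachability results."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in nodes
        for port in ports
    }
```

Calling check_nodes(["orion1", ...], [6443, 9345, 2379, 2380]) from each node in turn gives a quick matrix. A port that is closed from every node, including the node itself, points at the service being down; a port that is closed only from the other nodes points more toward a firewall or VLAN problem.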
b
Very weird considering the failures are on the local loopback (127.0.0.1) ports. A power cycle would be my go-to move too. I've seen the kernel get into really weird states before (not just in Harvester) where basic modules stop functioning: Ceph, then encryption failures, etc. A power cycle on the box normally fixes it.