I run a small metal kubernetes cluster managed by a single r Rancher Users #general

echoing-island-97495

11/20/2024, 6:41 PM

I run a small metal kubernetes cluster managed by a single rancher instance running in docker on a VM. We take nightly tarballs of /var/lib/rancher and in the past I've been able to upgrade by:
• stopping and deleting the rancher container
• pulling the latest stable rancher container
• starting the rancher container

If the VM is also due for an upgrade I've been able to:
• rebuild/replace the VM
• pull and start the rancher container
• stop the rancher container
• delete and replace the /var/lib/rancher contents from the tarball
• start the rancher container

This time it's failing. The docker container goes into a ~30s restart loop. Each try goes pretty much:
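For readers following along, the two procedures above might look roughly like the sketch below. It is not the poster's actual script. The container name `rancher`, the `rancher/rancher:stable` tag, the bind mount of /var/lib/rancher, the backup path, and the assumption that the tarball was created relative to `/` are all illustrative guesses. The sketch also moves the fresh data directory aside instead of deleting it, which is a more cautious substitute for the "delete and replace" step. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only. Every value below is an assumption, not taken from the post.
NAME="${NAME:-rancher}"                              # hypothetical container name
IMAGE="${IMAGE:-rancher/rancher:stable}"             # assumed image tag
DATA_DIR="${DATA_DIR:-/var/lib/rancher}"             # directory named in the post
BACKUP="${BACKUP:-/backups/rancher-latest.tar.gz}"   # hypothetical tarball path
DRY_RUN="${DRY_RUN:-1}"                              # 1 = print commands, do not run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a single-node Rancher container with the data directory bind-mounted.
start_rancher() {
    run docker run -d --name "$NAME" --restart=unless-stopped --privileged \
        -p 80:80 -p 443:443 \
        -v "$DATA_DIR:$DATA_DIR" \
        "$IMAGE"
}

# In-place upgrade: stop and delete the container, pull, start again.
upgrade_rancher() {
    run docker stop "$NAME"
    run docker rm "$NAME"
    run docker pull "$IMAGE"
    start_rancher
}

# Restore on a rebuilt VM: start once, stop, swap in the backed-up data, start.
restore_rancher() {
    run docker pull "$IMAGE"
    start_rancher
    run docker stop "$NAME"
    run mv "$DATA_DIR" "$DATA_DIR.fresh"   # kept aside rather than deleted
    run tar -xzf "$BACKUP" -C /            # assumes archive paths are relative to /
    run docker start "$NAME"
}

# Run one procedure only when asked for explicitly.
case "${1:-}" in
    upgrade) upgrade_rancher ;;
    restore) restore_rancher ;;
esac
```

Saved under a name of your choosing, running it with the argument `upgrade` or `restore` prints the planned commands; setting `DRY_RUN=0` would execute them.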

INFO: Running k3s server --cluster-init --cluster-reset
2024/11/20 18:25:13 [INFO] Rancher version v2.10.0 (df45e368c82d4027410fa4700371982b9236b7c8) is starting
2024/11/20 18:25:13 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2024/11/20 18:25:13 [INFO] Listening on /tmp/log.sock
2024/11/20 18:25:13 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:15 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:17 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:19 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:27 [INFO] Running in single server mode, will not peer connections
2024/11/20 18:25:30 [FATAL] Internal error occurred: failed calling webhook "rancher.cattle.io.namespaces.create-non-kubesystem": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.76:9443, code 502: 502 Bad Gateway

Later attempts also give instructions for removing the reset flag file if you want to try resetting again. Online searches give plenty of references to the "failed calling webhook" error but none for "Bad Gateway."

Grateful for any guidance: at this point our prod cluster is running fine and I have kubectl access to it, but I have no graceful way to manage nodes, users/RBAC and so on.

Forgot to mention: the previously running Rancher was 2.8.5, and it is now 2.10.
