# rke2
p
check on the manager side
w
what is the manager?
p
rancher manager
w
oh btw, i'm recovering from an infrastructure failure. I have a backup of rancher + etcd. I probably have to put etcd in single-node mode?
p
etcd should recover by itself i think, but do you have any ip change?
w
no i only restored one of the etcd nodes
p
what do the rke2 logs say? waiting to apply plan?
w
yep
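A minimal sketch of tailing those logs on the node, assuming a standard systemd-managed rke2 install:
# server (control plane / etcd) nodes
journalctl -u rke2-server -f
# agent-only (worker) nodes
journalctl -u rke2-agent -f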
p
And on the provision log of the cluster?
w
give me a few, not by my pc. Gonna paste shortly
thank you
ok, i restored all 5 etcd instances. That was the problem: 1 restored node, and it was looking for its friends 😄 i have etcd + api on one VM for this reason. so now it creates VMs and shuts them down within a minute
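A sketch of confirming which peers a restored member still expects, assuming default rke2 cert paths (etcdctl is not shipped with rke2, so it has to be installed or run from the etcd container):
export ETCDCTL_API=3
etcdctl member list -w table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key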
yeah it's trying to do old machine sets
Deleted the old ones so the new ones would get picked up
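Roughly what that cleanup looks like against the Rancher manager cluster; the machineset name below is a placeholder:
# list the CAPI machine sets for the downstream cluster
kubectl -n fleet-default get machinesets.cluster.x-k8s.io
# delete a stale one so the controller recreates machines from the current spec
kubectl -n fleet-default delete machinesets.cluster.x-k8s.io <old-machineset>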
so i guess that is what it was doing before: it was waiting for the etcd nodes. I suspect it could have been resolved by ssh'ing into the single etcd node and resetting it to a single-node cluster
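For reference, a sketch of that single-node reset on the surviving etcd node, per the rke2 cluster-reset flow (run with the service stopped):
systemctl stop rke2-server
# rewrites the etcd member list so this node comes up as a one-member cluster
rke2 server --cluster-reset
systemctl start rke2-server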
I have ninja'd my way out of soooo many DR scenarios with rancher. I could probably offer my Rancher-Spec-Ops skillz to others 😄
i owe it to a "robust" backup strategy.
gotta see if they come online now. They're pulling images
heh, ok, why didn't they fail previously though? strange. but ok, I use Let's Encrypt but have privateCA: true, so setting it to false should fix it, because /var/lib/rancher/agent/rancher2_connection_info.json has a self-signed CA in it.
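A sketch of flipping that setting, assuming Rancher was installed via its Helm chart (release name and repo may differ):
helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --reuse-values \
  --set privateCA=false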
now it just sits there
from capi-controller-manager
I0503 22:11:16.314592       1 machine_controller_noderef.go:54] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/prod-api-only-74dfd4db77xbjjqd-x8nsp" namespace="fleet-default" name="prod-api-only-74dfd4db77xbjjqd-x8nsp" reconcileID=20c8d7d6-d476-42bd-8a32-bc2d8055d3e0 MachineSet="fleet-default/prod-api-only-74dfd4db77xbjjqd" MachineDeployment="fleet-default/prod-api-only" Cluster="fleet-default/prod" VmwarevsphereMachine="fleet-default/prod-api-only-c6a53d6d-5kfxh"
over and over from diff machines
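A sketch of checking which Machines are stuck without a providerID, run against the manager cluster:
kubectl -n fleet-default get machines.cluster.x-k8s.io \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,PROVIDERID:.spec.providerID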
doesn't make sense. why would it not be able to deploy new nodes? It's a literal VM backup of rancher + db + etcds
delivering planSecret prod-bootstrap-template-xwblk-machine-plan with token secret fleet-default/prod-bootstrap-template-xwblk-machine-plan-token-cpfz4 to system-agent
few of those
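A sketch of inspecting those plan secrets on the manager cluster; the secret name is taken from the log line above:
kubectl -n fleet-default get secrets | grep machine-plan
kubectl -n fleet-default get secret prod-bootstrap-template-xwblk-machine-plan -o yaml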
ok, 2 of the 5 etcd nodes were unavailable for whatever reason, so i scaled those pools down.
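A sketch of doing that scale-down from the CLI instead of the UI, assuming the cluster object is fleet-default/prod as in the logs:
# lower spec.rkeConfig.machinePools[].quantity for the affected etcd pool
kubectl -n fleet-default edit clusters.provisioning.cattle.io prod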
2024/05/03 22:22:01 [INFO] [planner] rkecluster fleet-default/prod: configuring control plane node(s) prod-api-only-74dfd4db77xbjjqd-sggw4,prod-api-only-74dfd4db77xbjjqd-vdhsm,prod-api-only-74dfd4db77xbjjqd-x8nsp
Fri, May 3 2024 6:22:02 pm
So that's happening. Workers are waiting i guess for API servers.
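A sketch of watching the same provisioning state from the CLI, again assuming the fleet-default/prod cluster object:
kubectl -n fleet-default get clusters.provisioning.cattle.io prod \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'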
I feel like there should be some UI messaging around "be patient, don't change the node pools"