# rke2
w
Something is blocking the provisioning. I've tried removing the node pools and making new ones-- didn't help.
I'm not even sure where the logs are for seeing what is blocked
i have a MachineDeployment that gets created, state is Active, but no MachineSets.
thankfully i have backups of my rancher db too 😄
do snapshots get deleted from s3 after they are restored?!
I see the snapshot in rancher UI, but not in my bucket?
says
--cluster-reset-restore-path=
why is it trying to look on the file system AND s3, failing when it finds it in S3 but not on the file system, then removing it from s3? and now it can't restore because it's only on the file system?
m
IIRC, it defaults to S3 even if you have local. There was an S3 disable arg to get it to use local instead of S3.
Note that you still need the agent token to be present, so if you didn't back that up before deleting all the nodes you might be screwed
But if you have the agent token and a local backup you should be able to make it work by bootstrapping it and then adding the other CP and agent nodes. Not a bad idea to restore the other CP nodes from the same backup before rejoining them
w
Docs seem to say you don't need token if it's an existing node hmm
Good thing I have backups of all 3 nodes+rancher 😂
I'm using vSphere so I've been snapshotting them and reverting in like 10 secs when it fails haha
Thanks Scott
m
The token gets written to the RKE2 systemd service and in /var/lib/rancher/rke2/agent/. If RKE2 is uninstalled and those get deleted then you need to supply the token again.
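(For reference, a quick way to check whether the token is still around on a server node -- a hedged sketch using standard rke2 paths; Rancher-provisioned nodes may keep it in the config written by rancher-system-agent:)
```
# join token on an existing server node
cat /var/lib/rancher/rke2/server/node-token

# if it has to be supplied again, it can go in the rke2 config file:
# /etc/rancher/rke2/config.yaml
#   token: <value from above>
```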
w
yeah makes sense, i should be good because i'm not removing anything, and have the nodes
πŸ‘ 1
is the token the same for all nodes?
do i still do cluster reset?
m
Token is same for all nodes
w
so the 2 104-day ones are the ones i started with
i had 3, but rancher deleted one for me 😄
the 16hr running one i guess got replicated lol
with the error ones, i think RKE2 gets uninstalled when that happens.
m
Pass
--etcd-s3=false
for it to use the local data instead of s3
w
ohh thank you!
so what i am trying to do:
• Restore rancher to a known good working state prior to the breakage -- my cluster was stuck reconciling, waiting for a control plane, worker, etcd, however it wouldn't make any lol
• once rancher is operational for the cluster again, then restore the etcd backup from a future time hah
m
Then
--cluster-reset-restore-path=
is the full path to your local backup
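(Putting the two flags together, the reset/restore on the first node ends up looking roughly like this -- the snapshot filename is a placeholder for whatever is actually sitting in the snapshots directory:)
```
systemctl stop rke2-server
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name> \
  --etcd-s3=false
```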
w
yeah it was strange. without the s3 flag it looked at S3 and complained the key didn't exist. then i corrected it to match the s3 path, it found it, then it complained the s3 path didn't exist in the snapshots directory
m
Probably in /var/lib/rancher/rke2/server/db/snapshots
w
yeah that is correct path
so if i only have 1 node do i still need to do a reset?
or just a restore
m
Good question. It's been a minute since I've done it
If you're down to one node, then I believe the answer is yes - https://docs.rke2.io/backup_restore#cluster-reset
w
Ok
Ty
m
Remove the server/db directory on the other CP nodes and agents before starting the RKE2 service
Except add the
--etcd-s3=false
option when you restore on the first node
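(Roughly what that looks like on the other etcd/control-plane nodes once the first node is back up -- a sketch assuming Rancher-provisioned nodes where rancher-system-agent is present:)
```
# on each remaining etcd / control-plane node, before starting rke2 again
systemctl stop rke2-server
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server

# and make sure the Rancher agent is running so provisioning can catch up
systemctl restart rancher-system-agent
```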
w
thank you Scott
m
Good luck!
w
hehe yeah
i mean it's just my home lab but i really liked that pet
i had just gotten everything working and solved my metalb/arp stuff
I have messed up so many clusters in rancher w/ failed reconciles. First time i broke a cluster when restoring a snapshot
m
Oh man, yeah, I hope you can save it 😬
w
heh, yeah I need to find some good docs on setting up cilium w/o kubeproxy
Cp? You mean etcd?
m
CP = control plane, assuming you are running etcd on the control plane nodes.
w
ahh no
each in their own pools
m
Ah, then the etcd nodes
w
😉 yeah that's what i figured
m
Any luck?
w
heh i restored a backup of my rancher node because it started returning 404. I restored from 2 days ago.
root@rancher-02:/var/log# kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   helm-install-traefik-crd-kxrsn            0/1     Completed   0          76m
kube-system   helm-install-traefik-cfqmc                0/1     Completed   2          76m
kube-system   local-path-provisioner-687d6d7765-7qx8p   1/1     Running     0          76m
kube-system   svclb-traefik-0abeebb0-bnk28              2/2     Running     0          73m
kube-system   coredns-7b5bbc6644-bc8dm                  1/1     Running     0          76m
kube-system   traefik-64b96ccbcd-v728v                  1/1     Running     0          73m
kube-system   metrics-server-667586758d-ht9nn           1/1     Running     0          76m
my rancher is gone?
lol so i use kind, and postgres backs it. so the only thing i can think of is my db is borked? from a 2 day ago backup, where it was known good? lol
i wonder if it's because i forgot to shutdown rancher vm before restoring DB
hmm strange. rancher wrote 400mb to it
HAHAHA
i have 2 postgres pods on that host. they "switched" ports somehow and one has an empty kine db.
{"level":"info","ts":"2023-10-02T22:03:10.874835Z","caller":"rafthttp/transport.go:355","msg":"removed remote peer","local-member-id":"82b4997c08526da6","removed-remote-peer-id":"53b8bb1243338979"}
panic: removed all voters

goroutine 229 [running]:
go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0x0?, {0x0, {0xc003202d10, 0x1, 0x1}, {0x0, 0x0, 0x0}})
        /go/pkg/mod/github.com/k3s-io/etcd/raft/v3@v3.5.9-k3s1/raft.go:1633 +0x1d4
go.etcd.io/etcd/raft/v3.(*node).run(0xc0008bd7a0)
        /go/pkg/mod/github.com/k3s-io/etcd/raft/v3@v3.5.9-k3s1/node.go:360 +0xaf7
created by go.etcd.io/etcd/raft/v3.RestartNode
        /go/pkg/mod/github.com/k3s-io/etcd/raft/v3@v3.5.9-k3s1/node.go:244 +0x24a
ok removing etcd/* worked
odd, it's trying to talk to other etcd nodes
root@production-home-etcd-646e4b00-qvm24:/var/lib/rancher/rke2/server/db# rke2 server   --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-production-home-etcd-646e4b00-qvm24-1696104001 --etcd-s3=false
WARN[0000] not running in CIS mode
INFO[0000] Applying Pod Security Admission Configuration
INFO[0000] Static pod cleanup in progress
INFO[0000] Logging temporary containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log
INFO[0000] Running temporary containerd /var/lib/rancher/rke2/bin/containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
INFO[0010] Static pod cleanup completed successfully
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-agent-load-balancer.json: no such file or directory
WARN[0010] remove /var/lib/rancher/rke2/agent/etc/rke2-api-server-agent-load-balancer.json: no such file or directory
INFO[0010] Starting rke2 v1.26.8+rke2r1 (6fc8479d8b95283b1422ad77cb3da6c9132374d2)
FATA[0016] starting kubernetes: preparing server: failed to get CA certs: Get "https://172.16.1.167:9345/cacerts": dial tcp 172.16.1.167:9345: connect: no route to host
that ip is an old etcd
deleted them from rancher
i'll try this later. strange
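(That "failed to get CA certs" error usually means the node's rke2 config still has a server: entry pointing at a node that no longer exists. A quick, hedged check -- on Rancher-provisioned nodes the config is typically split across config.yaml.d:)
```
grep -r "server:" /etc/rancher/rke2/
# e.g. config.yaml.d/50-rancher.yaml may still point at the old etcd node's IP
```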
m
Were you trying to manage the same cluster with both Rancher and kind or were you running kind somewhere else with a postgres database running in RKE2?
Yeah, I mentioned clearing /var/lib/rancher/rke2/server/db before joining the other nodes. The old pod data might still be there until the dust settles. Were you using local hostpath for storage or something else? Longhorn, Ceph, etc?
If you joined the other etcd nodes before deleting the stale data, then the stale data might win over the backup if it hits quorum
w
The node it was trying to reach is dead. Data is vSphere csi
It was looking for a node that existed at the time of the backup.
rke2 server --cluster-reset causes a panic lol
had to clear out the etcd directory
πŸ‘ 1
m
I recall something about this: the etcd member controller runs on the etcd leader, so if the current etcd leader goes down, deleting a node wouldn't actually remove it from etcd. It impacted clusters with separate etcd and CP nodes
There was an issue for it in github recently. I think the stale members can be manually removed with etcdctl
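(If it comes to manually removing a stale member, it looks roughly like this -- a sketch; rke2 doesn't ship etcdctl, so you'd need to grab the binary separately, and the TLS paths assume a default rke2 layout. <member-id> is a placeholder:)
```
# list members using the rke2-managed etcd client certs
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list

# then drop the dead one by the ID shown in the first column
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member remove <member-id>
```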
w
Hmm so if I add a node to etcd will it rebuild the machine or just deploy it on the existing?
Heh will find out just made the change
Haha it deployed a new node
hmm
cattle.io/cn-2600_1700_1ce0_c4bf__e0-e9acbc:2600:1700:1ce0:c4bf::e0
listener.cattle.io/cn-__1-f16284:::1
listener.cattle.io/cn-kubernetes:kubernetes
listener.cattle.io/cn-kubernetes.default:kubernetes.default
listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc
listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local
listener.cattle.io/cn-localhost:localhost
listener.cattle.io/cn-production-home-cp-1c6758a3-bkl59:production-home-cp-1c6758a3-bkl59
listener.cattle.io/cn-production-home-cp-1c6758a3-ftmr6:production-home-cp-1c6758a3-ftmr6
listener.cattle.io/cn-production-home-cp-1c6758a3-wbptl:production-home-cp-1c6758a3-wbptl
listener.cattle.io/cn-production-home-cp-47e6a5f9-2bqk9:production-home-cp-47e6a5f9-2bqk9
listener.cattle.io/cn-production-home-cp-47e6a5f9-2m44h:production-home-cp-47e6a5f9-2m44h
listener.cattle.io/cn-production-home-cp-47e6a5f9-7tg6c:production-home-cp-47e6a5f9-7tg6c
listener.cattle.io/cn-production-home-cp-47e6a5f9-97vst:production-home-cp-47e6a5f9-97vst
listener.cattle.io/cn-production-home-cp-47e6a5f9-9mt25:production-home-cp-47e6a5f9-9mt25
listener.cattle.io/cn-production-home-cp-47e6a5f9-9tgs6:production-home-cp-47e6a5f9-9tgs6
listener.cattle.io/cn-production-home-cp-47e6a5f9-brq2h:production-home-cp-47e6a5f9-brq2h
listener.cattle.io/cn-production-home-cp-47e6a5f9-hp4m9:production-home-cp-47e6a5f9-hp4m9
listener.cattle.io/cn-production-home-cp-47e6a5f9-snvp4:production-home-cp-47e6a5f9-snvp4
listener.cattle.io/cn-production-home-cp-47e6a5f9-vjpfl:production-home-cp-47e6a5f9-vjpfl
listener.cattle.io/cn-production-home-cp-47e6a5f9-wbbwv:production-home-cp-47e6a5f9-wbbwv
listener.cattle.io/cn-production-home-cp-b91dac01-2ckzj:production-home-cp-b91dac01-2ckzj
listener.cattle.io/cn-production-home-cp-b91dac01-9b55j:production-home-cp-b91dac01-9b55j
listener.cattle.io/cn-production-home-cp-b91dac01-g4kb9:production-home-cp-b91dac01-g4kb9
listener.cattle.io/cn-production-home-cp-b91dac01-xrgrg:production-home-cp-b91dac01-xrgrg
listener.cattle.io/cn-production-home-cp-bd9306a2-pw7h4:production-home-cp-bd9306a2-pw7h4
listener.cattle.io/cn-production-home-cp-bd9306a2-vcdwb:production-home-cp-bd9306a2-vcdwb
listener.cattle.io/cn-production-home-cp-bd9306a2-wvfgr:production-home-cp-bd9306a2-wvfgr
listener.cattle.io/cn-production-home-cp-de9b0e1e-7r9t5:production-home-cp-de9b0e1e-7r9t5
listener.cattle.io/cn-production-home-cp-de9b0e1e-8fwxt:production-home-cp-de9b0e1e-8fwxt
listener.cattle.io/cn-production-home-etcd-646e4b00-dd7sd:production-home-etcd-646e4b00-dd7sd
listener.cattle.io/cn-production-home-etcd-646e4b00-k5rhp:production-home-etcd-646e4b00-k5rhp
listener.cattle.io/cn-production-home-etcd-646e4b00-p5p4n:production-home-etcd-646e4b00-p5p4n
listener.cattle.io/cn-production-home-etcd-646e4b00-qvm24:production-home-etcd-646e4b00-qvm24
listener.cattle.io/cn-production-home-etcd-646e4b00-wtmt2:production-home-etcd-646e4b00-wtmt2
listener.cattle.io/cn-production-home-etcd-646e4b00-zk2sb:production-home-etcd-646e4b00-zk2sb
listener.cattle.io/fingerprint:SHA1=D66CAF4BE3B4AA14BB9890D3FCF178F745D7020A]"
Those nodes are the old ones, and there are no CP nodes.
heh
so i removed all the etcd nodes.
rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required.
I restored etcd only from snapshot. Error applying plan -- check rancher-system-agent.service logs on node for more information
haha
ok. strange. very strange, I am back where i started here, unable to deploy anything. It's waiting for stuff to get registered but won't actually spawn anything. No MachineSets or anything
heh you require etcd to restore the cluster configuration, which lives within rancher?
m
Re "rkecontrolplane was already initialized but no etcd machines exist that have plans, indicating the etcd plane has been entirely replaced. Restoration from etcd snapshot is required." - This is when you would use
--cluster-reset
with the snapshot restore.
Ah, this is a downstream cluster you provisioned with Rancher and not a standalone RKE2 cluster. In the future, I would have first attempted the etcd restore from Cluster Management in Rancher. I think you can still save it though if you get etcd restored and then the rancher-system-agent running on the nodes.
w
restore in rancher is how i got into this mess 🙂
hmm ok
m
Yeah, I understand 🙂
I've had a Rancher restore go sideways because of an s3 problem, which is how I discovered the --etcd-s3=false flag
w
lol
does it not download the file first then make changes?
i can see it now:
1. reset the cluster
2. download the backup, but it fails
3. ???
m
reset cluster and restore happen together
w
but the file needs to be downloaded first?
m
The local backups done through the Rancher UI should already be there
w
sure
m
In /var/lib/rancher/rke2/server/db/snapshots, iirc
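(Quick sanity check of what's actually on disk before pointing the restore at it:)
```
ls -lh /var/lib/rancher/rke2/server/db/snapshots/
```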
w
i gotcha, just saying it's weird S3 can make a restore break a cluster.
if the file was downloaded first, before changes were made, it shouldn't be an issue
🤷🏼
I wonder if thats what happened to me
m
If you don't specify --etcd-s3=false and s3 download fails for some reason (bad cert, s3 is down), then what will happen is etcd will end up defaulting to empty and will come up with no pods, etc.
w
wow that is insane
why wouldn't the first step be to stage the file!?
"if file download fails, bail out and make no changes"
m
No idea - I don't grok why it defaults to s3 if the local backup already exists.
w
either way
why mutate state of the environment if the backup download fails or the backup fails to unzip or whatever
m
How it is and why it is are two different questions. I can take a crack at how it is from my experience. The why, someone else will need to answer, lol.
w
heh
i'm just sayin.. doesn't seem safe
I've had every cluster I've ever made using rancher fail because it allowed me to break etcd.
I'm thinking rancher isn't the best way to manage a cluster unfortunately.
m
Best very much depends on your use case. I discovered this when my team was intentionally trying to break and restore a cluster we spun up strictly for that purpose. I've never had to do this in production or long-term Rancher clusters 🤞
But that said, you'll want to cluster-reset on one etcd node with --etcd-s3=false, pointing to your local snapshot. Then join the other etcd nodes after clearing /var/lib/rancher/rke2/server/db/etcd. Then join the control plane nodes. You should also start/restart the rancher-system-agent on those nodes as you go.
w
heh i have no nodes now lol
m
Yeah, this is why you're bootstrapping and cluster-resetting. --cluster-reset = ignore any etcd node history and just make a new cluster with this one node
w
right but can't run that if no nodes
😉
m
So, I'm hoping that the nodes will make it back on their own, but you can also do this manually with RKE2 and import the cluster into Rancher after the fact.
w
i reset, had a working single etcd node. It was still looking for old nodes, it didn't clean those up in the rke2 service logs. I then tried to restore a backup of etcd only, and it broke the etcd node. I tried to create a new one, here i am
Don't you lose functionality because you are importing a cluster?
m
w
thanks Scott
m
"The ability to see a read-only version of the cluster's configuration arguments and environment variables used to launch each node in the cluster" - I think this is the biggest difference is you can't edit the cluster.yaml directly in the web view
Very much at your own risk, it's still technically possible:
kubectl edit cluster.management.cattle.io -n fleet-local
or
kubectl edit cluster.management.cattle.io -n fleet-default
I don't know the ramifications of touching it, though 🙂
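(A lower-risk way to just inspect it, using get instead of edit -- <cluster-name> is a placeholder for whatever the cluster object is called:)
```
kubectl get cluster.management.cattle.io -n fleet-default
kubectl get cluster.management.cattle.io <cluster-name> -n fleet-default -o yaml
```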
w
i have a backup lol
m
Lol
w
heh what should i change
m
Probably nothing 🙂.
At this point you just need to get your data back
w
yeah
it is strange how with 0 nodes rancher will not make a new one.
m
You have zero machines to work with, so I'm assuming you are not using Elemental for provisioning them.
w
no.
What is elemental?
I use vsphere in rancher
m
Elemental is an immutable operating system from SUSE where you define the OS with cloud-data; it generates an ISO that you then boot the machine with, and Rancher can manage the OS itself from there. So when you delete an Elemental-managed machine, Elemental can be set up to reinstall the OS and reboot it so that it's ready to be provisioned back into a cluster
w
i'll read about it
m
cloud-data is like cloud-init user-data. It doesn't mean it has to run in the cloud. I'm deploying bare metal RKE2 nodes with it now.
w
I'm familiar.
how does that differ from using ubuntu images that I don't configure and letting rancher deploy rke2?
m
To be clear, I don't think it fixes your current situation at this very moment.
w
oh yea i get it 😄
or are you saying it's kinda like using a container image and storing state outside of containers in volumes?
m
They're immutable and Rancher can handle the updates. It's sort of like Red Hat CoreOS or Fedora CoreOS, but the under-the-hood architecture is significantly different.
w
and node identity is preserved
m
It's not even "like" - the OS is literally a container image.
w
"we can just swap out the chassis"
hmm where have i seen that before... 🤔
m
TalosLinux?
w
RancherOS 😄
m
lol
w
I got burned by RancherOS, k3OS
m
I've never used those. I have used RHCOS and TalosLinux before, though
w
"welp folks, thanks, been fun, but this is now EOL"
m
Elemental is sort of a mash up of RHCOS and Talos.
w
i also wanna be clear-- I can appreciate how difficult orchestration and scheduling is.
I work at a public cloud company 🙂
m
Like the deployment and architecture is similar to Talos, but it's still SUSE based and therefore resembles RHCOS in the sense that you can ssh to it and it still feels like a Linux environment until you try to make any changes to it, that is 😅
Nice!
w
and i don't want folks to think i'm trashing the team here either.
This stuff is extremely complex and there are SO MANY edge cases
m
I work for a very large public organization and I'm using Rancher now in no small part because public cloud costs scale with the usage 😅
You might appreciate this - We might be one of the last shops with Eucalyptus still running as well.
w
hehe
I haven't used it, but i've heard of it
m
Yeah - 10 years ago, Eucalyptus and OpenStack were the two major private cloud platforms. Eucalyptus got bought by HP and more or less died on the vine. Rackspace got bought by a VC and... technically still exists, but it's not the same company it once was.
Anyway, Euca is/was basically an on-prem, private AWS built on top of libvirt and ceph.
I have a couple of k3s instances running in Eucalyptus right now, hehe
w
hehe ok i restored the backup of rancher
it keeps spawning new machines because it can't find any. up to 300
m
I assume s/machines/pods - so that's good!
w
no
m
New VMs?
w
yeah so apparently if rancher can't find any nodes, it'll just keep making them
m
Makes sense - at this point, I would let Rancher try to make it work from here
Ah, yeah, you're probably using the VMWare provisioner
w
huh
yeah
but still, if the max in a node pool is N, why would it try to do 300
m
It looks like Rancher has enough to work with to rebuild things.
And it looks like 256 etcd machines?
Also, congratulations on having a home cluster that can provision 300 VMs on the fly
w
You mean moneypit
Yeah, buying those old Dell servers is great. The servers are older than my kids
So like where are the docs to just do rke2 without rancher?
m
That one etcd server says 106 days
So I think the restore worked. I don't understand why it provisioned 200-something etcd nodes
That doesn't seem right
w
Oh I restored a backup of rancher db
Right lol
m
And you had 256 etcd nodes?
No wonder you broke etcd!
w
No 3
m
Ah, then the 256 etcd nodes remains a mystery
w
Yeah
Rancher did it
m
but if it can get down to 3, Rancher should have enough to go on to rebuild things from there
How big is the machinepool for etcd?
If you go to edit cluster, can you verify how many etcd nodes are in the pool?
w
I will when I get home. I went to go pick up the kids
How do you manage your RKE2 deployments? CI? Is there tooling for it?
Kustomize?
m
Deploying things on RKE2 or deploying RKE2 itself?
Deploying things on RKE2, mainly with helm/git with Jenkins for the CI/CD, leveraging different clusters for dev/staging/prod environments. For some basic stuff, like automating cert-manager, I've started using fleet from Rancher. For deploying RKE2 itself, mainly doing Elemental on bare metal. I define all the OS specific stuff in a registration endpoint and provision the baremetal nodes with them via BMC and use Rancher/Fleet/Elemental to manage it from there. https://github.com/rancher/elemental-operator#inventory-management
For k3s stuff, I deploy OpenSUSE Leap Transactional Server since it pretty well includes everything k3s needs right out of box and then use the https://github.com/rancher/system-upgrade-controller to automate the OS (based on the SLE Micro example) and k3s upgrades from there.
I've been running k3s much longer than Rancher itself and a lot of how something was deployed/managed came down to its particular use case.
# kubectl get node
NAME                   STATUS   ROLES                  AGE    VERSION
<Redacted Hostname>    Ready    control-plane,master   656d   v1.27.6+k3s1
^^ This cluster is running on CentOS Stream 9.
At my previous two employers, I administered OpenShift and briefly did here as well, before replacing it with Rancher. It was interesting coming to k3s from OpenShift, since everything about k3s is so much simpler. Also do a whole lot of GKE at the moment, as well.
w
Cool
m
The GKE stuff is all currently deployed through Terraform, but we're moving away from it. Jenkins+Ansible AWX is where most of the GitOps is happening now.
w
Why use k3s over k8s?
m
Like vanilla k8s or another distribution?
I initially picked k3s for something where I needed to deploy something for infrastructure support that I wanted to live internally and needed to run on Kubernetes but didn't have a cluster handy yet. I also liked that k3s ran natively with a BYO Linux distro and didn't depend on bringing up a bunch of VMs. And since it was initially a prototype, it was nice to be able to start simple with one node and then scale it out later if we needed to. k3s, like OpenShift, is also opinionated out of box, but in sort of the exact opposite direction of OpenShift which throws every bell and whistle at you. k3s was a quick and easy way to get from scratch to kubectl on a fresh server.
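(For context, that "scratch to kubectl" path is basically the documented one-line installer -- run on the server itself, pipe-to-shell at your own comfort level:)
```
curl -sfL https://get.k3s.io | sh -
# a minute or so later, as root on that node:
kubectl get node
```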
w
in general k8s
like what was in the "5" that were removed
m
There are other tiny distributions now (k0s from Mirantis, for example), but at the time most other small, quick and dirty, kubernetes deals ran VMs and/or were more geared for a developer's laptop than a server environment.
AFAIK, the big thing k3s removes from k8s out of box is a bunch of cloud-provider specific drivers, that can be added back in. k8s upstream includes all the Google, Azure, Amazon, etc. stuff regardless of where you deploy it. k3s defaults to local storage and if you want more than that, deploy a CSI. But k3s is also opinionated in that it gives you a CNI (Flannel) and a default storageClass (host path) out of box, where you have to bring those yourself with default k8s.
w
ok
i'm gonna give rancher one more try
and i'm never touching the etcd node pool again.
m
lol
w
but for real though, you should not be able to accidentally delete all etcd nodes in a cluster.
m
out of curiosity, why not just run etcd on the control plane nodes?
That's how most k8s setups are.
w
My CP nodes tend to run hot
i basically had no workloads running but they ate up 8gb of ram fast and high cpu
m
Aren't they all VMs?
w
yea
m
Different ESXi hosts?
w
but so if they are eating ram and CPU, then etcd suffers
not all of them
but yea
i keep them on the smaller side so they can be migrated across the esx cluster
i run vsan at home 😄
m
Generally control plane nodes shouldn't be taking up that many resources.
Running the API and etcd is most all of what they should be doing if you have worker nodes.
w
i guess i can experiment
but also, if i had CP + etcd, i'd have been able to add nodes more easily i imagine?
m
What are you doing for storage?
Is it a NAS?
w
no
vSan
m
Is there local affinity to the VM for the storage with vSan?
I ask because I recently had some interesting latency issues with ceph rbd backed disks for etcd under load.
w
oh
yeah i'm not seeing latency issues
m
cpu load and disk IO look the same at first glance when you're looking at top, etc. Unless you're scheduling a bunch of stuff on your control plane nodes and/or downloading a bunch of huge images on them for some reason, they shouldn't be using much in terms of resources.
w
tldr "yes"
nope not at all.
m
Yeah, at my last employer, we used VMWare with PureStorage and for the most part it was well-behaved, but taking snapshots with memory could definitely impact things.
w
oh vsan is insanely amazing
m
We're using Longhorn with RKE2 here with data affinity set and that's worked out well. It's pretty cool to have something like that on bare-metal where the pod storage is local to the host. Most everywhere else, it's a ceph shop, for better or worse. Ceph can handle a whole lot more data, but with a lot more complexity, latency, overhead, etc. For Rancher stuff, mainly using Ceph RGW for S3 backups of etcd, rancher, longhorn, etc.
w
ceph is very hard
we run very large ceph deployments
like "support millions of VMs" large
thankfully not my department
m
I'm guessing you're an OpenStack shop?
w
nope
m
FWIW, I just deleted ~3,000 GCP disks this past week with one of my SLA cleanup scripts. I'm glad I didn't try deleting that much from a ceph cluster all at once and possibly triggering a rebalance in the process.
w
i think our deletes are async
I know they are for object storage buckets
m
Yeah, I don't think RGW gets hit quite as hard as RBD does when you do that
RGW = S3 object store; RBD = block storage (like EBS)
w
yep
m
LINSTOR (aka piraeus) is another one I'd like to check out. It does local storage but uses DRBD under the hood to handle the replication where Longhorn uses iscsi.
We had a support contract with LINBIT at my last employer and they were great to work with.
We also used Portworx by Pure and it worked... but it definitely had some pain points. Pure's model is very much about paying a bunch of money to make a problem go away. The Pure stuff played well with VMWare, though, and it was ridiculously fast.
w
Heh I gave up.
Even with a single node and seeing no other nodes in etcd, it showed running in rancher, but I couldn't join new nodes. I could add new etcd nodes and they'd join and replicate the data across the cluster, however they wouldn't show as running in rancher. I couldn't spin up control plane nodes either
m
Did you start the rancher-system-agent on them?
w
I just made a new cluster 😂