# rke2
a
sooo i went to upgrade my kubernetes version today, and i think it just fubared my entire cluster.... any suggestions?
b
Do you have etcd backups?
a
i thought i did but they appear to be completely gone now
b
Cause I'm pretty sure you're going to need to restore a backup from at least one of the nodes. To simplify it, I'd probably scale down to 1 (since they're just VMs), get a backup, restore it, then scale up to three again.
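For context, the single-node restore being described generally looks something like the sketch below, assuming a local snapshot still exists under the default RKE2 snapshot directory (the snapshot name is a placeholder):

```bash
# stop the rke2 server service before resetting etcd
systemctl stop rke2-server

# list whatever local snapshots are still on disk (default location)
ls /var/lib/rancher/rke2/server/db/snapshots/

# reset the cluster to a single etcd member and restore from one snapshot
# (<snapshot-name> is a placeholder)
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>

# bring the service back up once the reset completes
systemctl start rke2-server
```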
You weren't backing up to S3 or nfs or something?
a
it looks like it deleted all the old nodes, and no i was not, that's my screw up
b
Do you have vmware backups of your old nodes?
a
No 😕
b
Well.
I don't see much else you can do.
a
we do have snapshots from the nimble of the entire datastore
but thats it
b
Unless the snapshots have the etcd backups, or disks from the VMs that used to be your nodes, then you have nothing to restore.
At least you identified some issues with your backup and DR strategies.
a
sounds like i'm making a new cluster then
that sucks
b
Sounds like it. If you have paid support through SUSE, it might be worth reaching out to them.
a
we dont sadly
b
I'm sorry, that sucks. This is the kind of event where that support is worth its weight in gold.
a
deploying a new cluster now, definitely going to start backing up etcd externally.... hard lesson learned. luckily i have a database dump of our mysql db from last week
almost got everything back up.....
and we are back
definitely setting up s3 backup for etcd now
luckily i have an awesome ci/cd workflow which made redeploying slick
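A rough sketch of what scheduled S3 etcd snapshots can look like in the RKE2 server config; the schedule, endpoint, bucket, and credentials below are placeholders, not values taken from this conversation:

```yaml
# /etc/rancher/rke2/config.yaml on the server nodes (placeholder values)
etcd-snapshot-schedule-cron: "0 */6 * * *"   # take a snapshot every 6 hours
etcd-snapshot-retention: 20                  # keep the last 20 snapshots
etcd-s3: true
etcd-s3-endpoint: "<s3-endpoint>:<port>"
etcd-s3-bucket: "rke2-snapshots"
etcd-s3-access-key: "<access-key>"
etcd-s3-secret-key: "<secret-key>"
```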
b
And you tested a worst-case scenario!
a
and we're back! now to identify an s3 compatible location to back up etcd......
dumb question: we just deployed an s3 compatible resource in our environment. for the endpoint, if it's not going to a dns name, can we just put the ip and port, or do we also need to include the protocol?
b
let me look at how we have ours set up
We do have a dns name to our external ceph cluster, but from what it looks like it can just be
<ip>:<port>
ie:
10.10.1.150:8888
there's no
s3://
or other stuff in my configs
region is empty
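In config terms that format would look roughly like this, mirroring the description above rather than anyone's exact config (the IP and port are just the example values given):

```yaml
# endpoint is just host:port, no scheme, region left empty
etcd-s3-endpoint: "10.10.1.150:8888"
etcd-s3-region: ""
```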
a
awesome! so i just plopped in 192.168.70.66:9000 and deployed a quick test cluster to do some testing, and we shall see. i just deployed an HA minio system with a couple of ubuntu servers. not perfect, but it'll do (this is all non-critical data also)
b
You could potentially do a local backup and have some sort of cron script to go grab the files and stash them somewhere. We did that for a while too. Much hackier, but if it works, it works.
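The hacky cron approach being described might look something like this; the destination path is made up for illustration, only the source path is the default RKE2 snapshot directory:

```bash
#!/bin/sh
# /etc/cron.daily/copy-etcd-snapshots (illustrative only, destination is a placeholder)
# copy local rke2 etcd snapshots to an NFS mount or any off-node location
SRC=/var/lib/rancher/rke2/server/db/snapshots
DEST=/mnt/backups/etcd-snapshots/$(hostname)

mkdir -p "$DEST"
# rsync only copies new or changed snapshot files
rsync -a "$SRC/" "$DEST/"
```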
The bucket is probably better. 🙂
a
yea definitely going for s3
it seems it did not make a snapshot... time to debug a bit
okay, think i solved it: endpoint=http://192.168.70.66:9000 to communicate with s3. this is my load balancer ip for minio outside of kubernetes. it seems to work when using awscli and boto3 in python, so now to test in rancher on my dev cluster
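A quick sanity check against that kind of endpoint from the CLI can look like the following; the bucket is assumed to already exist in MinIO, and awscli picks up credentials from its usual config or environment variables:

```bash
# point awscli at the MinIO load balancer instead of AWS
# (assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are already exported)
aws --endpoint-url http://192.168.70.66:9000 s3 ls
aws --endpoint-url http://192.168.70.66:9000 s3 cp ./test.txt s3://rke2-snapshots/test.txt
```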
once satisfied, i guess i can wrap it into s3.domain.com using nginx to map to the ip:port, save the headaches, and add ssl
weird issue: i can write files to s3 with python all day long but etcd backups never hit
```
root@test-pool1-n2z98-4n6hn:/etc/rancher/rke2# rke2 etcd-snapshot save --config /dev/null --s3 --s3-endpoint 192.168.70.66:9000 --s3-skip-ssl-verify --s3-insecure --s3-bucket rke2-snapshots
INFO[0033] Snapshot on-demand-test-pool1-n2z98-4n6hn-1744227134 saved.
```
works when i test on a public bucket, will add auth once i get this working. but when i use the gui it never seems to work
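For reference, adding credentials to the same on-demand snapshot command would look roughly like this (the key values are placeholders):

```bash
# same on-demand snapshot, but with explicit S3 credentials (placeholder values)
rke2 etcd-snapshot save \
  --s3 \
  --s3-endpoint 192.168.70.66:9000 \
  --s3-insecure \
  --s3-bucket rke2-snapshots \
  --s3-access-key "<access-key>" \
  --s3-secret-key "<secret-key>"
```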