# general
m
Bumping up this thread once again, any pointers on how to debug this? We logged a ticket via Rancher premium support as well, but support closed the ticket saying a Velero-based restore is not supported. We want to understand how to get to the root of it so we can figure out some workaround, because we want a vendor-agnostic backup and restore solution.
b
Hi, maybe raise an issue with Velero. I see this old one: https://github.com/vmware-tanzu/velero/issues/3081 Googling velero + runtime.go + "Observed a panic" + "invalid memory address or nil pointer" produces additional hits...
c
It looks to me like Velero isn't restoring some of the resource metadata, so when Rancher goes looking for related objects to track their relationships, it crashes because some of the required fields are missing. We have our own backup-restore operator that's intended to be used for backing up Rancher Manager state, and we also support etcd-level backup/restore. Unless the Velero folks want to dig into it to figure out what they're missing, I don't think you're going to get too much help from our side.
f
As far as I remember (former employee...), pieces of Rancher refer to/store things by UUID, for... some possibly good reason. And Velero doesn't/can't restore them, because they're automatically generated by the k8s API and can't be set without going down to etcd directly. So new objects get new UUIDs, and the stored references to the old ones no longer find the expected resource. Anyway, yes, Rancher has its own backup and restore mechanism for a reason; use it.
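(Not Rancher or Velero code, just a toy Python sketch of the failure mode described above: `ownerReferences` pin the owner's server-generated UID at creation time, and a restore that re-creates objects through the API gets fresh UIDs, so the stored reference goes stale. The `git-webhook` name is borrowed from the error later in the thread.)

```python
import uuid

# The owner object as it existed before backup; the API server assigned its uid.
owner = {"name": "git-webhook", "uid": str(uuid.uuid4())}

# A dependent object records the owner's uid in its ownerReferences.
dependent = {"ownerReferences": [{"name": owner["name"], "uid": owner["uid"]}]}

# "Restore": the object is re-created via the API with the same name,
# but uid is server-generated and cannot be carried over without writing
# to etcd directly -- so it comes back different.
restored_owner = {"name": "git-webhook", "uid": str(uuid.uuid4())}

ref = dependent["ownerReferences"][0]
print(ref["uid"] == restored_owner["uid"])  # False: the stored UID is stale
```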
m
Right now we are using the Rancher backup and restore mechanism itself, but it has serious limitations: it's a full-cluster backup and restore, so if we want to selectively restore a namespace from a backup, we can't. If namespace-level restoration were possible, a full cluster restore would become a very rare scenario (datacenter crash). Also, we install our software both in the cloud (AWS and Azure) and on-prem (RKE2), and today we have to deploy two different DR solutions: Velero in the cloud and Rancher backup/restore on-prem, which we want to avoid. To work around the rancher-server restore issue we are using the uninstall script to clean up rancher-server and do a fresh install, but rancher-server bundles a lot of cluster-scoped CRDs, and the uninstall process itself takes more than 10 minutes to clean up all the CRDs, namespaces, and other resources associated with it. We want to avoid that 10-minute delay because DR has to be as fast as possible.
Btw, thank you Brandon, Vincent, and Malcolm Lewis for taking the time to reply on this thread.
@creamy-pencil-82913 can you please let us know the names of the resources I need to look at, where there's a doubt that some fields aren't getting restored by Velero currently?
c
it says in the error:
2022/06/20 16:15:14 [ERROR] error syncing 'git-webhook': handler apiservice: failed to update cattle-system/git-webhook-api-service /v1, Kind=ServiceAccount for apiservice git-webhook: ServiceAccount "git-webhook-api-service" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty, metadata.ownerReferences.name: Invalid value: "": name must not be empty], requeuing
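(A rough post-restore sanity check, not an official tool: assuming you dump resources to JSON with something like `kubectl get serviceaccounts -A -o json`, this Python sketch flags objects whose ownerReferences are missing apiVersion, kind, or name -- the exact fields the error above complains about. The sample data below is made up to mirror the git-webhook-api-service case.)

```python
import json

def broken_owner_refs(items):
    """Return names of objects with an ownerReference missing apiVersion, kind, or name."""
    bad = []
    for obj in items:
        for ref in obj.get("metadata", {}).get("ownerReferences", []):
            if not ref.get("apiVersion") or not ref.get("kind") or not ref.get("name"):
                bad.append(obj["metadata"]["name"])
                break  # one broken ref is enough to flag this object
    return bad

# Made-up sample mimicking a kubectl JSON dump: one broken object, one healthy one.
sample = [
    {"metadata": {"name": "git-webhook-api-service",
                  "ownerReferences": [{"apiVersion": "", "kind": "", "name": ""}]}},
    {"metadata": {"name": "ok-sa",
                  "ownerReferences": [{"apiVersion": "v1", "kind": "ConfigMap", "name": "x"}]}},
]
print(broken_owner_refs(sample))  # ['git-webhook-api-service']
```

With a real dump you'd load it via `json.load(open("sa.json"))` and pass `data["items"]` to the same function.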