# general
b
I wouldn't rely on etcd backup/restores especially when you don't have access to the nodes. Instead I would default to destroying and rebuilding the clusters with manifests stored externally. Like git. Etcd restores are a last resort 'brain transplant'. It's hit or miss as to whether they work because it depends on the state of the cluster at the last snapshot - and its current state. Cattle not pets is the way to go.
g
I would love to work with manifests stored in git as opposed to clicking through a web UI, but Rancher seems to prefer the UI thing. Curious that the documentation recommendations are something not to be relied upon. How would destroy/rebuild work and would this lead to downtimes?
b
Rancher actually has a tool built in that helps you follow this GitOps pattern. It's called 'fleet' and it's under the continuous delivery section of the UI. Cattle not pets is the reason our logo is a cow ;) Destroying and rebuilding would certainly not lead to any more downtime than you are seeing currently with etcd restores. As long as you have a repeatable process, downtime will be minimal. You can also perform this in a rolling manner as opposed to all at once if needed. In the worst case you could even spin up a new env - fail everything over to it - destroy the old env. This is the benefit of infrastructure as code :)
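For reference, pointing fleet at a git repo comes down to a single GitRepo resource in the management cluster. A minimal sketch, where the repo URL, paths and cluster labels are placeholders rather than anything from this thread:
```yaml
# A fleet GitRepo: watch this repo/branch and deploy whatever bundles
# (plain manifests, fleet.yaml files, helm charts) live under these paths.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-apps            # illustrative name
  namespace: fleet-default      # fleet's default workspace for downstream clusters
spec:
  repo: https://github.com/example/cluster-apps   # placeholder repo
  branch: main
  paths:
    - apps/                     # directories fleet scans for bundles
  targets:
    - clusterSelector:
        matchLabels:
          env: prod             # hypothetical label on the target clusters
```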
g
Everything except Rancher-specific stuff is currently set up as infrastructure as code. The cluster is being rebuilt and we might have to manually restore all our PVCs from backups. The thing that bothers me is that despite the failed update the cluster was still happily working before I tried the restore. I didn't expect such a prominent UI feature to be so destructive. I'm not sure how to roll back a failed update at this point
b
You have to log into the node and debug from this point - likely the cluster is down. Yeah, etcd backup restores can very easily break a cluster. As for PVCs, those are just requests - manifests you can store in git as well. The actual data and volumes should exist outside of the cluster and should be backed up appropriately using the correct tool for those resources.
And even the rancher stuff is just crds ;) including clusters
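To illustrate that point: a downstream cluster provisioned by Rancher shows up as a resource roughly like the sketch below. The exact spec depends on the Rancher version and node driver, so treat the fields as an example rather than a template from this thread.
```yaml
# Rancher v2 provisioning represents a downstream cluster as a CRD, so the
# cluster definition itself can live in git and be applied like any manifest.
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: demo-cluster                  # illustrative name
  namespace: fleet-default
spec:
  kubernetesVersion: v1.28.9+rke2r1   # example RKE2 version string
  rkeConfig: {}                       # machine pools, registries, chart values, etc. go here
```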
g
Yes they are backed up. That doesn't mean that losing all PVCs is a problem-free occurrence. Longhorn is apparently unable to restore a PVC that is part of a helm chart in a way that would make helm happy, so there's a lot of manual stuff.
> And even the rancher stuff is just crds ;) including clusters
Yes, the problem is that the documentation doesn't really talk about this, to the best of my knowledge. I've had to search for the source yaml files in the rancher repositories more than once just to figure out which values I should define - and evidently I still managed to screw that up 😅
b
For longhorn - use longhorn backups. PVCs themselves are not needed for longhorn as they are just the request. PVCs are kubernetes manifests. Etcd backups will definitely not back up Longhorn volumes either. So I really hope you're using longhorn backups with longhorn. Longhorn works like this: you create a PVC (request: I need a volume of 5GB to attach to pod XYZ) > Longhorn storage class > longhorn controller creates a new longhorn volume > longhorn creates a PV > PV is mounted to the pod (reference to the volume). As for figuring out what resources rancher creates behind the scenes, you can always look at 'Edit as yaml' on the resources in the UI. Another tip is to open up the dev console (F12) in your browser and watch the network traffic. This will give you a JSON-formatted version of the manifest you can copy to repeat that exact request over and over again.
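For illustration, the 'request' side of that flow is just a small manifest like the one below; the claim name, namespace and size are assumptions based on the 5GB example above.
```yaml
# A PVC asking Longhorn (via its StorageClass) for a 5GB volume. Longhorn's
# controller then creates the volume and the PV that gets bound to this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xyz-data                 # hypothetical claim mounted by pod XYZ
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn     # Longhorn's default StorageClass name
  resources:
    requests:
      storage: 5Gi
```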
g
Thanks for the tips! Yes we have longhorn backups, unfortunately a day old, but better than nothing. When I previously attempted to restore from such backups I ended up in the situation described here: https://github.com/longhorn/longhorn/issues/2862 I'll have to try the suggested steps, if we have to go through this process.
What would be the way to install the rancher versions of helm charts through IaC? From what I could see, the repo is not remote, but is rather referenced as file://path/to/something. For example, the monitoring chart we install through the app catalog, because it integrates metrics with the Rancher UI. Other charts (e.g. logging) we installed outside the app catalog so that we could manage them as IaC
b
Helm charts are IaC. They are just manifests at the end of the day that you can push with fleet. Here are all of our charts: https://github.com/rancher/charts/tree/dev-v2.9/charts
g
Yes but I can't do helm install -n .... for those charts - I would need to add the repo first and reference them. This means I can't have a pipeline managing those installations. Or am I missing something?
b
No, you can helm install a local directory - you don't need a repo - and fleet can actually do that for you: https://fleet.rancher.io/ref-fleet-yaml
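A rough sketch of what that fleet.yaml could look like; the namespace, chart path, release name and values below are assumptions for illustration, not anything taken from this thread.
```yaml
# fleet.yaml placed in a directory watched by a fleet GitRepo. It tells fleet
# to render a helm chart - here a chart checked into the repo itself, but
# helm.repo + helm.chart/version work the same way for remote chart repos.
defaultNamespace: monitoring          # hypothetical target namespace
helm:
  chart: ./chart                      # local chart directory inside the git repo
  releaseName: rancher-monitoring     # illustrative release name
  values:
    # inline values, equivalent to passing a values.yaml to helm
    ingress:
      enabled: false
```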
g
Ok but this means checking out the repo you linked somewhere in my pipeline and then operating on those files, correct?
b
Well you can copy them to a local repo or just ref them directly - either way will work.
g
I wonder if this would be one of the few cases where git submodules could actually work....
b
So I think it's important to understand what fleet is and why I'm bringing it up. Fleet is a CD tool that works with your pipeline (CI)
You should not use your pipeline for deployments - rather, use it for testing merges into a repo and then have a dedicated CD tool like fleet, ArgoCD or flux take over the actual deployment
g
Maybe for context: we have a repo where we collect our charts and use helmfile to manage all this (helmfile is essentially a wrapper around helm that tells helm which charts to install with which values). So if we have to install Airflow we have a helmfile.yaml with:
```yaml
repositories:
- name: apache-airflow
  url: https://airflow.apache.org
releases:
- name: airflow
  namespace: airflow
  installed: true
  chart: apache-airflow/airflow
  version: 1.11.0
  values:
  - values.yaml
```
But I'm now seeing that it does support git repos:
```yaml
repositories:
- name: polaris
  url: git+https://github.com/reactiveops/polaris@deploy/helm?ref=master
```
> You should not use your pipeline for deployments - rather, use it for testing merges into a repo and then have a dedicated CD tool like fleet, ArgoCD or flux take over the actual deployment
Why is that? What would be the advantage of maintaining an extra CD tool?
b
Separation of work. And you aren't maintaining it separately in the case of fleet - it's built into rancher
Doing a helm install from your pipeline is just throwing code over the fence. It's hard to see where it lands or what the end state is. What if someone messes with the cluster after you deployed? Shouldn't git be the source of truth?
g
And how is it different if done through fleet? Can't somebody mess with the deployment anyway?
b
Fleet has auto reconciliation features that will change it back based on what's defined in git
Fleet watches the deployment and assures that the actual state in the cluster matches the desired state declared in git
You can also turn off that feature and just have fleet let you know if it's been manually modified. None of this information is going to be available in your pipeline.
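Depending on the Fleet version, that toggle lives on the GitRepo resource itself. A sketch of the idea is below, with the caveat that the exact field name and availability are assumptions to check against your Fleet release, not something confirmed in this thread.
```yaml
# Same GitRepo shape as earlier, with drift handling made explicit: newer
# fleet releases expose an opt-in switch to revert manual changes; without
# it, fleet still reports the bundle as "modified" instead of reverting it.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-apps
  namespace: fleet-default
spec:
  repo: https://github.com/example/cluster-apps
  branch: main
  paths:
    - apps/
  correctDrift:
    enabled: true        # assumed field: reconcile back to what git declares
```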
g
I'm honestly not 100% sure what happens in the pipeline if the resources have been manually modified
b
I'm positive your pipeline has no clue about the end state of the applications in the cluster (because it doesn't have internal access to the cluster). Like your pipeline isn't gonna know if your deployment pods are in CrashLoopBackOff, or if there was an error applying a resource in the kubernetes cluster. Today you have to manually go look at the actual clusters themselves. The pipeline is just doing a helm install and good luck!
g
That is true, I generally have to open up k9s and see what the pods are doing
b
Also, I assume you're doing the helm installs with a collection of kubeconfigs stored outside of the cluster, which is a security risk. Fleet runs from within the cluster and pulls in.
Having a dedicated deployment tool solves the problems of deployment. A CI tool (pipeline) is used for testing and validation of resources inside of a repo :)
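To make the contrast concrete: the pipeline-side approach means GitLab holds credentials shaped roughly like the kubeconfig below (placeholder names, redacted values), i.e. an API endpoint plus a long-lived token stored outside the cluster.
```yaml
# Minimal kubeconfig as stored in a CI/CD variable: whoever can read this
# variable can talk to the cluster API directly.
apiVersion: v1
kind: Config
clusters:
- name: prod                      # placeholder cluster name
  cluster:
    server: https://prod.example.internal:6443
    certificate-authority-data: <base64 CA>
users:
- name: pipeline-deployer
  user:
    token: <long-lived bearer token>
contexts:
- name: prod
  context:
    cluster: prod
    user: pipeline-deployer
current-context: prod
```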
g
The kubeconfigs are handled by Gitlab CI/CD, so they are stored there. Gitlab is hosted on-premise (but outside of the cluster)
b
Right but you still have to manually create and manage those kubeconfigs and manually put them inside of the external secret store, right?
g
Yes, but it's also a one-time setup thing
I would have to do the same setup, just in the opposite direction, with fleet to be able to pull the repositories anyway - so in terms of setup it doesn't change that much
b
Well technically they expire, but I think they last like 10 years or something crazy. The point is you are storing security tokens (kubeconfigs) outside of the cluster, which is technically additional security overhead. All just so you can deploy applications
Nope fleet looks at the upstream repo. You don't need to store kubeconfigs anywhere for fleet
g
Yes but I need to store the token for accessing the repo
b
Ahh yea that is a good point - if it's a private repo. But your pipelines need to as well
g
Well no, since it runs on the same system. The connection between Gitlab and Kubernetes needs to happen anyway, in one direction or the other
b
Eh? Don't you need to give your runners tokens to access the repos they run against? GitLab might be doing magic in the background to do this automatically. Interesting, but fair point for sure
g
The gitlab agent/runner will take care of checking out the repository it is running against. It also provides access tokens in environment variables if you really need to check out other repositories (which we don't). And I was actually looking at the setup - I believe we don't really store any kubeconfig, because the agent talks to gitlab and the runner is launched as a pod in the cluster, taking care of setting up the access
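For completeness, the agent-based flow described above typically looks something like this in .gitlab-ci.yml. This is a sketch: the image, project path and agent name are placeholders, and the final command stands in for whatever helmfile/kubectl steps the pipeline actually runs.
```yaml
# Deploy job using the GitLab agent for Kubernetes: GitLab injects a
# kubeconfig with one context per authorized agent, so no kubeconfig has to
# be stored as a CI/CD variable.
deploy:
  stage: deploy
  image:
    name: bitnami/kubectl:latest          # illustrative image with kubectl
    entrypoint: [""]
  script:
    - kubectl config use-context my-group/infra:prod-agent   # placeholder <project path>:<agent name>
    - kubectl get nodes                    # stand-in for the real deploy step; helm/helmfile pick up the same KUBECONFIG
```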