# general
b
I wouldn't rely on etcd backup/restores especially when you don't have access to the nodes. Instead I would default to destroying and rebuilding the clusters with manifests stored externally. Like git. Etcd restores are a last resort 'brain transplant'. It's hit or miss as to whether they work because it depends on the state of the cluster at the last snapshot - and its current state. Cattle not pets is the way to go.
g
I would love to work with manifests stored in git as opposed to clicking through a web UI, but Rancher seems to prefer the UI thing. Curious that the documentation recommendations are something not to be relied upon. How would destroy/rebuild work and would this lead to downtimes?
b
Rancher actually has a tool built in that helps you follow this GitOps pattern. It's called 'fleet' and it's under the continuous delivery section of the UI. Cattle not pets is the reason our logo is a cow ;) Destroying and rebuilding would certainly not lead to any more downtime than you are seeing currently with etcd restores. As long as you have a repeatable process, downtime will be minimal. You can also perform this in a rolling manner as opposed to all at once if needed. In the worst case you could even spin up a new env - fail everything over to it - destroy the old env. This is the benefit of infrastructure as code :)
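For reference, pointing fleet at a git repo comes down to a single GitRepo resource in the management cluster. A minimal sketch, where the repo URL, paths and cluster labels are placeholders rather than anything from this thread:
```yaml
# A fleet GitRepo: watch this repo/branch and deploy whatever bundles
# (plain manifests, fleet.yaml files, helm charts) live under these paths.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-apps            # illustrative name
  namespace: fleet-default      # fleet's default workspace for downstream clusters
spec:
  repo: https://github.com/example/cluster-apps   # placeholder repo
  branch: main
  paths:
    - apps/                     # directories fleet scans for bundles
  targets:
    - clusterSelector:
        matchLabels:
          env: prod             # hypothetical label on the target clusters
```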
g
Everything except Rancher-specific stuff is currently set up as infrastructure as code. The cluster is being rebuilt and we might have to manually restore all our PVCs from backups. The thing that bothers me is that despite the failed update the cluster was still happily working before I tried the restore. I didn't expect such a prominent UI feature to be so destructive. I'm not sure how to roll back a failed update at this point
b
You have to log into the node and debug from this point - likely the cluster is down. Yeah, etcd backup restores can very easily break a cluster. As for PVCs, those are just requests - manifests you can store in git as well. The actual data and volumes should exist outside of the cluster and should be backed up appropriately using the correct tool for those resources.
And even the rancher stuff is just crds ;) including clusters
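To illustrate that point: a downstream cluster provisioned by Rancher shows up as a resource roughly like the sketch below. The exact spec depends on the Rancher version and node driver, so treat the fields as an example rather than a template from this thread.
```yaml
# Rancher v2 provisioning represents a downstream cluster as a CRD, so the
# cluster definition itself can live in git and be applied like any manifest.
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: demo-cluster                  # illustrative name
  namespace: fleet-default
spec:
  kubernetesVersion: v1.28.9+rke2r1   # example RKE2 version string
  rkeConfig: {}                       # machine pools, registries, chart values, etc. go here
```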
g
Yes they are backed up. That doesn't mean that losing all PVCs is a problem-free occurrence. Longhorn is apparently unable to restore a PVC that is part of a helm chart in a way that would make helm happy, so there's a lot of manual stuff.
> And even the rancher stuff is just crds ;) including clusters
Yes, the problem is that the documentation doesn't really talk about this, to the best of my knowledge. I've had to search for the source yaml files in the rancher repositories more than once just to figure out which values I should define - and evidently I still managed to screw that up 😅
b
For longhorn - use longhorn backups. PVCs themselves are not needed for longhorn as they are just the request. PVCs are kubernetes manifests. Etcd backups will definitely not back up Longhorn volumes either. So I really hope you're using longhorn backups with longhorn. Longhorn works like this: you create a PVC (request: I need a volume of 5GB to attach to pod XYZ) > Longhorn storage class > longhorn controller creates a new longhorn volume > longhorn creates a PV > PV is mounted to the pod (reference to the volume). As for figuring out what resources rancher creates behind the scenes, you can always look at 'Edit as yaml' on the resources in the UI. Another tip is to open up the dev console (F12) in your browser and watch the network traffic. This will give you a JSON-formatted version of the manifest you can copy to repeat that exact request over and over again.
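For illustration, the 'request' side of that flow is just a small manifest like the one below; the claim name, namespace and size are assumptions based on the 5GB example above.
```yaml
# A PVC asking Longhorn (via its StorageClass) for a 5GB volume. Longhorn's
# controller then creates the volume and the PV that gets bound to this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xyz-data                 # hypothetical claim mounted by pod XYZ
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn     # Longhorn's default StorageClass name
  resources:
    requests:
      storage: 5Gi
```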
g
Thanks for the tips! Yes we have longhorn backups, unfortunately a day old, but better than nothing. When I previously attempted to restore from such backups I ended up in the situation described here: https://github.com/longhorn/longhorn/issues/2862 I'll have to try the suggested steps, if we have to go through this process.
What would be the way to install the rancher versions of helm charts through IaC? From what I could see, the repo is not remote, but is rather referenced as file://path/to/something. For example, the monitoring chart we install through the app catalog, because it integrates metrics with the Rancher UI. Other charts (e.g. logging) we installed outside the app catalog so that we could manage them as IaC
b
Helm charts are IaC. They are just manifests at the end of the day that you can push with fleet. Here are all of our charts: https://github.com/rancher/charts/tree/dev-v2.9/charts
g
Yes but I can't do helm install -n .... for those charts - I would need to add the repo first and reference them. This means I can't have a pipeline managing those installations. Or am I missing something?
b
No, you can helm install a local directory - you don't need a repo - and fleet can actually do that for you: https://fleet.rancher.io/ref-fleet-yaml
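A rough sketch of what that fleet.yaml could look like; the namespace, chart path, release name and values below are assumptions for illustration, not anything taken from this thread.
```yaml
# fleet.yaml placed in a directory watched by a fleet GitRepo. It tells fleet
# to render a helm chart - here a chart checked into the repo itself, but
# helm.repo + helm.chart/version work the same way for remote chart repos.
defaultNamespace: monitoring          # hypothetical target namespace
helm:
  chart: ./chart                      # local chart directory inside the git repo
  releaseName: rancher-monitoring     # illustrative release name
  values:
    # inline values, equivalent to passing a values.yaml to helm
    ingress:
      enabled: false
```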
g
Ok but this means checking out the repo you linked somewhere in my pipeline and then operating on those files, correct?
b
Well you can copy them to a local repo or just ref them directly - either way will work.
g
I wonder if this would be one of the few cases where git submodules could actually work....
b
So I think it's important to understand what fleet is and why I'm bringing it up. Fleet is a CD tool that works with your pipeline (CI)
You should not use your pipeline for deployments - rather, use it for testing merges into a repo and then have a dedicated CD tool like fleet, ArgoCD or flux take over the actual deployment
g
Maybe for context: we have a repo where we collect our charts and use helmfile to manage all this (helmfile is essentially a wrapper around helm that tells helm which charts to install with which values). So if we have to install Airflow we have a helmfile.yaml with:
```yaml
repositories:
- name: apache-airflow
  url: https://airflow.apache.org
releases:
- name: airflow
  namespace: airflow
  installed: true
  chart: apache-airflow/airflow
  version: 1.11.0
  values:
  - values.yaml
```
But I'm now seeing that it does support git repos:
```yaml
repositories:
- name: polaris
  url: git+https://github.com/reactiveops/polaris@deploy/helm?ref=master
```
> You should not use your pipeline for deployments - rather, use it for testing merges into a repo and then have a dedicated CD tool like fleet, ArgoCD or flux take over the actual deployment
Why is that? What would be the advantage of maintaining an extra CD tool?
b
Separation of work. And you aren't maintaining it separately in the case of fleet - it's built into rancher
Doing a helm install from your pipeline is just throwing code over the fence. It's hard to see where it lands or what the end state is. What if someone messes with the cluster after you deployed? Shouldn't git be the source of truth?
g
And how is it different if done through fleet? Can't somebody mess with the deployment anyway?
b
Fleet has auto reconciliation features that will change it back based on what's defined in git
Fleet watches the deployment and assures that the actual state in the cluster matches the desired state declared in git
You can also turn off that feature and just have fleet let you know if it's been manually modified. None of this information is going to be available in your pipeline.
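Depending on the Fleet version, that toggle lives on the GitRepo resource itself. A sketch of the idea is below, with the caveat that the exact field name and availability are assumptions to check against your Fleet release, not something confirmed in this thread.
```yaml
# Same GitRepo shape as earlier, with drift handling made explicit: newer
# fleet releases expose an opt-in switch to revert manual changes; without
# it, fleet still reports the bundle as "modified" instead of reverting it.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-apps
  namespace: fleet-default
spec:
  repo: https://github.com/example/cluster-apps
  branch: main
  paths:
    - apps/
  correctDrift:
    enabled: true        # assumed field: reconcile back to what git declares
```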
g
I'm honestly not 100% sure what happens in the pipeline if the resources have been manually modified
b
I'm positive your pipeline has no clue about the end state of the applications in the cluster (because it doesn't have internal access to the cluster). Like your pipeline isn't gonna know if your deployment pods are in CrashLoopBackOff, or if there was an error applying a resource in the kubernetes cluster. Today you have to manually go look at the actual clusters themselves. The pipeline is just doing a helm install and good luck!
g
That is true, I generally have to open up k9s and see what the pods are doing
b
Also, I assume you're doing the helm installs with a collection of kubeconfigs stored outside of the cluster, which is a security risk. Fleet runs from within the cluster and pulls in.
Having a dedicated deployment tool solves the problems of deployment. A CI tool (pipeline) is used for testing and validation of resources inside of a repo :)
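To make the contrast concrete: the pipeline-side approach means GitLab holds credentials shaped roughly like the kubeconfig below (placeholder names, redacted values), i.e. an API endpoint plus a long-lived token stored outside the cluster.
```yaml
# Minimal kubeconfig as stored in a CI/CD variable: whoever can read this
# variable can talk to the cluster API directly.
apiVersion: v1
kind: Config
clusters:
- name: prod                      # placeholder cluster name
  cluster:
    server: https://prod.example.internal:6443
    certificate-authority-data: <base64 CA>
users:
- name: pipeline-deployer
  user:
    token: <long-lived bearer token>
contexts:
- name: prod
  context:
    cluster: prod
    user: pipeline-deployer
current-context: prod
```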
g
The kubeconfigs are handled by Gitlab CI/CD, so they are stored there. Gitlab is hosted on-premise (but outside of the cluster)
b
Right but you still have to manually create and manage those kubeconfigs and manually put them inside of the external secret store, right?
g
Yes, but it's also a one-time setup thing
I would have to do the same setup, just in the opposite direction, with fleet to be able to pull the repositories anyway - so in terms of setup it doesn't change that much
b
Well technically they expire, but I think they last like 10 years or something crazy. The point is you are storing security tokens (kubeconfigs) outside of the cluster, which is technically additional security overhead. All just so you can deploy applications
Nope fleet looks at the upstream repo. You don't need to store kubeconfigs anywhere for fleet
g
Yes but I need to store the token for accessing the repo
b
Ahh yea that is a good point - if it's a private repo. But your pipelines need to as well
g
Well no, since it runs on the same system. The connection between Gitlab and Kubernetes needs to happen anyway, in one direction or the other
b
Eh? Don't you need to give your runners tokens to access the repos they run against? GitLab might be doing magic in the background to do this automatically. Interesting, but fair point for sure
g
The gitlab agent/runner will take care of checking out the repository it is running against. It also provides access tokens in environment variables if you really need to check out other repositories (which we don't). And I was actually looking at the setup - I believe we don't really store any kubeconfig, because the agent talks to gitlab and the runner is launched as a pod in the cluster, taking care of setting up the access
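For completeness, the agent-based flow described above typically looks something like this in .gitlab-ci.yml. This is a sketch: the image, project path and agent name are placeholders, and the final command stands in for whatever helmfile/kubectl steps the pipeline actually runs.
```yaml
# Deploy job using the GitLab agent for Kubernetes: GitLab injects a
# kubeconfig with one context per authorized agent, so no kubeconfig has to
# be stored as a CI/CD variable.
deploy:
  stage: deploy
  image:
    name: bitnami/kubectl:latest          # illustrative image with kubectl
    entrypoint: [""]
  script:
    - kubectl config use-context my-group/infra:prod-agent   # placeholder <project path>:<agent name>
    - kubectl get nodes                    # stand-in for the real deploy step; helm/helmfile pick up the same KUBECONFIG
```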