# k3s
a
This is in our test environment; I'd like to get this back into a working state so we can properly test the k3s upgrade before trying it in production...
c
upgrading from v1.19 to v1.23.
Are you going directly from 1.19 to 1.23? You should be aware that this is not supported; you must step through each minor version. Go to the latest 1.19, then latest 1.20, then latest 1.21, and so on until you’re on the version you want.
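For illustration, one rung of that ladder on a connected node might look like the sketch below; the version tag is just an example of a 1.20 patch release, and the air-gapped equivalent is sketched further down next to the docs links.

```bash
# Re-run the install script on every node with the next minor's latest patch,
# and let the whole cluster settle on that version before moving to the next.
# INSTALL_K3S_VERSION is the install script's version pin; the tag is an example.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.20.15+k3s1" sh -
```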
The CA data is stored in a key within the kine/sql datastore, and is extracted directly from the database out to files on disk when the node starts. This was significantly more fragile on old releases, though; if you make a mistake on an older release, uninstalling and reinstalling the node in question to make sure things are cleaned up might be a good idea.
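As a rough illustration of where that data lives, something like the query below (run against the external datastore) should show the bootstrap key; the connection string is a placeholder, and the kine table/column names are assumed to match what the GitHub issue referenced later in the thread queries against.

```bash
# List the bootstrap key(s) that k3s keeps in the kine table alongside the
# regular /registry/... keys; the CA and token material is stored under them.
psql "postgres://k3suser:password@db.example.internal:5432/k3s" \
  -c "SELECT name FROM kine WHERE name LIKE '/bootstrap/%';"
```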
a
Oh wow, OK. A similar upgrade worked for us in our non-air-gapped environment; aside from issues with the traefik v1 to v2 upgrade we didn't see any problems, so I guess we got lucky. We'll try the incremental approach going forward. I believe the correct approach is just upgrading one node at a time sequentially? Is it necessary to run the install script each time, or can you just replace the k3s binary and air-gapped-images.tar and restart the k3s service? What's the best method to troubleshoot this? Any suggested commands to run in the database, or things to check on the management nodes? I was following this thread and trying the suggested DB queries, but I get nothing back. I even tried in our working prod cluster and didn't see anything. Sorry, this was set up long before I got here - https://github.com/k3s-io/k3s/issues/4640#issuecomment-987641565
c
you need to bring the whole cluster up to each minor at the same time before moving on to the next one.
a
Sorry, yes I meant just one node at a time until the entire cluster is upgraded to the next minor version, then repeat. I've seen it suggested to run the install script again, but I've also seen documentation that doesn't say to do that. https://docs.k3s.io/upgrades/manual#manually-upgrade-k3s-using-the-binary https://docs.k3s.io/installation/airgap#install-script-method
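For what it's worth, the manual/air-gapped path those docs describe boils down to roughly the sketch below, run one node at a time; the paths are the k3s defaults and the tarball name is whatever your bundle uses.

```bash
# Manual binary upgrade on a single node: stop the service, swap the binary,
# refresh the pre-loaded airgap images, then start it back up.
systemctl stop k3s
cp ./k3s /usr/local/bin/k3s && chmod +x /usr/local/bin/k3s
mkdir -p /var/lib/rancher/k3s/agent/images/
cp ./k3s-airgap-images-amd64.tar /var/lib/rancher/k3s/agent/images/
systemctl start k3s
```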
Since those DB queries aren't working for me, is it possible that k3s isn't even using an external DB and has defaulted back to an embedded datastore? It created the kine table, so at some point it must have done something...
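A quick way to answer that from a server node is to look for the datastore flag in the places the install script writes it; a sketch, assuming the default k3s paths:

```bash
# Look for an external datastore setting in the service unit, its env file,
# or a config file; no match plus a single server usually means embedded SQLite.
grep -R "datastore-endpoint" /etc/systemd/system/k3s.service \
  /etc/systemd/system/k3s.service.env /etc/rancher/k3s/ 2>/dev/null
# The embedded datastore, if used, lives here:
ls -l /var/lib/rancher/k3s/server/db/ 2>/dev/null
```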
c
not if you’ve got multiple servers in your cluster
👍 1
a
All of my original nodes have cycled out, leaving 3 k3s management nodes with the new token and CA. My rancher deployment is in a weird state,
kubectl logs deployment/rancher -n cattle-system
shows a bunch of
Waiting for server to become available: the server has asked for the client to provide credentials
. Is there any way to salvage this?
I tried redeploying rancher via helm with the new k3s.yaml, but the rancher-webhook is also in a bad state with the same error message
c
if you know the original token and the SQL datastore connection string, you should be able to just start over with those same settings.
just uninstall k3s, and reinstall with the same token and datastore
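A minimal sketch of that start-over path on one server; the token and connection string are placeholders, and in an air-gapped setup you would re-run your offline install method instead of piping from get.k3s.io.

```bash
# Wipe the node, then reinstall pointing at the same datastore with the same token.
/usr/local/bin/k3s-uninstall.sh
curl -sfL https://get.k3s.io | sh -s - server \
  --token "ORIGINAL_TOKEN" \
  --datastore-endpoint "postgres://k3suser:password@db.example.internal:5432/k3s"
```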
a
OK, thank you. That at least got k3s back into a good state. rancher-webhook is now saying
x509: certificate signed by unknown authority
. I tried redeploying the rancher-webhook deployment with
kubectl rollout restart deployment rancher-webhook -n cattle-system
but it gives the same error.
we do use a private CA for ingress but not for the cattle-webhook-ca
c
if you have pods running that started up while things were confused, you might need to delete and recreate them in order to get them running with the correct CAs and such.
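One blunt way to do that for the namespaces mentioned in this thread is sketched below; the pods' owning Deployments and DaemonSets recreate them with fresh serviceaccount credentials.

```bash
# Recreate everything that came up while the CAs were inconsistent.
kubectl delete pod --all -n cattle-system
kubectl delete pod --all -n kube-system
```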
a
Yeah, that's what I'm thinking. I just found this issue that seems very similar, same minor version of k3s and everything, although my kubectl commands work fine - https://github.com/k3s-io/k3s/issues/2914
I noticed the local-path-provisioner in kube-system is in CrashLoopBackOff with the same kind of error
Killing that pod didn't help; same error. There must be something somewhere with a different cert. That would explain why nodes joining k3s without specifying the token get a different token and CA in the k3s.yaml.
If there's documentation that describes the internals I'm happy to read that over instead of bugging you
same thing with coredns in kube-system
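For reference, the serving-cert reset that the linked issue describes comes down to roughly this sketch; the secret name and paths are the k3s defaults.

```bash
# Drop the dynamic serving cert so k3s regenerates it from the current CA:
# delete the k3s-serving secret while the server is still up, then stop the
# service, remove the cached cert file, and start it again.
kubectl -n kube-system delete secret k3s-serving
systemctl stop k3s
rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json
systemctl start k3s
```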
I've gone through all of the steps you mentioned in that issue above and am in pretty much the same place. Down to 1 node, tried deleting all the pods, dynamic-cert.json, etc.... I'm still getting unknown CA errors in all of my pods. If I were to build a new k3s cluster, is it possible to import my existing Rancher cluster into the new k3s cluster?
c
not easily. But you're several years behind on Kubernetes minor versions; if you're up for a rebuild, it might be less time-consuming than continuing to troubleshoot and stepping through minor version upgrades.
a
Yeah, the only reason I'm trying so hard to salvage this is in case I run into the same issue when upgrading our production k3s; if that goes down we will have lots of unhappy customers. We ran into an issue trying this upgrade about a year ago, but we're severely understaffed, so we put it off until we had time to dig into it again... and here we are after fat-fingering a token lol. Ideally I'd like to understand what is actually going on here and why one bad node completely broke the cluster. But if I can just build a new, up-to-date k3s cluster and import my existing RKE cluster into it, then that's fine.
If you have any suggestions or docs you can point me to I can take it from there. I really appreciate all the help
c
old versions like you’re using had very weak protections around ensuring the consistency of the token used alongside the external sql datastore. If you pointed multiple nodes at the same datastore, with different tokens, they could all come up and start managing serviceaccounts and such with different CA certificates. If you wanted, you could probably go through and delete/recreate all the service accounts now that you’ve fixed the servers to have the same token and CAs.
Newer releases will notice that the data in the datastore wasn’t encrypted with the same token, and will fail to start… instead of hosing the cluster.
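If you go the delete-and-recreate route, the sketch below is one way to force the serviceaccount credentials themselves to be re-minted; pre-1.24 these are long-lived token Secrets that the token controller recreates automatically.

```bash
# Delete the serviceaccount token secrets in the affected namespaces so they
# get re-issued against the current CA and signing key, then bounce the pods
# so they mount the new tokens.
for ns in kube-system cattle-system; do
  kubectl -n "$ns" get secret \
    --field-selector type=kubernetes.io/service-account-token -o name \
    | xargs -r -n1 kubectl -n "$ns" delete
  kubectl -n "$ns" delete pod --all
done
```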
👍 1
a
Well, I tried restoring my RDS postgres db snapshot and started up a brand new node, installing k3s fresh with the old token, and I'm still seeing all kinds of issues. kube-system pods won't start on the new node and
journalctl -u k3s
is showing "Unable to authenticate the request due to an error: [invalid bearer token, square/go-jose: error in cryptographic primitive]". I've tried deleting and recreating all of the deployments in the manifests directory but they never come up, complaining about missing secrets. I even deleted the coredns secret which I thought should be automatically recreated but it wasn't. About the only thing that works is
kubectl get nodes
and other similar commands. Guess I just need to start fresh and try and import my currently running Rancher RKE cluster
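One quick check worth noting here: the bearer-token error itself points at the serviceaccount signing key, but comparing the CA the restored server is using against the CA baked into an existing serviceaccount secret is the fastest way to confirm the restored bootstrap data no longer matches what's in the cluster. A sketch, assuming the default k3s paths and the pre-1.24 default-token secrets:

```bash
# CA the running server is using
openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -fingerprint
# CA embedded in an existing serviceaccount token secret
SECRET=$(kubectl -n kube-system get secret -o name | grep default-token | head -n1)
kubectl -n kube-system get "$SECRET" -o jsonpath='{.data.ca\.crt}' \
  | base64 -d | openssl x509 -noout -fingerprint
```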
Actually, it's fixed! I tried one last time after restoring the database, this time completely removing the
--token
option from the server command, since it wasn't necessary in v1.19 and we didn't use it in the first place. It definitely was the right token, but for whatever reason I guess v1.19 doesn't like it if you didn't use it previously?
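So the invocation that ended up working is essentially the sketch below (or your offline install equivalent): no --token, just the original datastore connection, letting v1.19 derive its bootstrap key the way the cluster was originally set up. The connection string is a placeholder.

```bash
# Final working shape of the server install: same datastore, no --token,
# matching how the cluster was first created on v1.19.
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint "postgres://k3suser:password@db.example.internal:5432/k3s"
```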