# k3s
a
This is in our test environment; I'd like to get this back into a working state so we can properly test the k3s upgrade before trying it in production...
c
upgrading from v1.19 to v1.23.
Are you going directly from 1.19 to 1.23? You should be aware that this is not supported; you must step through each minor version. Go to the latest 1.19, then latest 1.20, then latest 1.21, and so on until you’re on the version you want.
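For illustration, one rung of that ladder on a connected node might look like the sketch below; the version tag is just an example of a 1.20 patch release, and the air-gapped equivalent is sketched further down next to the docs links.

```bash
# Re-run the install script on every node with the next minor's latest patch,
# and let the whole cluster settle on that version before moving to the next.
# INSTALL_K3S_VERSION is the install script's version pin; the tag is an example.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.20.15+k3s1" sh -
```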
The CA data is stored in a key within the kine/sql datastore, and is extracted directly from the database out to files on disk when the node starts. This was significantly more fragile on old releases, though; if you make a mistake on an older release, uninstalling and reinstalling the node in question to make sure things are cleaned up might be a good idea.
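As a rough illustration of where that data lives, something like the query below (run against the external datastore) should show the bootstrap key; the connection string is a placeholder, and the kine table/column names are assumed to match what the GitHub issue referenced later in the thread queries against.

```bash
# List the bootstrap key(s) that k3s keeps in the kine table alongside the
# regular /registry/... keys; the CA and token material is stored under them.
psql "postgres://k3suser:password@db.example.internal:5432/k3s" \
  -c "SELECT name FROM kine WHERE name LIKE '/bootstrap/%';"
```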
a
Oh wow, OK. A similar upgrade worked for us in our non-air-gapped environment; aside from issues with the traefik v1 to v2 upgrade we didn't see any problems, so I guess we got lucky. We'll try the incremental approach going forward. I believe the correct approach is just upgrading one node at a time sequentially? Is it necessary to run the install script each time, or can you just replace the k3s binary and air-gapped-images.tar and restart the k3s service? What's the best method to troubleshoot this? Any suggested commands to run in the database, or things to check on the management nodes? I was following this thread and trying the suggested DB queries, but I get nothing back. I even tried in our working prod cluster and didn't see anything. Sorry, this was set up long before I got here - https://github.com/k3s-io/k3s/issues/4640#issuecomment-987641565
c
you need to bring the whole cluster up to each minor at the same time before moving on to the next one.
a
Sorry, yes I meant just one node at a time until the entire cluster is upgraded to the next minor version, then repeat. I've seen it suggested to run the install script again, but I've also seen documentation that doesn't say to do that. https://docs.k3s.io/upgrades/manual#manually-upgrade-k3s-using-the-binary https://docs.k3s.io/installation/airgap#install-script-method
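For what it's worth, the manual/air-gapped path those docs describe boils down to roughly the sketch below, run one node at a time; the paths are the k3s defaults and the tarball name is whatever your bundle uses.

```bash
# Manual binary upgrade on a single node: stop the service, swap the binary,
# refresh the pre-loaded airgap images, then start it back up.
systemctl stop k3s
cp ./k3s /usr/local/bin/k3s && chmod +x /usr/local/bin/k3s
mkdir -p /var/lib/rancher/k3s/agent/images/
cp ./k3s-airgap-images-amd64.tar /var/lib/rancher/k3s/agent/images/
systemctl start k3s
```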
Since those DB queries aren't working for me, is it possible that k3s isn't even using an external DB and has defaulted back to an embedded datastore? It created the kine table, so at some point it must have done something...
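A quick way to answer that from a server node is to look for the datastore flag in the places the install script writes it; a sketch, assuming the default k3s paths:

```bash
# Look for an external datastore setting in the service unit, its env file,
# or a config file; no match plus a single server usually means embedded SQLite.
grep -R "datastore-endpoint" /etc/systemd/system/k3s.service \
  /etc/systemd/system/k3s.service.env /etc/rancher/k3s/ 2>/dev/null
# The embedded datastore, if used, lives here:
ls -l /var/lib/rancher/k3s/server/db/ 2>/dev/null
```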
c
not if you’ve got multiple servers in your cluster
👍 1
a
All of my original nodes have cycled out, leaving 3 k3s management nodes with the new token and CA. My rancher deployment is in a weird state,
kubectl logs deployment/rancher -n cattle-system
shows a bunch of
Waiting for server to become available: the server has asked for the client to provide credentials
. Is there any way to salvage this?
I tried redeploying rancher via helm with the new k3s.yaml, but the rancher-webhook is also in a bad state with the same error message
c
if you know the original token and the SQL datastore connection string, you should be able to just start over with those same settings.
just uninstall k3s, and reinstall with the same token and datastore
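A minimal sketch of that start-over path on one server; the token and connection string are placeholders, and in an air-gapped setup you would re-run your offline install method instead of piping from get.k3s.io.

```bash
# Wipe the node, then reinstall pointing at the same datastore with the same token.
/usr/local/bin/k3s-uninstall.sh
curl -sfL https://get.k3s.io | sh -s - server \
  --token "ORIGINAL_TOKEN" \
  --datastore-endpoint "postgres://k3suser:password@db.example.internal:5432/k3s"
```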
a
OK, thank you. That at least got k3s back into a good state. rancher-webhook is now saying
x509: certificate signed by unknown authority
. I tried redeploying the rancher-webhook deployment with
kubectl rollout restart deployment rancher-webhook -n cattle-system
but it gives the same error.
we do use a private CA for ingress but not for the cattle-webhook-ca
c
if you have pods running that started up while things were confused, you might need to delete and recreate them in order to get them running with the correct CAs and such.
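One blunt way to do that for the namespaces mentioned in this thread is sketched below; the pods' owning Deployments and DaemonSets recreate them with fresh serviceaccount credentials.

```bash
# Recreate everything that came up while the CAs were inconsistent.
kubectl delete pod --all -n cattle-system
kubectl delete pod --all -n kube-system
```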
a
Yeah, that's what I'm thinking. I just found this issue that seems very similar, same minor version of k3s and everything, although my kubectl commands work fine - https://github.com/k3s-io/k3s/issues/2914
I noticed the local-path-provisioner in kube-system is in CrashLoopBackOff with the same kind of error
Killing that pod didn't help; same error. There must be something somewhere with a different cert. That would explain why nodes joining k3s without specifying the token get a different token and CA in the k3s.yaml.
If there's documentation that describes the internals I'm happy to read that over instead of bugging you
same thing with coredns in kube-system
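For reference, the serving-cert reset that the linked issue describes comes down to roughly this sketch; the secret name and paths are the k3s defaults.

```bash
# Drop the dynamic serving cert so k3s regenerates it from the current CA:
# delete the k3s-serving secret while the server is still up, then stop the
# service, remove the cached cert file, and start it again.
kubectl -n kube-system delete secret k3s-serving
systemctl stop k3s
rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json
systemctl start k3s
```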
I've gone through all of the steps you mentioned in that issue above and am in pretty much the same place. Down to 1 node, tried deleting all the pods, dynamic-cert.json, etc.... I'm still getting unknown CA errors in all of my pods. If I were to build a new k3s cluster, is it possible to import my existing Rancher cluster into the new k3s cluster?
c
not easily. But you're several years behind on Kubernetes minor versions; if you're up for a rebuild, it might be less time-consuming than continuing to troubleshoot and stepping through minor version upgrades.
a
Yeah, the only reason I'm trying so hard to salvage this is in case I run into the same issue when upgrading our production k3s; if that goes down we will have lots of unhappy customers. We ran into an issue trying this upgrade about a year ago, but we're severely understaffed, so we put it off until we had time to dig into it again... and here we are after fat-fingering a token lol. Ideally I'd like to understand what is actually going on here and why one bad node completely broke the cluster. But if I can just build a new, up-to-date k3s cluster and import my existing RKE cluster into it, then that's fine.
If you have any suggestions or docs you can point me to I can take it from there. I really appreciate all the help
c
old versions like you’re using had very weak protections around ensuring the consistency of the token used alongside the external sql datastore. If you pointed multiple nodes at the same datastore, with different tokens, they could all come up and start managing serviceaccounts and such with different CA certificates. If you wanted, you could probably go through and delete/recreate all the service accounts now that you’ve fixed the servers to have the same token and CAs.
Newer releases will notice that the data in the datastore wasn’t encrypted with the same token, and will fail to start… instead of hosing the cluster.
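If you go the delete-and-recreate route, the sketch below is one way to force the serviceaccount credentials themselves to be re-minted; pre-1.24 these are long-lived token Secrets that the token controller recreates automatically.

```bash
# Delete the serviceaccount token secrets in the affected namespaces so they
# get re-issued against the current CA and signing key, then bounce the pods
# so they mount the new tokens.
for ns in kube-system cattle-system; do
  kubectl -n "$ns" get secret \
    --field-selector type=kubernetes.io/service-account-token -o name \
    | xargs -r -n1 kubectl -n "$ns" delete
  kubectl -n "$ns" delete pod --all
done
```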
👍 1
a
Well, I tried restoring my RDS postgres db snapshot and started up a brand new node, installing k3s fresh with the old token, and I'm still seeing all kinds of issues. kube-system pods won't start on the new node and
journalctl -u k3s
is showing "Unable to authenticate the request due to an error: [invalid bearer token, square/go-jose: error in cryptographic primitive]". I've tried deleting and recreating all of the deployments in the manifests directory but they never come up, complaining about missing secrets. I even deleted the coredns secret which I thought should be automatically recreated but it wasn't. About the only thing that works is
kubectl get nodes
and other similar commands. Guess I just need to start fresh and try and import my currently running Rancher RKE cluster
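One quick check worth noting here: the bearer-token error itself points at the serviceaccount signing key, but comparing the CA the restored server is using against the CA baked into an existing serviceaccount secret is the fastest way to confirm the restored bootstrap data no longer matches what's in the cluster. A sketch, assuming the default k3s paths and the pre-1.24 default-token secrets:

```bash
# CA the running server is using
openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -fingerprint
# CA embedded in an existing serviceaccount token secret
SECRET=$(kubectl -n kube-system get secret -o name | grep default-token | head -n1)
kubectl -n kube-system get "$SECRET" -o jsonpath='{.data.ca\.crt}' \
  | base64 -d | openssl x509 -noout -fingerprint
```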
Actually, it's fixed! I tried one last time after restoring the database, this time completely removing the
--token
option from the server command, since it wasn't necessary in v1.19 and we didn't use it in the first place. It definitely was the right token, but for whatever reason I guess v1.19 doesn't like it if you didn't use it previously?
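So the invocation that ended up working is essentially the sketch below (or your offline install equivalent): no --token, just the original datastore connection, letting v1.19 derive its bootstrap key the way the cluster was originally set up. The connection string is a placeholder.

```bash
# Final working shape of the server install: same datastore, no --token,
# matching how the cluster was first created on v1.19.
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint "postgres://k3suser:password@db.example.internal:5432/k3s"
```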