# harvester
b
It was my birthday, so I yolo'd it https://github.com/harvester/upgrade-helpers/pull/15
I found and added some additional test cases for known issues:
1. A check for the missing label on the local-kubeconfig secret, which causes the upgrade to stall after the first node. That one is already in the PR.
2. A cert check because of this bug. This isn't in the PR yet.
I'm slightly concerned that there are PRs coming up on the one-year mark that don't seem to have any interaction from anyone at SUSE, let alone the previous contributors or repo maintainers.
r
Thank you for submitting the PR 👍 I’ll check this out.
b
@red-king-19196 I think I got all of your concerns addressed, but if there's something I'm missing and it's easier to chat here, let me know. 🙂
r
I missed your message here 😅 I’ve replied on the PR page. If you have further questions we can have discussion here.
b
Hey!
So you're kinda right, but also kinda wrong?
```
cf8:~ # ./check.sh 
==============================

Starting Host check... 
W1021 15:14:30.188424   70976 loader.go:222] Config not found: /etc/rancher/rke2/rke2.yaml
E1021 15:14:30.189691   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.190114   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.192123   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.192835   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.194542   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
This script is intended to be run from one of the Harvester cluster's Control Plane nodes.
It seems like you're running this script from cf8 and not one of the following nodes:

Do you want to contiue running this script even though it seems like you're not on a node? [Y/N] y
Host Test: Skipped

==============================

Starting Certificates check...
find: '/var/lib/rancher/rke2/server/tls/': No such file or directory
```
The script, while on a worker node, does bail out, but not where you thought it would.
I think it might make more sense to just get rid of the `-e` and use some additional try loops for error handling? idk. What do you think?
r
Aha I got fooled by the pipes 🙃 So this:
```bash
cp_nodes=$(kubectl get nodes |grep control-plane |awk '{ print $1 }')
```
does not break, because the last command in the pipe (`awk`) returns 0 even if nothing comes in through the pipe. But this:
```bash
prom_ip=$(kubectl get services/rancher-monitoring-prometheus -n cattle-monitoring-system -o yaml | yq -e '.spec.clusterIP')
```
does break, because the last command, `yq`, got the `-e` flag, which returns a non-zero exit code if the specified key doesn't exist.
I prefer leaving the `-e` at the top of the script as is. Otherwise, we'll have to handle all the possible errors that might be encountered during the execution and show them to the user at the end. This helper script gives users confidence that they can initiate the cluster upgrade. If something needs to be put in a retry loop, that implies cluster instability. The script should point that out immediately.
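A tiny standalone illustration of the difference, not from the script itself (it just assumes the same `yq` the script already uses is on the PATH):
```bash
#!/usr/bin/env bash
set -e

# The pipeline's status is the last command's (awk exits 0), so grep
# matching nothing does not trip `set -e`; execution continues.
nodes=$(echo "not a node line" | grep control-plane | awk '{ print $1 }')
echo "still running, nodes='${nodes}'"

# yq -e is the last command in its pipe and exits non-zero when the key
# is missing, so `set -e` aborts the script on this line.
ip=$(echo '{}' | yq -e '.spec.clusterIP')
echo "never reached"
```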
b
I wasn't really thinking of a retry loop, more like a try/catch with immediate failure for the test. The old scripts could still fail in ways where they wouldn't actually tell you which parts were failing. Having worked on the support end for these types of scripts, I think it might be useful to have the script run to completion, with the sections that fail called out really obviously in the log and in the end summary, versus potentially seeing a bunch of random text and the script just bailing out with no message or statement.
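Roughly what I mean, as a sketch only (the `run_check` / `check_*` names are placeholders I made up, not from the real script):
```bash
#!/usr/bin/env bash
# Sketch only: record failing sections instead of dying mid-run.
# check_host / check_certs stand in for the script's real checks;
# each should return non-zero on failure.
check_host()  { true; }
check_certs() { true; }

failures=()

run_check() {
  local name="$1"; shift
  echo "=============================="
  echo "Starting ${name} check..."
  if "$@"; then
    echo "${name} Test: Passed"
  else
    echo "${name} Test: Failed"
    failures+=("${name}")
  fi
}

run_check "Host" check_host
run_check "Certificates" check_certs

echo "=============================="
if [ "${#failures[@]}" -gt 0 ]; then
  echo "Failed sections: ${failures[*]}"
  exit 1
fi
echo "All checks passed."
```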
Buuuuuut... that being said, I'm not the one supporting this 😅
Y'all are, so I'm happy to leave it more like this if you'd prefer.
It DOES make me think that we should probably add another test to make sure `kubectl` can talk to the API and bail out early if it can't (i.e. it's running on a worker node, or something went super wrong with the kube-api).
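Maybe something as simple as this near the top of the script (just a sketch on my end, message wording made up):
```bash
# Sketch: bail out early if kubectl can't reach the API server at all
# (e.g. run on a worker node where /etc/rancher/rke2/rke2.yaml doesn't exist).
if ! kubectl get --raw /readyz >/dev/null 2>&1; then
  echo "kubectl can't reach the Kubernetes API from this host."
  echo "Please run this script from one of the Harvester cluster's control plane nodes."
  exit 1
fi
```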
👍 1
r
Yeah, I agree. The PR makes excellent improvements in error handling and user experience. We appreciate your contribution. I think we can polish the script in future PRs. For now, I’m happy to give an LGTM if there are no major issues 🙂
Adding a check to ensure the script can talk to the API server in the first place is a good idea.
b
For now I think we can add a check/exception for when `kubectl` doesn't get a list of nodes back from that command, and have it fail and bail out there.
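Something along these lines, reusing the command that's already in the script (sketch only, untested):
```bash
# Sketch: fail fast if the control plane node list comes back empty,
# which also covers kubectl not being able to reach the API at all.
cp_nodes=$(kubectl get nodes 2>/dev/null | grep control-plane | awk '{ print $1 }')
if [ -z "${cp_nodes}" ]; then
  echo "Could not get a list of control plane nodes from kubectl; bailing out."
  exit 1
fi
```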
👍 1