# harvester
b
It was my birthday, so I yolo'd it https://github.com/harvester/upgrade-helpers/pull/15
I found and added some additional test cases for known issues:
1. A check for the missing label on the local-kubeconfig secret, which causes the upgrade to stall after the first node. That one is already in the PR.
2. A cert check because of this bug. This isn't in the PR yet.
I'm slightly concerned that there are PRs coming up on the one-year mark that don't seem to have any interaction from anyone at SUSE, let alone the previous contributors or repo maintainers.
r
Thank you for submitting the PR 👍 I’ll check this out.
b
@red-king-19196 I think I got all of your concerns addressed, but if there's something I'm missing and it's easier to chat here, let me know. 🙂
r
I missed your message here 😅 I’ve replied on the PR page. If you have further questions we can have discussion here.
b
Hey!
So you're kinda right, but also kinda wrong?
```
cf8:~ # ./check.sh 
==============================

Starting Host check... 
W1021 15:14:30.188424   70976 loader.go:222] Config not found: /etc/rancher/rke2/rke2.yaml
E1021 15:14:30.189691   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.190114   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.192123   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.192835   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E1021 15:14:30.194542   70976 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
This script is intended to be run from one of the Harvester cluster's Control Plane nodes.
It seems like you're running this script from cf8 and not one of the following nodes:

Do you want to contiue running this script even though it seems like you're not on a node? [Y/N] y
Host Test: Skipped

==============================

Starting Certificates check...
find: '/var/lib/rancher/rke2/server/tls/': No such file or directory
```
The script, while on a worker node, does bail out, but not where you thought it would.
I think it might make more sense to just get rid of the `-e` and use some additional try loops for error handling? idk. What do you think?
r
Aha I got fooled by the pipes 🙃 So this:
```bash
cp_nodes=$(kubectl get nodes |grep control-plane |awk '{ print $1 }')
```
does not break, because the last command in the pipe (`awk`) returns 0 even if nothing comes in through the pipe. But this:
```bash
prom_ip=$(kubectl get services/rancher-monitoring-prometheus -n cattle-monitoring-system -o yaml | yq -e '.spec.clusterIP')
```
does break, because the last command, `yq`, got the `-e` flag, which returns a non-zero exit code if the specified key doesn't exist.
I prefer leaving the `-e` at the top of the script as is. Otherwise, we'll have to handle all the possible errors that might be encountered during the execution and show them to the user at the end. This helper script gives users confidence that they can initiate the cluster upgrade. If something needs to be put in a retry loop, that implies cluster instability. The script should point that out immediately.
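A tiny standalone illustration of the difference, not from the script itself (it just assumes the same `yq` the script already uses is on the PATH):
```bash
#!/usr/bin/env bash
set -e

# The pipeline's status is the last command's (awk exits 0), so grep
# matching nothing does not trip `set -e`; execution continues.
nodes=$(echo "not a node line" | grep control-plane | awk '{ print $1 }')
echo "still running, nodes='${nodes}'"

# yq -e is the last command in its pipe and exits non-zero when the key
# is missing, so `set -e` aborts the script on this line.
ip=$(echo '{}' | yq -e '.spec.clusterIP')
echo "never reached"
```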
b
I wasn't really thinking of a retry loop, more like a try/catch with immediate failure for the test. The old scripts could still fail in ways where they wouldn't actually tell you which parts were failing. Having worked on the support end for these types of scripts, I think it might be useful to have the script run to completion, with the sections that fail called out really obviously in the log and in the end summary, versus potentially seeing a bunch of random text and the script just bailing out with no message or statement.
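Roughly what I mean, as a sketch only (the `run_check` / `check_*` names are placeholders I made up, not from the real script):
```bash
#!/usr/bin/env bash
# Sketch only: record failing sections instead of dying mid-run.
# check_host / check_certs stand in for the script's real checks;
# each should return non-zero on failure.
check_host()  { true; }
check_certs() { true; }

failures=()

run_check() {
  local name="$1"; shift
  echo "=============================="
  echo "Starting ${name} check..."
  if "$@"; then
    echo "${name} Test: Passed"
  else
    echo "${name} Test: Failed"
    failures+=("${name}")
  fi
}

run_check "Host" check_host
run_check "Certificates" check_certs

echo "=============================="
if [ "${#failures[@]}" -gt 0 ]; then
  echo "Failed sections: ${failures[*]}"
  exit 1
fi
echo "All checks passed."
```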
Buuuuuut... that being said, I'm not the one supporting this 😅
Y'all are, so I'm happy to leave it more like this if you'd prefer.
It DOES make me think that we should probably add another test to make sure `kubectl` can talk to the API and bail out early if it can't (i.e. it's running on a worker node, or something went super wrong with the kube-api).
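Maybe something as simple as this near the top of the script (just a sketch on my end, message wording made up):
```bash
# Sketch: bail out early if kubectl can't reach the API server at all
# (e.g. run on a worker node where /etc/rancher/rke2/rke2.yaml doesn't exist).
if ! kubectl get --raw /readyz >/dev/null 2>&1; then
  echo "kubectl can't reach the Kubernetes API from this host."
  echo "Please run this script from one of the Harvester cluster's control plane nodes."
  exit 1
fi
```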
👍 1
r
Yeah, I agree. The PR makes excellent improvements in error handling and user experience. We appreciate your contribution. I think we can polish the script in future PRs. For now, I’m happy to give an LGTM if there are no major issues 🙂
Adding a check to ensure the script can talk to the API server in the first place is a good idea.
b
For now I think we can add a check/exception for when `kubectl` doesn't get a list of nodes back from that command, and have it fail and bail out there.
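Something along these lines, reusing the command that's already in the script (sketch only, untested):
```bash
# Sketch: fail fast if the control plane node list comes back empty,
# which also covers kubectl not being able to reach the API at all.
cp_nodes=$(kubectl get nodes 2>/dev/null | grep control-plane | awk '{ print $1 }')
if [ -z "${cp_nodes}" ]; then
  echo "Could not get a list of control plane nodes from kubectl; bailing out."
  exit 1
fi
```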
👍 1