# harvester
a
This message was deleted.
c
I think you could try to extend the volume size. BTW, how large is your disk? Some information: https://github.com/kubernetes/kubernetes/issues/71869
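For reference, a rough sketch of expanding a PVC, assuming its StorageClass has allowVolumeExpansion: true (the PVC name, namespace, and size below are placeholders, not values from this thread):
Copy code
# ask Kubernetes/Longhorn to grow the volume by raising the requested size
kubectl -n default patch pvc my-pvc --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'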
q
so i ended up tracking it down to a ton of support bundles being stored in /usr/local/.state/var-log.bind/
i moved them to another disk and rebooted the node, and it came back online. any idea why this dir is filling up with bundles?
Copy code
-rw-------   1 root  root                  33 Jun 28 16:05 scc_supportconfig_harvester-02_faf36df2-e2a8-4c08-92e2-2b341c287278.txz.md5
-rw-------   1 root  root            16195552 Jun 28 13:45 scc_supportconfig_harvester-02_fc975c22-2bdd-4ab9-a403-560057516f56.txz
-rw-------   1 root  root                  33 Jun 28 13:45 scc_supportconfig_harvester-02_fc975c22-2bdd-4ab9-a403-560057516f56.txz.md5
-rw-------   1 root  root            16181484 Jun 28 13:31 scc_supportconfig_harvester-02_ff0e5197-d223-44f5-9189-e54fa9d679f9.txz
-rw-------   1 root  root                  33 Jun 28 13:31 scc_supportconfig_harvester-02_ff0e5197-d223-44f5-9189-e54fa9d679f9.txz.md5
-rw-------   1 root  root            16527476 Jun 28 17:37 scc_supportconfig_harvester-02_ffa512f9-5929-4e0b-bbd9-66b4c8fd6172.txz
-rw-------   1 root  root                  33 Jun 28 17:37 scc_supportconfig_harvester-02_ffa512f9-5929-4e0b-bbd9-66b4c8fd6172.txz.md5
since clearing it out this morning, i already have probably 50+ in there
c
I'm guessing it's not directly related to this one. May I know how large your disk is? And could you provide the support bundle by following this doc https://docs.harvesterhci.io/v0.3/troubleshooting/harvester/?
s
Hi @quaint-alarm-7893, could you provide the output of the following command for investigation?
Copy code
$ kubectl get nodes.longhorn.io -n longhorn-system -o yaml
Also, I am curious why the supportconfig is still here. Do you have a running support bundle collector?
q
@clean-cpu-90380 and @salmon-city-57654, the disk space consumption was definitely from the support bundles in that folder. this cluster had been running for 230-ish days. each node at this point is holding about 102G of 203G (provisioned by harvester at install) on the /usr/local folder. all these files were in the /usr/local/.state/var-log.bind/ folder and filling up the disk to 93%, which seemed to cause the disk pressure. i have a support bundle, but it's from several days after i cleaned out all the nodes.
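For reference, a quick way to check where the space is going is something like this (same paths as above):
Copy code
df -h /usr/local
du -sh /usr/local/.state/var-log.bind/
ls /usr/local/.state/var-log.bind/ | grep -c scc_supportconfig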
re: your question on a support bundle collector. how can i check?
i didn't intentionally set anything up though, if that's something i would have had to configure.
s
Hi @quaint-alarm-7893, I can do more checks with the support bundle. I will update you, maybe today or tomorrow. Thanks!
q
@salmon-city-57654 sounds good. thanks!
c
@quaint-alarm-7893 I noticed most replicas on harvester-03 are on disk 0ab2811e-35ba-4460-9d7f-e4254243e191 (/var/lib/harvester/extra-disks/52bbeb3f093c421d7312970052f3ff8f), while the other disk, 5f01bb7e-a10a-4201-9eba-d5639f17cfd8 (/var/lib/harvester/extra-disks/70b3d41d058a1ae36a29f5c34c7fe6bc), has just a few replicas. I think you could move some replicas to that other disk. You could try marking the crowded disk unschedulable, then delete some replicas once the volumes are healthy, and mark it schedulable again afterwards. @salmon-city-57654, please feel free to add more explanation based on your advice. Thank you.
👍 1
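As a rough sketch of the unschedulable step via the Longhorn node CR (it can also be done from the Longhorn UI; <disk-name> below is a placeholder, use the actual key under spec.disks):
Copy code
# inspect the disks and their scheduling state on harvester-03
kubectl -n longhorn-system get nodes.longhorn.io harvester-03 -o yaml
# mark one disk unschedulable (JSON merge patch keeps the other disk entries intact)
kubectl -n longhorn-system patch nodes.longhorn.io harvester-03 --type merge \
  -p '{"spec":{"disks":{"<disk-name>":{"allowScheduling":false}}}}'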
q
@clean-cpu-90380 i can rebalance some of the pvcs on harvester-03, for sure. i'll do that, thanks. but just to be clear, it wasn't an issue on the extra disks, it was all on the /usr/local mount, which is one of the install / system partitions created by harvester when it was initially set up. that disk was full of support bundles in the dir mentioned earlier in this thread.
s
Hi @quaint-alarm-7893, could you check on your cluster with the following command?
Copy code
kubectl get pods -n harvester-system |grep supportbundle
I think what @clean-cpu-90380 pointed out is one of the DiskPressure issues from your SB. So you still need to rebalance some volumes on harvester-03 to resolve it.
q
Copy code
supportbundle-agent-bundle-bful3-6s289                 1/1     Running            4632 (8m16s ago)   98d
supportbundle-agent-bundle-bful3-9qdjj                 1/1     Running            1068 (3m2s ago)    6d3h
supportbundle-agent-bundle-bful3-mdsz2                 1/1     Running            1178 (7m23s ago)   6d20h
supportbundle-agent-bundle-cylq3-72b42                 1/1     Running            1178 (7m8s ago)    6d20h
supportbundle-agent-bundle-cylq3-jsxk7                 0/1     CrashLoopBackOff   1067 (2m19s ago)   6d3h
supportbundle-agent-bundle-cylq3-mntvn                 1/1     Running            4546 (8m20s ago)   98d
supportbundle-manager-bundle-bful3-6794b8cd4c-dgt54    1/1     Running            1 (46d ago)        98d
supportbundle-manager-bundle-cylq3-68b6cf5fdc-89vxk    1/1     Running            1 (46d ago)        98d
s
What is the result of
Copy code
kubectl get supportbundle -A
q
Copy code
NAMESPACE          NAME           ISSUE_URL   DESCRIPTION   AGE
harvester-system   bundle-bful3               primarydc    264d
harvester-system   bundle-cylq3               primarydc    264d
also, fyi, i cleared out the harvester-01 folder mentioned earlier, probably 2 or 3 days ago. there are already 1142 files with filenames like scc_supportconfig_harvester-01_{uid}.txz in that folder (harvester-01:/usr/local/.state/var-log.bind). just seems odd that it's generating that many files. is this normal?
s
The scc_supportconfig files are generated by support-bundle-kit when generating the support bundle.
could you remove the redundant supportbundle resources? like the ones you mentioned:
Copy code
NAMESPACE          NAME           ISSUE_URL   DESCRIPTION   AGE
harvester-system   bundle-bful3               primarydc    264d
harvester-system   bundle-cylq3               primarydc    264d
these two
then check that the corresponding pods are also removed
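For example, using one of the names from the listing above:
Copy code
kubectl -n harvester-system delete supportbundle bundle-bful3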
q
here's the result:
Copy code
supportbundle-agent-bundle-cylq3-72b42                 1/1     Running   1180 (4m48s ago)   6d20h   10.52.1.213     harvester-03   <none>           <none>
supportbundle-agent-bundle-cylq3-jsxk7                 1/1     Running   1069 (8m42s ago)   6d3h    10.52.0.112     harvester-02   <none>           <none>
supportbundle-agent-bundle-cylq3-mntvn                 1/1     Running   4547 (8m57s ago)   98d     10.52.2.199     harvester-01   <none>           <none>
supportbundle-manager-bundle-cylq3-68b6cf5fdc-89vxk    1/1     Running   1 (46d ago)        98d     10.52.2.208     harvester-01   <none>           <none>
looks like each node has a supportbundle-agent pod running, i assume that's expected... anything else?
s
No, it should not.
you can see the previous bundle-bful3 related pods were deleted after you removed the supportbundle resource
q
yeah, looks like that's what happened. i had 4, now only 3.
s
could you also check the ds?
Copy code
kubectl get ds -n harvester-system |grep supportbundle
and deployment also
Copy code
kubectl get deployment -n harvester-system |grep supportbundle
q
Copy code
supportbundle-agent-bundle-cylq3   3         3         3       3            3           <http://harvesterhci.io/managed=true|harvesterhci.io/managed=true>                 98d
Copy code
supportbundle-manager-bundle-cylq3     1/1     1            1           264d
s
the cylq3 was already removed?
Copy code
kubectl get supportbundle -n harvester-system
The supportbundle-manager is from a Deployment, and the daemonset is created by the supportbundle-manager
q
i deleted one of them. did you want me to delete both?
s
Yeah
q
i'm sorry, i misunderstood. i thought you wanted me to delete the redundant one, as in one of the two.
s
This CR is created on demand
q
okay, no more support bundle pods running now.
s
> i'm sorry, i misunderstood. i thought you wanted me to delete the redundant one, as in one of the two.
No worries, just delete the other one 😄
Cool, could you check again to see if the scc_supportconfig-xxx files still exist?
q
so in a nutshell, supportbundle pods should only be running while a bundle is being generated
s
yeah, exactly
q
if they're still running after i've gotten my bundle, something messed up and they're running perpetually.
files are still in the folder. should i nuke them?
s
Hmm…
could you try the following command?
Copy code
lsof +D <scc_supportconfig-xxx file>
q
i'm guessing in this case, they are there, so they will stay there. but new ones won't be generated? kinda like orphaned logs more or less.
i'm getting errors on that command, so i must be doing something wrong. i tried:
Copy code
lsof +D scc_supportconfig_harvester-01_7e9b8e58-dc66-4642-926f-b03a3c886ecb.txz
s
Copy code
lsof +D /usr/local/.state/var-log.bind/ -P |grep scc_supportconfig
use this command
ah… grep for scc_supportconfig
q
nothing comes up
s
OK… nuke them
q
before grepping, i got results 🙂 (containers and what-not using the log files)
s
yeah, but we should focus on the scc_supportconfig-xxx files. please do not nuke the whole folder, just remove the files prefixed with scc_supportconfig
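Something like the following should remove only those artifacts (review the ls output before running the rm):
Copy code
ls /usr/local/.state/var-log.bind/scc_supportconfig_*
rm /usr/local/.state/var-log.bind/scc_supportconfig_*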
q
yeah, i got you 🙂
i was more just confirming the command worked, and that nothing is accessing the support files
s
yeah, I just want to confirm again that no other process is using them
q
rinse and repeat on other nodes?
s
yeah,
q
great, thanks for the help!
once i figured out those were the problem, i was able to move them, but good to know how to resolve it entirely 🙂
s
BTW, I found you are running on v1.2.0. Is there any chance to upgrade to a stable release like v1.2.2?
> once i figured out those were the problem, i was able to move them, but good to know how to resolve it entirely 🙂
NP! I thought it might be a bug in the support bundle kit. I will check with the team to see if we have ever hit this issue. (For me, it's the first time I've ever seen it.)
q
well, i'm happy to bring real issues to you guys, instead of simple things. i like to try to nail those down myself
🙌 1
as for the upgrade, yeah, i'm probably going to do that next week. i've been in and out of town lately, so i haven't had a chance. i like to be in the office when doing the upgrade.
can i ask you a question about the upgrade?
s
sure, please
q
if i remember right when i was looking at it, the support matrix for a downstream rancher changed, right? i have a "custom" rke2 cluster running on a k3s rancher instance (k3s is outside of the harvester cluster) where i got the csi and what-not tied into harvester. (hopefully that all makes sense)
what's the process to upgrade harvester and rancher so i don't break my rke2 cluster (that's the one running with harvester pvcs)? rke2 is 1.24 i believe. it's a production cluster, so downtime has to be minimal
s
I would like to check with some teammates who are more familiar with the downstream cluster mechanism about this question. Once we have the comment/suggestion, we will update it here.
q
thanks!
s
Hi @prehistoric-balloon-31801, @great-bear-19718, Could you help to give some suggestions for the above upgrade situation? Thanks!
g
what is your version of rancher? and is the k3s cluster a downstream cluster running on harvester?
q
@great-bear-19718, rancher is 2.7.6, and no, the k3s cluster is running on machines outside of harvester. it's only the rke2 cluster (managed by rancher) that's downstream. i basically stood up 3 k3s nodes, installed rancher, installed harvester (on 3 different nodes), then set up a "custom" rke2 1.24 cluster in rancher, and manually added the csi driver and what-not for the longhorn volume provisioning in harvester.
g
upgrading harvester will not upgrade downstream clusters
👍 1
and neither will it bump the csi… that will only happen when you trigger an upgrade of the downstream custom cluster from rancher
q
thanks @great-bear-19718