# harvester
a
This message was deleted.
c
I think you could try to extend the volume size. BTW, how large is your disk? Some information: https://github.com/kubernetes/kubernetes/issues/71869
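For reference, a rough sketch of expanding a PVC, assuming its StorageClass has allowVolumeExpansion: true (the PVC name, namespace, and size below are placeholders, not values from this thread):
Copy code
# ask Kubernetes/Longhorn to grow the volume by raising the requested size
kubectl -n default patch pvc my-pvc --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'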
q
so i ended up tracking it down to a ton of support bundles being stored in /usr/local/.state/var-log.bind/
i moved them to another disk and rebooted the node, and it came back online. any idea why this dir is filling up with bundles?
Copy code
-rw-------   1 root  root                  33 Jun 28 16:05 scc_supportconfig_harvester-02_faf36df2-e2a8-4c08-92e2-2b341c287278.txz.md5
-rw-------   1 root  root            16195552 Jun 28 13:45 scc_supportconfig_harvester-02_fc975c22-2bdd-4ab9-a403-560057516f56.txz
-rw-------   1 root  root                  33 Jun 28 13:45 scc_supportconfig_harvester-02_fc975c22-2bdd-4ab9-a403-560057516f56.txz.md5
-rw-------   1 root  root            16181484 Jun 28 13:31 scc_supportconfig_harvester-02_ff0e5197-d223-44f5-9189-e54fa9d679f9.txz
-rw-------   1 root  root                  33 Jun 28 13:31 scc_supportconfig_harvester-02_ff0e5197-d223-44f5-9189-e54fa9d679f9.txz.md5
-rw-------   1 root  root            16527476 Jun 28 17:37 scc_supportconfig_harvester-02_ffa512f9-5929-4e0b-bbd9-66b4c8fd6172.txz
-rw-------   1 root  root                  33 Jun 28 17:37 scc_supportconfig_harvester-02_ffa512f9-5929-4e0b-bbd9-66b4c8fd6172.txz.md5
since clearing it out this morning, i already have probably 50+ in there
c
I'm guessing it's not directly related to this one. May I know how large your disk is? And could you provide the support bundle by following this doc https://docs.harvesterhci.io/v0.3/troubleshooting/harvester/?
s
Hi @quaint-alarm-7893, could you provide the output of the following command for investigation?
Copy code
$ kubectl get nodes.longhorn.io -n longhorn-system -o yaml
Also, I am curious why the supportconfig is still here. Do you have a running support bundle collector?
q
@clean-cpu-90380 and @salmon-city-57654, the disk space consumption was definitely from the support bundles in that folder. this cluster had been running for 230-ish days. each node at this point is holding about 102G of 203G (provisioned by harvester at install) on the /usr/local folder. all these files were in the /usr/local/.state/var-log.bind/ folder and filling up the disk to 93%, which seemed to cause the disk pressure. i have a support bundle, but it's from several days after i cleaned out all the nodes.
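For reference, a quick way to check where the space is going is something like this (same paths as above):
Copy code
df -h /usr/local
du -sh /usr/local/.state/var-log.bind/
ls /usr/local/.state/var-log.bind/ | grep -c scc_supportconfig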
re: your question on a support bundle collector. how can i check?
i didn't intentionally set anything up though, if that's something i would have had to configure.
s
Hi @quaint-alarm-7893, I can do more checks with the support bundle. I will update you, maybe today or tomorrow. Thanks!
q
@salmon-city-57654 sounds good. thanks!
c
@quaint-alarm-7893 I noticed most replicas on harvester-03 are on disk 0ab2811e-35ba-4460-9d7f-e4254243e191 (/var/lib/harvester/extra-disks/52bbeb3f093c421d7312970052f3ff8f), while the other disk, 5f01bb7e-a10a-4201-9eba-d5639f17cfd8 (/var/lib/harvester/extra-disks/70b3d41d058a1ae36a29f5c34c7fe6bc), has just a few replicas. I think you could move some replicas to that other disk. You could try marking the crowded disk unschedulable, then delete some replicas once the volumes are healthy, and mark it schedulable again afterwards. @salmon-city-57654, please feel free to add more explanation based on your advice. Thank you.
👍 1
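As a rough sketch of the unschedulable step via the Longhorn node CR (it can also be done from the Longhorn UI; <disk-name> below is a placeholder, use the actual key under spec.disks):
Copy code
# inspect the disks and their scheduling state on harvester-03
kubectl -n longhorn-system get nodes.longhorn.io harvester-03 -o yaml
# mark one disk unschedulable (JSON merge patch keeps the other disk entries intact)
kubectl -n longhorn-system patch nodes.longhorn.io harvester-03 --type merge \
  -p '{"spec":{"disks":{"<disk-name>":{"allowScheduling":false}}}}'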
q
@clean-cpu-90380 i can rebalance some of the pvcs on harvester-03, for sure. i'll do that, thanks. but just to be clear, it wasn't an issue on the extra disks, it was all on the /usr/local mount, which is one of the install / system partitions created by harvester when it was initially set up. that disk was full of support bundles in the dir mentioned earlier in this thread.
s
Hi @quaint-alarm-7893, could you check on your cluster with the following command?
Copy code
kubectl get pods -n harvester-system |grep supportbundle
I think what @clean-cpu-90380 pointed out is one of the DiskPressure issues from your SB. So you still need to rebalance some volumes on harvester-03 to resolve it.
q
Copy code
supportbundle-agent-bundle-bful3-6s289                 1/1     Running            4632 (8m16s ago)   98d
supportbundle-agent-bundle-bful3-9qdjj                 1/1     Running            1068 (3m2s ago)    6d3h
supportbundle-agent-bundle-bful3-mdsz2                 1/1     Running            1178 (7m23s ago)   6d20h
supportbundle-agent-bundle-cylq3-72b42                 1/1     Running            1178 (7m8s ago)    6d20h
supportbundle-agent-bundle-cylq3-jsxk7                 0/1     CrashLoopBackOff   1067 (2m19s ago)   6d3h
supportbundle-agent-bundle-cylq3-mntvn                 1/1     Running            4546 (8m20s ago)   98d
supportbundle-manager-bundle-bful3-6794b8cd4c-dgt54    1/1     Running            1 (46d ago)        98d
supportbundle-manager-bundle-cylq3-68b6cf5fdc-89vxk    1/1     Running            1 (46d ago)        98d
s
What is the result of
Copy code
kubectl get supportbundle -A
q
Copy code
NAMESPACE          NAME           ISSUE_URL   DESCRIPTION   AGE
harvester-system   bundle-bful3               primarydc    264d
harvester-system   bundle-cylq3               primarydc    264d
also, fyi, i cleared out the harvester-01 folder mentioned earlier, probably 2 or 3 days ago. there are already 1142 files with filenames like scc_supportconfig_harvester-01_{uid}.txz in that folder (harvester-01:/usr/local/.state/var-log.bind). just seems odd that it's generating that many files. is this normal?
s
The scc_supportconfig files are generated by support-bundle-kit when generating the support bundle.
could you remove the redundant supportbundle resources? like the ones you mentioned:
Copy code
NAMESPACE          NAME           ISSUE_URL   DESCRIPTION   AGE
harvester-system   bundle-bful3               primarydc    264d
harvester-system   bundle-cylq3               primarydc    264d
these two
then check that the corresponding pods are also removed
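For example, using one of the names from the listing above:
Copy code
kubectl -n harvester-system delete supportbundle bundle-bful3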
q
here's the result:
Copy code
supportbundle-agent-bundle-cylq3-72b42                 1/1     Running   1180 (4m48s ago)   6d20h   10.52.1.213     harvester-03   <none>           <none>
supportbundle-agent-bundle-cylq3-jsxk7                 1/1     Running   1069 (8m42s ago)   6d3h    10.52.0.112     harvester-02   <none>           <none>
supportbundle-agent-bundle-cylq3-mntvn                 1/1     Running   4547 (8m57s ago)   98d     10.52.2.199     harvester-01   <none>           <none>
supportbundle-manager-bundle-cylq3-68b6cf5fdc-89vxk    1/1     Running   1 (46d ago)        98d     10.52.2.208     harvester-01   <none>           <none>
looks like each node has a supportbundle-agent pod running, i assume that's expected... anything else?
s
No, it should not.
you can see the previous bundle-bful3 related pods were deleted after you removed the supportbundle resource
q
yeah, looks like that's what happened. i had 4, now only 3.
s
could you also check the ds?
Copy code
kubectl get ds -n harvester-system |grep supportbundle
and deployment also
Copy code
kubectl get deployment -n harvester-system |grep supportbundle
q
Copy code
supportbundle-agent-bundle-cylq3   3         3         3       3            3           <http://harvesterhci.io/managed=true|harvesterhci.io/managed=true>                 98d
Copy code
supportbundle-manager-bundle-cylq3     1/1     1            1           264d
s
the cylq3 was already removed?
Copy code
kubectl get supportbundle -n harvester-system
The supportbundle-manager is from a Deployment, and the daemonset is created by the supportbundle-manager
q
i deleted one of them. did you want me to delete both?
s
Yeah
q
i'm sorry, i misunderstood. i thought you wanted me to delete the redundant one, as in one of the two.
s
This CR is created on demand
q
okay, no more support bundle pods running now.
s
> i'm sorry, i misunderstood. i thought you wanted me to delete the redundant one, as in one of the two.
No worries, just delete the other one 😄
Cool, could you check again to see if the scc_supportconfig-xxx files still exist?
q
so in a nutshell, supportbundle pods should only be running while a bundle is being generated
s
yeah, exactly
q
if they're still running after i've gotten my bundle, something messed up and they're running perpetually.
files are still in the folder. should i nuke them?
s
Hmm…
could you try the following command?
Copy code
lsof +D <scc_supportconfig-xxx file>
q
i'm guessing in this case, they are there, so they will stay there. but new ones won't be generated? kinda like orphaned logs more or less.
i'm getting errors on that command, so i must be doing something wrong. i tried:
Copy code
lsof +D scc_supportconfig_harvester-01_7e9b8e58-dc66-4642-926f-b03a3c886ecb.txz
s
Copy code
lsof +D /usr/local/.state/var-log.bind/ -P |grep scc_supportconfig
use this command
ah… grep for scc_supportconfig
q
nothing comes up
s
OK… nuke them
q
before grepping, i got results 🙂 (containers and what-not using the log files)
s
yeah, but we should focus on the scc_supportconfig-xxx files. please do not nuke the whole folder, just remove the files prefixed with scc_supportconfig
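Something like the following should remove only those artifacts (review the ls output before running the rm):
Copy code
ls /usr/local/.state/var-log.bind/scc_supportconfig_*
rm /usr/local/.state/var-log.bind/scc_supportconfig_*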
q
yeah, i got you 🙂
i was more just confirming the command worked, and that nothing is accessing the support files
s
yeah, I just want to confirm again that no other process is using them
q
rinse and repeat on other nodes?
s
yeah,
q
great, thanks for the help!
once i figured out those were the problem, i was able to move them, but good to know how to resolve it entirely 🙂
s
BTW, I found you are running on v1.2.0. Is there any chance to upgrade to a stable release like v1.2.2?
> once i figured out those were the problem, i was able to move them, but good to know how to resolve it entirely 🙂
NP! I thought it might be a bug in the support bundle kit. I will check with the team to see if we have ever hit this issue. (For me, it's the first time I've ever seen it.)
q
well, i'm happy to bring real issues to you guys, instead of simple things. i like to try to nail those down myself
🙌 1
as for the upgrade, yeah, i'm probably going to do that next week. i've been in and out of town lately, so i haven't had a chance. i like to be in the office when doing the upgrade.
can i ask you a question about the upgrade?
s
sure, please
q
if i remember right when i was looking at it, the support matrix for a downstream rancher changed, right? i have a "custom" rke2 cluster running on a k3s rancher instance (k3s is outside of the harvester cluster) where i got the csi and what-not tied into harvester. (hopefully that all makes sense)
what's the process to upgrade harvester and rancher so i don't break my rke2 cluster (that's the one running with harvester pvcs)? rke2 is 1.24 i believe. it's a production cluster, so downtime has to be minimal
s
I would like to check with some teammates who are more familiar with the downstream cluster mechanism about this question. Once we have the comment/suggestion, we will update it here.
q
thanks!
s
Hi @prehistoric-balloon-31801, @great-bear-19718, Could you help to give some suggestions for the above upgrade situation? Thanks!
g
what is your version of rancher? and is the k3s cluster a downstream cluster running on harvester?
q
@great-bear-19718, rancher is 2.7.6, and no, the k3s cluster is running on machines outside of harvester. it's only the rke2 cluster (managed by rancher) that's downstream. i basically stood up 3 k3s nodes, installed rancher, installed harvester (on 3 different nodes), then set up a "custom" rke2 1.24 cluster in rancher, and manually added the csi driver and what-not for the longhorn volume provisioning in harvester.
g
upgrading harvester will not upgrade downstream clusters
👍 1
and neither will it bump the csi… that will only happen when you trigger an upgrade of the downstream custom cluster from rancher
q
thanks @great-bear-19718