# harvester
q
@salmon-city-57654 Thanks again!
@salmon-city-57654 if i power off all the vms and it hangs for some reason, do you know if i'll be able to turn vms back on on other nodes until i can work out why the node that was upgrading hung up? also, i suspect i have an issue with prometheus:

Schedulable Node harvester-01: The disk 4fd48c41e7c49a49fec748a73bcbc8d0 (/var/lib/harvester/extra-disks/4fd48c41e7c49a49fec748a73bcbc8d0) on the node harvester-01 has 244632780800 available, but requires reserved 0, minimal 25% to schedule more replicas
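For reference: that message comes from Longhorn's disk-scheduling check. A minimal sketch for inspecting the setting behind the "minimal 25%" requirement, assuming Longhorn runs in Harvester's default longhorn-system namespace:

```
# Longhorn refuses to schedule new replicas on a disk once its free space
# drops below this percentage; the default value is 25.
kubectl -n longhorn-system get settings.longhorn.io \
  storage-minimal-available-percentage -o jsonpath='{.value}'
```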
s
Hi @quaint-alarm-7893, is this volume related to the VM hangs on power-off? for this volume, are nodes harvester-03 and harvester-02 down? for harvester-01, is this extra disk already full, or could it not schedule due to the overcommit setup?
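A sketch of how these questions could be checked from the CLI rather than the UI, again assuming the default longhorn-system namespace:

```
# Longhorn's view of each node: ready/schedulable status per node.
kubectl -n longhorn-system get nodes.longhorn.io

# Per-disk detail for harvester-01, including the Schedulable condition
# and the storageAvailable/storageScheduled/storageMaximum counters.
kubectl -n longhorn-system get nodes.longhorn.io harvester-01 \
  -o jsonpath='{.status.diskStatus}'
```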
q
harvester-01 was the node that had the disk corruption issue w/ the ext4 vm i mentioned the other day, and also the same node that had the DAS detach randomly and cause issues (in the past). since i cleaned all that up, i've just noticed i have no metrics, so i wanna clean that up before i upgrade to 1.1.1. all 3 nodes (harvester-01 - 03) are up. that disk does show as not schedulable. should i just delete the replica and let l.h. move it to another disk?
also what's odd, HCI shows the disk has 184GiB available, 850 scheduled, and a max of 916GB, but it's not schedulable... odd.
does it want 25% free to do anything w/ the drive? that's what it sounds like from the tooltip it shows.
@salmon-city-57654 ^
s
> that disk does show as not schedulable. should i just delete the replica and let l.h. move it to another disk?
If the related volume is healthy, you could manually delete the replica to trigger a reschedule
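A sketch of the same operation via kubectl, in case the UI refuses; `<replica-name>` and `<volume-name>` are placeholders, and the `longhornvolume` label is Longhorn's usual way of tagging replicas with their volume name:

```
# List the replica objects for a volume (take the volume name from the
# Longhorn UI or from the PV name).
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume-name>

# Deleting a failed replica makes Longhorn rebuild it on another
# schedulable disk, provided the volume itself is still healthy.
kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>
```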
> also what's odd, HCI shows the disk has 184GiB available, 850 scheduled, and a max of 916GB, but it's not schedulable... odd.
What is your overcommit setting? Or could you check for more information in the Longhorn UI? (Maybe there are some errors shown there.)
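For the overcommit question, a sketch of where the relevant settings live, assuming a stock Harvester install:

```
# Harvester's overcommit ratios (a JSON value with cpu/memory/storage).
kubectl get settings.harvesterhci.io overcommit-config -o jsonpath='{.value}'

# The Longhorn setting that controls how far storage can be overcommitted.
kubectl -n longhorn-system get settings.longhorn.io \
  storage-over-provisioning-percentage -o jsonpath='{.value}'
```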
q
@salmon-city-57654 it won't let me delete the failed replica (via longhorn), and i can't seem to get it to do anything. if i look in rancher, i found this under the prometheus pod:
```
Warning  FailedAttachVolume (3429)  3.3 mins ago  AttachVolume.Attach failed for volume "pvc-7616922b-a530-4bcc-b281-0a0438955d4d" : rpc error: code = DeadlineExceeded desc = volume pvc-7616922b-a530-4bcc-b281-0a0438955d4d failed to attach to node harvester-04
Warning  FailedMount (3075)  2.1 mins ago  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[prometheus-rancher-monitoring-prometheus-db], unattached volumes=[web-config prometheus-nginx nginx-home config config-out prometheus-rancher-monitoring-prometheus-rulefiles-0 kube-api-access-g4vsb tls-assets prometheus-rancher-monitoring-prometheus-db]: timed out waiting for the condition
```
the volume in LH looks odd too: i can't detach it (not available as an option), can't delete the replicas, and i can't "attach" it because it says it's already attached. is there a pod i can delete in kubectl or something to re-init this?
```
State: Detached
Health: Unknown
Ready for workload: Ready
Conditions: restore, scheduled
Frontend: Block Device
Attached Node & Endpoint:
Size: 50 Gi
Actual Size: 64.7 Gi
Data Locality: disabled
Access Mode: ReadWriteOnce
Engine Image: longhornio/longhorn-engine:v1.2.4
Created: 2 months ago
Encrypted: False
Node Tags:
Disk Tags:
Last Backup:
Last Backup At:
Replicas Auto Balance: ignored
Instance Manager:
Namespace: cattle-monitoring-system
PVC Name: prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-0
PV Name: pvc-7616922b-a530-4bcc-b281-0a0438955d4d
PV Status: Bound
Revision Counter Disabled: False
Pod Name: prometheus-rancher-monitoring-prometheus-0
Pod Status: Pending
Workload Name: prometheus-rancher-monitoring-prometheus
Workload Type: StatefulSet
```
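For a volume stuck like this, a sketch of how to compare Longhorn's state with Kubernetes' attachment state (the volume name is the PV name from the details above):

```
# Longhorn's full record for the volume, including robustness and the
# node it thinks the volume is attached to.
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-7616922b-a530-4bcc-b281-0a0438955d4d -o yaml

# CSI VolumeAttachment objects; a stale entry pointing at the wrong node
# can explain the DeadlineExceeded attach errors above.
kubectl get volumeattachments | grep pvc-7616922b
```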
s
hi @quaint-alarm-7893, could you open an issue on GitHub for this Prometheus problem and attach a support bundle to it? I checked some known issues with Prometheus, but didn't find the same problem as yours.
q
@salmon-city-57654 sure will. thanks! 🙂
@salmon-city-57654 fyi, i found the pod in kubectl and deleted it. it came back up fine after.
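For completeness, the workaround boils down to something like this (pod name and namespace taken from the volume details above):

```
# Delete the Pending pod; the StatefulSet controller recreates it and the
# volume re-attaches cleanly on the way back up.
kubectl -n cattle-monitoring-system delete pod \
  prometheus-rancher-monitoring-prometheus-0
```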
s
hi @quaint-alarm-7893, nice! So it looks like all the volumes in your environment are healthy now? The unhealthy Prometheus volume is a known issue, and the quick workaround is exactly what you did.
q
@salmon-city-57654 yup, i think i'm ready to run the upgrade on my prod cluster, but i go on vacation tomorrow, so i'm going to wait till i get back, just in case things hang up and i have to work through it. thanks again for your help, i appreciate it!
s
OK, feel free to open another issue if you hit any upgrade problems. Also, you could refer to https://docs.harvesterhci.io/v1.1/upgrade/v1-0-3-to-v1-1-1 for the upgrade.
q
@salmon-city-57654 i started my upgrade and it's hung on a pre-drain. i went through the issues and i don't think what i have going on is the same as the known issues. idk where to even start though 😞 i have 4 nodes, the system upgrade phase finished, images are preloaded, one node succeeded the upgrade, and the 2nd is stuck in pre-drain.
seems like rebooting the node manually got it to work through. all is well.
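If anyone needs to debug a hang like this instead of rebooting, a sketch of where to look, assuming Harvester's default harvester-system namespace (`<upgrade-name>` is a placeholder):

```
# The upgrade is tracked as a custom resource; its status records
# per-node progress, e.g. which node is stuck in pre-drain.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>

# The drain hooks run as jobs; their pod logs usually say what is
# blocking the drain.
kubectl -n harvester-system get jobs
```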