# harvester
q
@salmon-city-57654 Thanks again!
@salmon-city-57654 if i power off all the vms and it hangs for some reason, do you know if i'll be able to turn vms back on on other nodes until i can work out why the node that was upgrading hung up? also, i suspect i have an issue with prometheus:

Schedulable Node harvester-01: The disk 4fd48c41e7c49a49fec748a73bcbc8d0 (/var/lib/harvester/extra-disks/4fd48c41e7c49a49fec748a73bcbc8d0) on the node harvester-01 has 244632780800 available, but requires reserved 0, minimal 25% to schedule more replicas
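For reference: that message comes from Longhorn's disk-scheduling check. A minimal sketch for inspecting the setting behind the "minimal 25%" requirement, assuming Longhorn runs in Harvester's default longhorn-system namespace:

```
# Longhorn refuses to schedule new replicas on a disk once its free space
# drops below this percentage; the default value is 25.
kubectl -n longhorn-system get settings.longhorn.io \
  storage-minimal-available-percentage -o jsonpath='{.value}'
```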
s
Hi @quaint-alarm-7893, is this volume related to the VM hangs on power-off? for this volume, are nodes harvester-03 and harvester-02 down? for harvester-01, is this extra disk already full, or could it not schedule due to the overcommit setup?
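A sketch of how these questions could be checked from the CLI rather than the UI, again assuming the default longhorn-system namespace:

```
# Longhorn's view of each node: ready/schedulable status per node.
kubectl -n longhorn-system get nodes.longhorn.io

# Per-disk detail for harvester-01, including the Schedulable condition
# and the storageAvailable/storageScheduled/storageMaximum counters.
kubectl -n longhorn-system get nodes.longhorn.io harvester-01 \
  -o jsonpath='{.status.diskStatus}'
```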
q
harvester-01 was the node that had the disk corruption issue w/ the ext4 vm i mentioned the other day, and also the same node that had the DAS detach randomly and cause issues (in the past). since i cleaned all that up, i've just noticed i have no metrics, so i wanna clean that up before i upgrade to 1.1.1. all 3 nodes (harvester-01 - 03) are up. that disk does show as not schedulable. should i just delete the replica and let l.h. move it to another disk?
also what's odd, HCI shows the disk has 184GiB available, 850 scheduled, and a max of 916GB, but it's not schedulable... odd.
does it want 25% free to do anything w/ the drive? that's what it sounds like from the tooltip it shows.
@salmon-city-57654 ^
s
> that disk does show as not schedulable. should i just delete the replica and let l.h. move it to another disk?
If the related volume is healthy, you could manually delete the replica to trigger a reschedule
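A sketch of the same operation via kubectl, in case the UI refuses; `<replica-name>` and `<volume-name>` are placeholders, and the `longhornvolume` label is Longhorn's usual way of tagging replicas with their volume name:

```
# List the replica objects for a volume (take the volume name from the
# Longhorn UI or from the PV name).
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume-name>

# Deleting a failed replica makes Longhorn rebuild it on another
# schedulable disk, provided the volume itself is still healthy.
kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>
```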
> also what's odd, HCI shows the disk has 184GiB available, 850 scheduled, and a max of 916GB, but it's not schedulable... odd.
What is your overcommit setting? Or could you check for more information in the Longhorn UI? (Maybe there are some errors shown there.)
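For the overcommit question, a sketch of where the relevant settings live, assuming a stock Harvester install:

```
# Harvester's overcommit ratios (a JSON value with cpu/memory/storage).
kubectl get settings.harvesterhci.io overcommit-config -o jsonpath='{.value}'

# The Longhorn setting that controls how far storage can be overcommitted.
kubectl -n longhorn-system get settings.longhorn.io \
  storage-over-provisioning-percentage -o jsonpath='{.value}'
```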
q
@salmon-city-57654 it won't let me delete the failed replica (via longhorn), and i can't seem to get it to do anything. if i look in rancher, i found this under the prometheus pod:
```
Warning  FailedAttachVolume (3429)  3.3 mins ago  AttachVolume.Attach failed for volume "pvc-7616922b-a530-4bcc-b281-0a0438955d4d" : rpc error: code = DeadlineExceeded desc = volume pvc-7616922b-a530-4bcc-b281-0a0438955d4d failed to attach to node harvester-04
Warning  FailedMount (3075)  2.1 mins ago  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[prometheus-rancher-monitoring-prometheus-db], unattached volumes=[web-config prometheus-nginx nginx-home config config-out prometheus-rancher-monitoring-prometheus-rulefiles-0 kube-api-access-g4vsb tls-assets prometheus-rancher-monitoring-prometheus-db]: timed out waiting for the condition
```
the volume in LH looks odd too: i can't detach it (not available as an option), can't delete the replicas, and i can't "attach" it because it says it's already attached. is there a pod i can delete in kubectl or something to re-init this?
```
State: Detached
Health: Unknown
Ready for workload: Ready
Conditions: restore, scheduled
Frontend: Block Device
Attached Node & Endpoint:
Size: 50 Gi
Actual Size: 64.7 Gi
Data Locality: disabled
Access Mode: ReadWriteOnce
Engine Image: longhornio/longhorn-engine:v1.2.4
Created: 2 months ago
Encrypted: False
Node Tags:
Disk Tags:
Last Backup:
Last Backup At:
Replicas Auto Balance: ignored
Instance Manager:
Namespace: cattle-monitoring-system
PVC Name: prometheus-rancher-monitoring-prometheus-db-prometheus-rancher-monitoring-prometheus-0
PV Name: pvc-7616922b-a530-4bcc-b281-0a0438955d4d
PV Status: Bound
Revision Counter Disabled: False
Pod Name: prometheus-rancher-monitoring-prometheus-0
Pod Status: Pending
Workload Name: prometheus-rancher-monitoring-prometheus
Workload Type: StatefulSet
```
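For a volume stuck like this, a sketch of how to compare Longhorn's state with Kubernetes' attachment state (the volume name is the PV name from the details above):

```
# Longhorn's full record for the volume, including robustness and the
# node it thinks the volume is attached to.
kubectl -n longhorn-system get volumes.longhorn.io \
  pvc-7616922b-a530-4bcc-b281-0a0438955d4d -o yaml

# CSI VolumeAttachment objects; a stale entry pointing at the wrong node
# can explain the DeadlineExceeded attach errors above.
kubectl get volumeattachments | grep pvc-7616922b
```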
s
hi @quaint-alarm-7893, could you open an issue on GitHub for this Prometheus problem and attach a support bundle to it? I checked some known issues with Prometheus, but didn't find the same problem as yours.
q
@salmon-city-57654 sure will. thanks! 🙂
@salmon-city-57654 fyi, i found the pod in kubectl and deleted it. it came back up fine after.
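For completeness, the workaround boils down to something like this (pod name and namespace taken from the volume details above):

```
# Delete the Pending pod; the StatefulSet controller recreates it and the
# volume re-attaches cleanly on the way back up.
kubectl -n cattle-monitoring-system delete pod \
  prometheus-rancher-monitoring-prometheus-0
```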
s
hi @quaint-alarm-7893, nice! So it looks like all the volumes in your environment are healthy now? The unhealthy Prometheus volume is a known issue, and the quick workaround is exactly what you did.
q
@salmon-city-57654 yup, i think i'm ready to run the upgrade on my prod cluster, but i go on vacation tomorrow, so i'm going to wait till i get back, just in case things hang up and i have to work through it. thanks again for your help, i appreciate it!
s
OK, feel free to open another issue if you hit any upgrade problems. Also, you could refer to https://docs.harvesterhci.io/v1.1/upgrade/v1-0-3-to-v1-1-1 for the upgrade.
q
@salmon-city-57654 i started my upgrade and it's hung on a pre-drain. i went through the issues and i don't think what i have going on is the same as the known issues. idk where to even start though 😞 i have 4 nodes, the system upgrade phase finished, images are preloaded, one node succeeded the upgrade, and the 2nd is stuck in pre-drain.
seems like rebooting the node manually got it to work through. all is well.
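If anyone needs to debug a hang like this instead of rebooting, a sketch of where to look, assuming Harvester's default harvester-system namespace (`<upgrade-name>` is a placeholder):

```
# The upgrade is tracked as a custom resource; its status records
# per-node progress, e.g. which node is stuck in pre-drain.
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system describe upgrades.harvesterhci.io <upgrade-name>

# The drain hooks run as jobs; their pod logs usually say what is
# blocking the drain.
kubectl -n harvester-system get jobs
```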