# harvester
g
I have a node with a drive that periodically seems to have an issue. The BIOS sees the disk just fine, but Harvester doesn't, so I'm trying to figure out which disk in the node is problematic. Both data disks are 2TB NVMe, so I guess what I'm asking is... how do I determine the specific disk, so I can pull and replace it?
h
Harvester doesn't see it. As in the UI, or via the CLI?
g
shows as an error in UI. I removed the disk from the node in the UI, rebooted, but there's no disk available to add in. Not sure exactly how I'd check this in the CLI.
rancher@harvester1:~> lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0     3G  1 loop /
sr0          11:0    1  1024M  0 rom
nvme0n1     259:0    0   1.8T  0 disk
nvme2n1     259:1    0   1.8T  0 disk /var/lib/harvester/defaultdisk
nvme1n1     259:2    0 465.8G  0 disk
├─nvme1n1p1 259:3    0    64M  0 part
├─nvme1n1p2 259:4    0    50M  0 part /oem
├─nvme1n1p3 259:5    0     8G  0 part
├─nvme1n1p4 259:6    0    15G  0 part /run/initramfs/cos-state
└─nvme1n1p5 259:7    0 442.6G  0 part /var/lib/longhorn
                                      /var/crash
                                      /var/lib/third-party
                                      /var/lib/cni
                                      /var/lib/wicked
                                      /var/lib/kubelet
                                      /var/lib/rancher
                                      /var/log
                                      /usr/libexec
                                      /root
                                      /opt
                                      /home
                                      /etc/pki/trust/anchors
                                      /etc/cni
                                      /etc/nvme
                                      /etc/iscsi
                                      /etc/ssh
                                      /etc/rancher
                                      /etc/systemd
                                      /usr/local
The device with issues is nvme0n1.
When I try to add the disk back in the UI, this is what I see:
[screenshot of the disk error in the Harvester UI]
h
Yeah, wipe it. Assuming that's it and it's empty.
wipefs -a /dev/nvme0n1
Then wait. Can't remember how often Harvester rescans for new disks.
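(A hedged aside: wipefs run with no options is read-only and just lists the signatures it finds, which is a safe way to confirm you're pointing at the right device before wiping:)
# list filesystem/partition-table signatures without erasing anything (read-only)
sudo wipefs /dev/nvme0n1
# then erase all detected signatures for real
sudo wipefs -a /dev/nvme0n1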
g
thanks - I actually reinstalled this node a couple of months back after seeing this same thing, so I suspect there's an actual hardware issue
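(If a hardware fault is suspected, the kernel log is a quick cross-check; a minimal sketch using stock Linux tools, nothing Harvester-specific:)
# look for controller resets, timeouts, or I/O errors on the suspect device
sudo dmesg | grep -i nvme0
# same search against the persistent kernel journal, if available
sudo journalctl -k | grep -i nvme0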
Hmm, so I did
sudo wipefs -a /dev/nvme0n1
yesterday, and I still don't see the option to add the disk back in the UI
any way to get the serial number for the device from the Harvester CLI?
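(A few standard options from the node shell; these are stock Linux tools rather than anything Harvester-specific:)
# model and serial straight from sysfs
lsblk -dno NAME,MODEL,SERIAL /dev/nvme0n1
# the by-id symlinks encode model and serial in the link name
ls -l /dev/disk/by-id/ | grep nvme0n1
# if nvme-cli is on the node, query the controller directly
sudo nvme id-ctrl /dev/nvme0n1 | grep -i '^sn'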
of course, I could just do
sudo smartctl -a /dev/nvme0n1
interestingly, not seeing any errors in SMART for the device
already on the latest firmware, so maybe there's something else amiss
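(If smartctl looks clean, the NVMe-native logs are worth checking too; a sketch assuming nvme-cli is present on the node:)
# NVMe health summary: critical warnings, media errors, wear level
sudo nvme smart-log /dev/nvme0n1
# controller error-log entries, which can catch transient faults a clean SMART summary misses
sudo nvme error-log /dev/nvme0n1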