# harvester
h
I started having this issue earlier this week. I had some kind of event, I’m not sure what, possibly a network failure, and now none of my existing harvester storage class volumes will attach. Creating a new storage class in both harvester and rancher doesn’t work either. I even created a new k8s cluster in rancher, and it has the same error. It seems that whatever happened caused the CSI driver to stop working?
Oh, I can create a new VM, and the PV creates, it just won’t attach.
m
can you do me a favor and see if you can run `echo $KUBECONFIG` on your nodes?
h
Logging in via the rancher ssh shell, `$KUBECONFIG` is empty
m
is that directly on your harvester nodes? I ssh directly into them
h
Sorry, I thought you meant the rancher nodes
On the harvester node, it returns `/etc/rancher/rke2/rke2.yaml`
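For anyone following along, a quick sanity check that the kubeconfig on the node actually works (assuming kubectl is present on the harvester node, which it should be on an RKE2-based install):

```bash
# On the harvester node: point kubectl at the RKE2 kubeconfig and confirm the API responds
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes
```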
m
you only have a single node running?
h
No, it’s a 3 node cluster
This has been running for most of this year with no problems. Something happened (this is my home lab) last weekend and this hasn’t worked since
One of the nodes became unresponsive, and when I logged in it had a load average of over 70 (on a 24-core server)
m
one issue I noticed is the node I was performing failover testing on now has the rke2.yaml file missing
h
Hm, let me check the other nodes
Mine are all still there
Looks like the content is identical between them all
I’ve been digging at this for a day or two, and what I’ve found is that the CSI attacher doesn’t even seem to attempt to attach. There’s nothing in the logs, so I started looking at whether there is a debugging mode for the attacher. I can provision, but not attach.
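For anyone hitting the same thing, one way to narrow it down is to check whether a VolumeAttachment object even gets created for the PVC, and whether it carries an attach error - roughly, against the cluster where the workload runs:

```bash
# Attach requests are tracked as VolumeAttachment objects; ATTACHED should flip to true
kubectl get volumeattachments
# The status/events of a stuck one usually carry the attacher's error, if it got that far
kubectl describe volumeattachment <volumeattachment-name>
```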
m
Ok, so that means my thoughts on it are gone. Hmmm
yeah, the fact there's nothing to follow through on logs with is very frustrating
h
I’ve been debating if I should spend effort on tracking this down or just rebuild. Since someone else is having problems, I might go for the debugging.
m
I'd like to figure it out - this isn't the first time we've had this happen; I rebuilt after trying to figure it out the first time or two. What are your TOR switches?
h
There is a `Debug` bool in the config object for the driver. I just need to figure out how to flip it.
Remember this is a home lab, TOR=core for me 😛
Unifi brand managed switch
m
so ... I think we may have found the issue. mine are unifi too
h
Ah, there’s a debug flag on the binary. I’ll kick mine into debugging mode
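In case it helps anyone else, this is roughly how I flipped it - the deployment name and namespace here are just what I'd expect on a guest cluster, so adjust to whatever your install actually uses:

```bash
# Find the controller deployment carrying the CSI sidecars (name/namespace may differ)
kubectl -n kube-system get deployments | grep -i csi
# Add the debug flag to the container args and let the pods roll
kubectl -n kube-system edit deployment harvester-csi-driver-controllers
```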
m
k
h
That…. seems… weird?
m
do you have a unifi gateway?
h
Yes
m
so, one thing I've had issues with in the past is stale entries for VMW objects preventing connectivity or confusing things. I wonder if that is happening here.
are you running anything bonded?
h
Yes. Two port bond
m
wonder if that could be it too
wonder what would happen if you reconfigured it to just use one NIC
h
I can understand how that would be an immediate cause that broke something, and it’s in a weird state now, but since I have some storage classes that are still working, I’m not sure it’s the ongoing problem? But I don’t really have another great idea at the moment.
I got the attacher into debug mode, now I have to sort through the wall of text 🙂
There is nothing in the logs other than leader election/renewal. I think it’s before we get to the attacher
m
Well, that's just fun
h
I’ve found the code path that’s causing the error, and the message we’re seeing is silently swallowing the root error. I don’t think the root error had any additional information as it’s a fall-through. I’m going to work on investigating the conditions that aren’t being met
Oh, that entire block has been removed in 1.0
Never mind, that’s an old tag (I think)
m
dang
h
At least I’ve got some avenues of testing now.
It’s not horribly more informative than what we had, but here’s a debug-mode capture of the attacher failing
m
Yeah. I wonder why building VMs can attach a PVC fine, but when trying to add them it doesn't like it
h
It’s the csi-attacher running inside the rancher cluster. Not the one running in harvester.
There are debugging statements that might help me track down what’s going on, and there’s a debug flag, but I don’t see anywhere that the debug flag actually changes the logging level to include debug statements. I’m probably going to have to compile/run my own version to test that.
I need to head out for some errands, but I’ll try to look at it later
m
ah mmk - I appreciate you looking further into it. hopefully we can hear something from them
h
Well, this is interesting!
time="2024-11-27T22:44:52Z" level=warning msg="waitForVolumeSettled: error while waiting for volume pvc-558bc901-e048-49c3-a302-851512a7def9 to be settled. Err: <http://volumes.longhorn.io|volumes.longhorn.io> \"pvc-558bc901-e048-49c3-a302-851512a7def9\" is forbidden: User \"system:serviceaccount:default:test\" cannot get resource \"volumes\" in API group \"<http://longhorn.io|longhorn.io>\" in the namespace \"longhorn-system\""
Could this just be a permissions problem?
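A quick way to test that theory without changing anything is to impersonate the service account from the error and ask the API server directly - this is against the harvester cluster, since that's where the longhorn.io API lives:

```bash
# Prints "no" while the RBAC rule is missing, "yes" once it's in place
kubectl auth can-i get volumes.longhorn.io \
  --namespace longhorn-system \
  --as system:serviceaccount:default:test
```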
m
possibly, but why would network isolation cause a permissions issue? my node that crashed lost its default local user completely.
h
That’s it! Adding the permissions to the ClusterRole on harvester caused it to start working again!
Now, the question becomes how that happens in a network partition situation.
m
heh, baby steps. I'll take a look at those on mine
h
Do you have a harvester cluster that is still working?
m
no
h
OK, I’ve got a friend who probably does
m
which permissions did you set?
h
So, the service account `test` in the `default` namespace is the account running the VMs in the rancher cluster. The clusterrolebinding `default-test` binds that SA to `ClusterRole/harvesterhci.io:csi-driver`.
The contents of the cluster role were
I added
- apiGroups:
  - longhorn.io
  resources:
  - volumes
  verbs:
  - get
  - list
  - watch
to the bottom and it started working
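If it's useful, the same rule can be applied non-interactively with a JSON patch instead of hand-editing the ClusterRole (run against the harvester cluster; double-check the role name matches yours first):

```bash
kubectl patch clusterrole harvesterhci.io:csi-driver --type=json -p='[
  {"op": "add", "path": "/rules/-", "value": {
    "apiGroups": ["longhorn.io"],
    "resources": ["volumes"],
    "verbs": ["get", "list", "watch"]
  }}
]'
```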
m
thanks again! I'll see if I can use it to restore my stuff
h
That fixed all my clusters at once, since they are all bound to the same ClusterRole
The helm chart that owns that ClusterRole hasn’t been applied since September 7th.
The workaround is to update the RBAC
BTW, you have to undo this fix to get the upgrade to 1.4.0 to run