# harvester
h
I started having this issue earlier this week. I had some kind of event, I’m not sure what, possibly a network failure, and now none of my existing harvester storage class volumes will attach. Creating a new storage class in both harvester and rancher doesn’t work either. I even created a new k8s cluster in rancher, and it has the same error. It seems that whatever happened caused the CSI driver to stop working?
Oh, I can create a new VM, and the PV creates, it just won’t attach.
m
can you do me a favor and see if you can run `echo $KUBECONFIG` on your nodes?
h
Logging in via the rancher ssh shell, `$KUBECONFIG` is empty
m
is that directly on your harvester nodes? I ssh directly into them
h
Sorry, I thought you meant the rancher nodes
On the harvester node, it returns `/etc/rancher/rke2/rke2.yaml`
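For anyone following along, a quick sanity check that the kubeconfig on the node actually works (assuming kubectl is present on the harvester node, which it should be on an RKE2-based install):

```bash
# On the harvester node: point kubectl at the RKE2 kubeconfig and confirm the API responds
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get nodes
```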
m
you only have a single node running?
h
No, it’s a 3 node cluster
This has been running for most of this year with no problems. Something happened (this is my home lab) last weekend and this hasn’t worked since
One of the nodes became unresponsive, and when I logged in it had a load average of over 70 (on a 24-core server)
m
one issue I noticed is the node I was performing failover testing on now has the rke2.yaml file missing
h
Hm, let me check the other nodes
Mine are all still there
Looks like the content is identical between them all
I’ve been digging at this for a day or two, and what I’ve found is that the CSI attacher doesn’t even seem to attempt to attach. There’s nothing in the logs, so I started looking at whether there is a debugging mode for the attacher. I can provision, but not attach.
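For anyone hitting the same thing, one way to narrow it down is to check whether a VolumeAttachment object even gets created for the PVC, and whether it carries an attach error - roughly, against the cluster where the workload runs:

```bash
# Attach requests are tracked as VolumeAttachment objects; ATTACHED should flip to true
kubectl get volumeattachments
# The status/events of a stuck one usually carry the attacher's error, if it got that far
kubectl describe volumeattachment <volumeattachment-name>
```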
m
Ok, so that means my thoughts on it are gone. Hmmm
yeah, the fact there's nothing to follow through on logs with is very frustrating
h
I’ve been debating if I should spend effort on tracking this down or just rebuild. Since someone else is having problems, I might go for the debugging.
m
I'd like to figure it out - this isn't the first time we've had this happen; I rebuilt after trying to figure it out the first time or two. What are your TOR switches?
h
There is a `Debug` bool in the config object for the driver. I just need to figure out how to flip it.
Remember this is a home lab, TOR=core for me 😛
Unifi brand managed switch
m
so ... I think we may have found the issue. mine are unifi too
h
Ah, there’s a debug flag on the binary. I’ll kick mine into debugging mode
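In case it helps anyone else, this is roughly how I flipped it - the deployment name and namespace here are just what I'd expect on a guest cluster, so adjust to whatever your install actually uses:

```bash
# Find the controller deployment carrying the CSI sidecars (name/namespace may differ)
kubectl -n kube-system get deployments | grep -i csi
# Add the debug flag to the container args and let the pods roll
kubectl -n kube-system edit deployment harvester-csi-driver-controllers
```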
m
k
h
That…. seems… weird?
m
do you have a unifi gateway?
h
Yes
m
so, one thing I've had issues with in the past is stale entries for VMW objects preventing connectivity or confusing things. I wonder if that is happening here.
are you running anything bonded?
h
Yes. Two port bond
m
wonder if that could be it too
wonder what would happen if you reconfigured it to just use one NIC
h
I can understand how that would be an immediate cause that broke something, and it’s in a weird state now, but since I have some storage classes that are still working, I’m not sure it’s the ongoing problem? But I don’t really have another great idea at the moment.
I got the attacher into debug mode, now I have to sort through the wall of text 🙂
There is nothing in the logs other than leader election/renewal. I think it’s before we get to the attacher
m
Well, that's just fun
h
I’ve found the code path that’s causing the error, and the message we’re seeing is silently swallowing the root error. I don’t think the root error had any additional information as it’s a fall-through. I’m going to work on investigating the conditions that aren’t being met
Oh, that entire block has been removed in 1.0
Never mind, that’s an old tag (I think)
m
dang
h
At least I’ve got some avenues of testing now.
It’s not horribly more informative than what we had, but here’s a debug-mode capture of the attacher failing
m
Yeah. I wonder why building VMs can attach a PVC fine, but when trying to add them it doesn't like it
h
It’s the csi-attacher running inside the rancher cluster. Not the one running in harvester.
There are debugging statements that might help me track down what’s going on, and there’s a debug flag, but I don’t see anywhere that the debug flag actually changes the logging level to include debug statements. I’m probably going to have to compile/run my own version to test that.
I need to head out for some errands, but I’ll try to look at it later
m
ah mmk - I appreciate you looking further into it. hopefully we can hear something from them
h
Well, this is interesting!
time="2024-11-27T22:44:52Z" level=warning msg="waitForVolumeSettled: error while waiting for volume pvc-558bc901-e048-49c3-a302-851512a7def9 to be settled. Err: <http://volumes.longhorn.io|volumes.longhorn.io> \"pvc-558bc901-e048-49c3-a302-851512a7def9\" is forbidden: User \"system:serviceaccount:default:test\" cannot get resource \"volumes\" in API group \"<http://longhorn.io|longhorn.io>\" in the namespace \"longhorn-system\""
Could this just be a permissions problem?
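A quick way to test that theory without changing anything is to impersonate the service account from the error and ask the API server directly - this is against the harvester cluster, since that's where the longhorn.io API lives:

```bash
# Prints "no" while the RBAC rule is missing, "yes" once it's in place
kubectl auth can-i get volumes.longhorn.io \
  --namespace longhorn-system \
  --as system:serviceaccount:default:test
```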
m
possibly, but why would network isolation cause a permissions issue? my node that crashed lost its default local user completely.
h
That’s it! Adding the permissions to the ClusterRole on harvester caused it to start working again!
Now, the question becomes how that happens in a network partition situation.
m
heh, baby steps. I'll take a look at those on mine
h
Do you have a harvester cluster that is still working?
m
no
h
OK, I’ve got a friend who probably does
m
which permissions did you set?
h
So, the service account `test` in the `default` namespace is the account running the VMs in the rancher cluster. The clusterrolebinding `default-test` binds that SA to `ClusterRole/harvesterhci.io:csi-driver`.
The contents of the cluster role were
I added
- apiGroups:
  - longhorn.io
  resources:
  - volumes
  verbs:
  - get
  - list
  - watch
to the bottom and it started working
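If it's useful, the same rule can be applied non-interactively with a JSON patch instead of hand-editing the ClusterRole (run against the harvester cluster; double-check the role name matches yours first):

```bash
kubectl patch clusterrole harvesterhci.io:csi-driver --type=json -p='[
  {"op": "add", "path": "/rules/-", "value": {
    "apiGroups": ["longhorn.io"],
    "resources": ["volumes"],
    "verbs": ["get", "list", "watch"]
  }}
]'
```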
m
thanks again! I'll see if I can use it to restore my stuff
h
That fixed all my clusters at once, since they are all bound to the same ClusterRole
The helm chart that owns that ClusterRole hasn’t been applied since September 7th.
The workaround is to update the RBAC
BTW, you have to undo this fix to get the upgrade to 1.4.0 to run