# harvester
w
I have the same issue. It worked fine a few days ago, but now when I try to deploy an RKE2 cluster it just stalls. K3s works fine.
b
Yes, and also, we have only NVMe disks; the storage network is 10 Gbit.
*From what I know, K3s doesn't use etcd by default.
w
true
Could it be an issue with the etcd Docker image?
b
I really don't know. I suspect it's a problem with Longhorn, but before that, we should investigate whether disk operations have huge delays.
I observe these problems when I deploy a few hundred pods; when the cluster is idle, we don't observe the problem.
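(For reference, one way to check whether disk operations have the kind of latency that hurts etcd is an fsync-heavy fio run, roughly what the etcd documentation suggests. This is only a sketch; the target directory is a placeholder and should point at the disk that actually backs etcd.)

```sh
# Sketch: measure fdatasync latency on the disk backing etcd.
# etcd's guidance is that the 99th percentile of fdatasync durations
# should stay well under ~10 ms. The directory below is a placeholder.
fio --name=etcd-fsync-test \
    --directory=/var/lib/rancher/rke2/server/db \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
```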
w
In my case I have only one cluster at this point and SSD disks with almost no load, so I don't think it is a performance issue.
👍 1
b
if you have time next week, we can do some pair programming to try to investigate the root cause of the problem
w
When drilling down into the errors, I found in the end that the cause has to do with the storage provider (Harvester/Longhorn) that is supposed to serve the node: it requires the nodes to reach the internal VIP address, which is not accessible from the public IPs. Does anyone know how to set this up correctly? If I add more than one network card, only the first one gets an IP, which means the node has no external IP.
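(One thing that can help with multi-NIC nodes, sketched under the assumption that RKE2 is configured through /etc/rancher/rke2/config.yaml: explicitly telling the node which address to register internally and which to advertise externally, so storage traffic can reach the internal VIP while the node still has an external IP. The addresses here are placeholders, not verified against this particular setup.)

```yaml
# /etc/rancher/rke2/config.yaml -- sketch, addresses are placeholders
node-ip: 10.0.0.11              # NIC on the network that can reach the Harvester/Longhorn VIP
node-external-ip: 203.0.113.11  # public-facing address the node should advertise
```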
b
Hmm, then why is the error transient?
w
That is a very good question. I started a cluster, and this is what I found after spending a lot of time troubleshooting. I guess it has to do with some form of automatic DHCP that works some of the time and not others.
Could it have been an update to the etcd package, or Canal, or anything else?
It's not consistent for me even with this information. The cluster is not reaching every node, and they are stuck waiting for etcd.
b
I've noticed that the kube API is sometimes down or times out when it is under heavy use (e.g., when deploying hundreds of pods).
g
A support bundle and an issue with the details would be useful to try to figure out what is going on.
👍 1
w
I think I have found the reason for the transient errors. The default lease time in the DHCP server was five minutes, which means the restart sometimes happens after more than five minutes and the node can't connect anymore. The IP assignment strategy could be described better in the documentation. I hope this lesson is useful for anyone else. I spent my weekend digging into the logs to find this one :)
This is my Rancher diagnostics package.
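(For anyone hitting the same DHCP lease issue: a minimal sketch of the DHCP side of the fix, assuming an ISC dhcpd-style server; the MAC address and IPs are placeholders. Raising the lease time well above five minutes, or pinning the cluster nodes to fixed addresses, keeps a node's IP from changing across a restart.)

```
# dhcpd.conf -- sketch, values are placeholders
default-lease-time 43200;   # 12 hours instead of 5 minutes
max-lease-time 86400;

host harvester-node-1 {
  hardware ethernet aa:bb:cc:dd:ee:01;
  fixed-address 10.0.0.11;
}
```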