# harvester
w
I have the same issue. It worked fine a few days ago, but now when I try to deploy an RKE2 cluster it just stalls. K3s works fine.
b
Yes, and also, we have only NVMe disks; the storage network is 10 Gbit.
*From what I know, K3s doesn't use etcd by default.
w
true
Could it be an issue with the etcd Docker image?
b
I really don't know. I suspect it's a problem with Longhorn, but before that, we should investigate whether disk operations have huge delays.
I observe these problems when I deploy a few hundred pods; when the cluster is idle, we don't observe the problem.
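(For reference, one way to check whether disk operations have the kind of latency that hurts etcd is an fsync-heavy fio run, roughly what the etcd documentation suggests. This is only a sketch; the target directory is a placeholder and should point at the disk that actually backs etcd.)

```sh
# Sketch: measure fdatasync latency on the disk backing etcd.
# etcd's guidance is that the 99th percentile of fdatasync durations
# should stay well under ~10 ms. The directory below is a placeholder.
fio --name=etcd-fsync-test \
    --directory=/var/lib/rancher/rke2/server/db \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
```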
w
In my case I have only one cluster at this point and SSD disks with almost no load, so I don't think it is a performance issue.
👍 1
b
if you have time next week, we can do some pair programming to try to investigate the root cause of the problem
w
When drilling down into the errors, I found in the end that the cause has to do with the storage provider (Harvester/Longhorn) that is supposed to serve the node: it requires the nodes to reach the internal VIP address, which is not accessible from the public IPs. Does anyone know how to set this up correctly? If I add more than one network card, only the first one gets an IP, which means the node has no external IP.
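(One thing that can help with multi-NIC nodes, sketched under the assumption that RKE2 is configured through /etc/rancher/rke2/config.yaml: explicitly telling the node which address to register internally and which to advertise externally, so storage traffic can reach the internal VIP while the node still has an external IP. The addresses here are placeholders, not verified against this particular setup.)

```yaml
# /etc/rancher/rke2/config.yaml -- sketch, addresses are placeholders
node-ip: 10.0.0.11              # NIC on the network that can reach the Harvester/Longhorn VIP
node-external-ip: 203.0.113.11  # public-facing address the node should advertise
```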
b
Hmm, then why is the error transient?
w
That is a very good question. I started a cluster, and this is what I found after spending a lot of time troubleshooting. I guess it has to do with some form of automatic DHCP that works some of the time and not others.
Could it have been an update to the etcd package, or Canal, or anything else?
It's not consistent for me even with this information. The cluster is not reaching every node, and they are stuck waiting for etcd.
b
I've noticed that the kube API is sometimes down or times out when it is under heavy use (e.g., when deploying hundreds of pods).
g
A support bundle and an issue with the details would be useful to try to figure out what is going on.
👍 1
w
I think I have found the reason for the transient errors. The default lease time in the DHCP server was five minutes, which means the restart sometimes happens after more than five minutes and the node can't connect anymore. The IP assignment strategy could be described better in the documentation. I hope this lesson is useful for anyone else. I spent my weekend digging into the logs to find this one :)
This is my Rancher diagnostics package.
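(For anyone hitting the same DHCP lease issue: a minimal sketch of the DHCP side of the fix, assuming an ISC dhcpd-style server; the MAC address and IPs are placeholders. Raising the lease time well above five minutes, or pinning the cluster nodes to fixed addresses, keeps a node's IP from changing across a restart.)

```
# dhcpd.conf -- sketch, values are placeholders
default-lease-time 43200;   # 12 hours instead of 5 minutes
max-lease-time 86400;

host harvester-node-1 {
  hardware ethernet aa:bb:cc:dd:ee:01;
  fixed-address 10.0.0.11;
}
```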