# rke2
f
Hi @here I’m experiencing provisioning issues with ***RKE2 clusters*** on the following distributions:
• openSUSE Leap 15.6
• openSUSE Leap 16
• openSUSE Leap Micro 6
Clusters are being provisioned via Rancher on Harvester.

***Observed Behavior***
1. Slow cluster availability. Machines are created successfully, but clusters take an unusually long time to become ready:
• >1 hour when using Calico, Cilium, or Canal
• 30–40 minutes when using Flannel
2. Recurring update loop. Once available, some clusters enter an ***Updating*** state every day for 1–2 hours. During this time, the clusters are completely ***inaccessible***.

***Expected Behavior***
1. Clusters should become available in a reasonable timeframe after node creation.
2. Clusters should remain stable and not fall into a daily updating/inaccessible cycle.
c
... have you looked at the logs and pod states to see what's actually going on during these time periods? Sounds like under-resourced nodes, tbh. What is the CPU/memory for these nodes? How much disk, and of what type?
Also, there are 2.5k people in this channel, why would you try to @here that many people?
f
Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown, waiting for probes: etcd, kube-apiserver, kube-controller-manager
I'm sorry for the '@here', but I'm used to Discord
image.png
the nodes have 31 GB of disk free
```yaml
namespace: default
resourceVersion: '31587430'
uid: c6683a4d-2023-4bbe-a3fc-19d3d8e785f6
reason: InvalidDiskCapacity
reportingComponent: kubelet
```
these are straightforward, just click Create, fill in the form, and next-next-next using the default values
c
Go log into the node and look at the actual rke2 service logs, kubelet log, control plane pod logs, etc. Node conditions tell you basically nothing.
👍 1
Also, 2 cores and 4 GB is barely enough to run a control plane node. You really should have more than that. Do you want to run any workloads or anything at all?
👍 1
f
Thanks, I will provide this as soon as I'm at my PC
f
Hello @creamy-pencil-82913. When viewing the logs, we noticed that Calico isn't starting. Upon further investigation, we found that the calico-kube-controller was failing due to a taint on the master node:
```
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
```
however, we are not setting this taint during the deployment via TFR2P. Any ideas why this happens?
c
… did you disable the built-in cloud controller manager? or does it perhaps not have enough resources to run?
that taint will remain until a ccm runs and initializes the node
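For reference, a quick way to check for the taint and the CCM state from a node with kubectl access (a sketch; the pod name below is a placeholder to be copied from the `get pods` output):
```sh
# List each node's taints to see whether the uninitialized taint is still present
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Check whether the Harvester CCM pod is running, then inspect its logs
kubectl -n kube-system get pods | grep -i cloud-provider
kubectl -n kube-system logs <harvester-cloud-provider-pod>
```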
f
@creamy-pencil-82913 the harvester cluster has plenty of resources available, with the only bottleneck being HDD drives. we're using the TFR2P examples for deploying the RKE2 child cluster using Harvester as infra and cloud provider:
```hcl
machine_selector_config {
  config = jsonencode({
    cloud-provider-config : local_file.harvester-kube-config.content
    cloud-provider-name : "harvester"
  })
}
```
c
if you have plenty of resources then why are you only giving your nodes 2 cores?
🤣 1
f
our Rancher version is 2.12 and Harvester is 1.5.1 and we're using version 8 of the provider
that was our initial testing with the code
c
ok, if you’re using the harvester cloud provider then the harvester CCM should be running to initialize the nodes. if it’s not getting deployed, or is failing to run, then see if the logs suggest why.
f
we destroyed that cluster and deployed a new one with 8 cores
c
you need a CCM one way or another, to clear the uninitialized taint
f
so, using the above code results in a cluster configuration that uses Harvester as the cloud provider in the Rancher GUI. I am afraid I don't fully understand your point.
c
like I said, go look on the nodes to see why the harvester cloud provider isn’t functional
is the harvester ccm pod running? are there errors in its logs?
this would not be related to whatever instability you were seeing earlier though. I’m having a hard time really tracking what’s going on here - you said you deleted the clusters that were having the initial problems with instability, and now you’ve got a new cluster with a totally different problem?
f
Let me make it more clear: we provision a new cluster, it takes 1–3 hours to come up, and then we face different issues
the clusters become inaccessible
c
The problems are probably related?
Figure out what it’s doing for those hours that it is taking to come up. Whatever is making it take that long, is probably the same thing that makes it unstable later.
You are running your Harvester cluster on SSD or NVMe with 10Gb Ethernet between nodes, right? No rotational storage?
🤐 1
f
uhm... rotational storage 😞
c
that is not supported even for dev environments
that is way too slow to be running longhorn on, no wonder everything takes forever and is unstable
👍 2
f
the error we see in the harvester-cloud-provider pod is:
```
Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "test-cluster-allroles-fgqkm-lbsvp" not found
```
c
everything is going to run very slowly, and likely be unstable, if you don’t meet the basic system requirements
f
yes, we're planning to upgrade the hardware in the future, but right now it's not within our means
and you are correct in saying that it takes forever, but we can deploy a cluster manually using the GUI and it is generally ready in less than 30 minutes.
f
when we deploy the VMs manually and install RKE2 manually, everything is up in ±30 min
c
on capable hardware it should be up within minutes. As part of dev work I regularly spin up multi-node clusters (3 etcd, 2 cp, 1 worker) all running on a single physical host with 64 GB of RAM, 16 cores, and a single enterprise NVMe drive - and they are all running within 3-4 minutes.
f
ok, but does this translate to the issue we are facing with the CCM?
```
tofu-test-allroles-fgqkm-lbsvp:~ # /var/lib/rancher/rke2/bin/kubectl logs harvester-cloud-provider-78d55bc78d-psfz4 -n kube-system
I0922 20:35:31.885816       1 serving.go:348] Generated self-signed cert in-memory
W0922 20:35:31.885910       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
W0922 20:35:32.246871       1 main.go:84] detected a cluster without a ClusterID.  A ClusterID will be required in the future.  Please tag your cluster to avoid any future issues
I0922 20:35:32.246900       1 controllermanager.go:152] Version: v0.0.0-master+$Format:%H$
I0922 20:35:32.248191       1 secure_serving.go:213] Serving securely on [::]:10258
I0922 20:35:32.248297       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0922 20:35:32.248504       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0922 20:35:55.023176       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0922 20:35:55.023385       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="tofu-test-allroles-fgqkm-lbsvp_d6cf7c8f-962f-4284-a1bc-27d502ec0875 became leader"
time="2025-09-22T20:35:55Z" level=info msg="start watching virtual machine instance" controller=harvester-cloudprovider-resync-topology namespace=default
W0922 20:35:55.080327       1 core.go:111] --configure-cloud-routes is set, but cloud provider does not support routes. Will not configure cloud provider routes.
W0922 20:35:55.080338       1 controllermanager.go:299] Skipping "route"
I0922 20:35:55.080613       1 controllermanager.go:311] Started "cloud-node"
I0922 20:35:55.080779       1 controllermanager.go:311] Started "cloud-node-lifecycle"
I0922 20:35:55.080828       1 node_controller.go:157] Sending events to api server.
I0922 20:35:55.080884       1 node_controller.go:166] Waiting for informer caches to sync
I0922 20:35:55.080940       1 node_lifecycle_controller.go:113] Sending events to api server
I0922 20:35:55.081035       1 controllermanager.go:311] Started "service"
I0922 20:35:55.081167       1 controller.go:227] Starting service controller
I0922 20:35:55.081248       1 shared_informer.go:270] Waiting for caches to sync for service
I0922 20:35:55.181595       1 shared_informer.go:277] Caches are synced for service
E0922 20:35:55.360526       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
time="2025-09-22T20:35:55Z" level=info msg="Starting kubevirt.io/v1, Kind=VirtualMachineInstance controller"
time="2025-09-22T20:35:55Z" level=info msg="Starting /v1, Kind=Service controller"
time="2025-09-22T20:35:55Z" level=info msg="Starting /v1, Kind=Node controller"
E0922 20:37:59.542014       1 leaderelection.go:367] Failed to update lock: Put "https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0922 20:40:55.393928       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:45:55.447995       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:50:55.478969       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:55:55.512430       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:00:55.545766       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:05:55.577520       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:10:55.613498       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
```
c
I don’t know. Everything is going to run slow when you don’t meet requirements. That is very likely to include creation of resources in the harvester cluster that need to exist before the downstream clusters can finish initializing.
f
VMs are up in 3 min and everything is provisioned, including the disks in Longhorn
c
you might look on the harvester cluster and see what is preventing the virtualmachines.kubevirt.io resources from being created. I suspect it is just running very slowly and/or crashing due to poor datastore performance.
f
the VMs come up in 5 minutes, we are able to SSH into them
let us check the harvester cluster
f
the cluster is finally provisioned, but we had to manually patch each node using the following:
```sh
kubectl patch node tofu-test-allroles-fgqkm-xxxxx -p '{"spec":{"providerID":"harvester://harvester-public/tofu-test-allroles-fgqkm-xxxxx"}}'
```
once this was done, the harvester-cloud-provider was able to complete whatever was pending and finally all nodes joined the cluster.
my question now is: why does it need the `providerID` field? shouldn't this be computed from the values that are used in machine_config_v2?
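For completeness, a sketch of checking which nodes still lack a providerID before applying the patch above (the node name is a placeholder; harvester-public is the namespace from the example):
```sh
# Nodes without a providerID show <none>; those are the ones that need patching
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDERID:.spec.providerID'

# Patch each remaining node (same form as the command above)
kubectl patch node <node-name> \
  -p '{"spec":{"providerID":"harvester://harvester-public/<node-name>"}}'
```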
c
that would be a better question for #C01GKHKAG0K. I am not a harvester dev.
f
I disagree; we're talking about the rancher2 Terraform provider, which does not interface with Harvester on its own, but uses the built-in RKE2.
on the other hand, it makes sense to ask this question in the terraform provider channel 🙂
f
on the other hand, we don't face these provisioning issues with Ubuntu images
c
Rancher is not RKE2 either. If you have questions about Harvester stuff, ask in the Harvester channel.
👍 1
f
Thanks a lot for your support @creamy-pencil-82913, we will reply back here if we face any issues with rke2
f
Hello @creamy-pencil-82913! The issue we are facing here is that the namespace is not included in the kubeconfig file we generate using the `.../<cluster name>?action=generateKubeconfig` endpoint. Because of this, the harvester-cloud-provider pod searches for but fails to find the child cluster nodes, so Calico, the rke2 agent, and a bunch of other pods are not able to initialize/run properly.

Is there a way to include the namespace in the kubeconfig file? I tried both the Terraform http provider to generate the file and curl manually; both times I supplied the namespace in the request body, along with the service account. In neither case does the resulting kubeconfig contain the namespace.

P.s. The service account has the relevant cluster role binding.

Regards, Ronald
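For context, an illustrative curl form of that request (it mirrors the Terraform call shown later in the thread; the URL, cluster ID, and token are placeholders):
```sh
# POST to Rancher's generateKubeconfig action, supplying the namespace in the body
curl -s -X POST "https://rancher.domain.com/v3/clusters/c-j9pl8?action=generateKubeconfig" \
  -H "Authorization: Bearer ${RANCHER_TOKEN}" \
  -H "Accept: application/json" \
  -d '{"clusterRoleName":"harvesterhci.io:cloudprovider","namespace":"harvester-public","serviceAccountName":"tofu-test"}'
```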
c
... the namespace of what? Where in the kubeconfig would you put a namespace?
f
In `contexts[0].context`
f
hi @creamy-pencil-82913, the idea is that when we provision new clusters and a new kubeconfig is generated, the namespace is missing
maybe you can suggest where we should open an issue
rancher / harvester / here, or some GitHub repo, idk
c
That'd be harvester...
All I see is the path setting in rancher, can you show what's actually missing from the contents?
f
@few-appointment-23216
f
Hello @creamy-pencil-82913 and sorry for the late reply. In the kubeconfig file I generate using the following Terraform code:
```hcl
data "http" "kubeconfig" {
  url    = "https://rancher.domain.com/v3/clusters/c-j9pl8?action=generateKubeconfig"
  method = "POST"
  request_headers = {
    Authorization = "Bearer ${var.token}"
    Accept        = "application/json"
  }
  request_body = jsonencode({
    "clusterRoleName"    = "harvesterhci.io:cloudprovider"
    "namespace"          = "harvester-public"
    "serviceAccountName" = "tofu-test"
  })
}

resource "local_file" "harvester-kube-config" {
  filename = "${path.module}/tofu-test-kubeconfig"
  content  = jsondecode(data.http.kubeconfig.response_body).config
}
```
which results in the following content:
```yaml
apiVersion: v1
kind: Config
clusters:
- name: "core"
  cluster:
    server: "https://rancher.domain.com/k8s/clusters/c-j9pl8"

users:
- name: "core"
  user:
    token: "kubeconfig-user-1ff2gx566s:hf9vftgtq5<REDACTED>d2tglp8m8b8vq26"

contexts:
- name: "core"
  context:
    user: "core"
    cluster: "core"

current-context: "core"
```
However, the Harvester CCM requires the namespace, which I added manually under `contexts[0].context.namespace` with the value `harvester-public`. Once I did this, the Harvester CCM pod (harvester-cloud-provider) was able to locate the VM and properly initialize it. So, my question is this: is there a way to include the namespace information in the generated kubeconfig? I also tried all of the above using `curl` and the result was the same (missing namespace information).
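For reference, that manual fix can also be scripted instead of editing the file by hand; a minimal sketch, assuming the generated file and the `core` context shown above:
```sh
# Write the namespace into the existing "core" context of the generated kubeconfig
kubectl config set-context core \
  --namespace=harvester-public \
  --kubeconfig=./tofu-test-kubeconfig
```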
c
I don’t think that’s the expected way to provide a kubeconfig for the harvester CCM, why are you manually doing that with terraform?
f
That's per the provider's documentation/example of how to deploy a child RKE2 cluster using Rancher and Harvester as cloud provider.
c
that seems backwards, this is generating a kubeconfig for the downstream cluster but it’ll go through rancher to get there - so despite it talking to the cluster the CCM is running in, it’s got to round-trip through the rancher manager cluster to do so. Why not just use a serviceaccount?
Or is it talking to the rancher local cluster? I’ll see if I can get one of the Harvester folks to take a look
👍 1
f
I am, notice the `serviceAccountName = tofu-test` in the request body
c
It does not look to me like rancher even has a namespace field in the struct it uses to generate the kubeconfig. so no, I don’t think this is something that you’re going to be able to get from Rancher. https://github.com/rancher/rancher/blob/main/pkg/kubeconfig/kubeconfig.go#L56-L60
I will note that it says:
```sh
curl -sfL https://raw.githubusercontent.com/harvester/cloud-provider-harvester/master/deploy/generate_addon.sh | bash -s <serviceaccount name> <namespace>
```
> You must specify the namespace in which the guest cluster will be created.
To me that says it is expected that you have to manually specify the namespace.
But also, this does not use TF or the rancher API, only kubectl, which seems like a much more reasonable approach than what you had above.
f
the idea behind using Terraform was to have it manage the entire lifecycle of the cluster, as well as the resources/files that are required to deploy it
the example they are providing there uses `curl` with some env vars
image.png
c
Ok well I would probably follow the harvester docs instead of the opentofu docs
cc @bumpy-tomato-36167 in case this is something you're familiar with
b
Yeah, this was already brought up in the #C07M052K9D0 channel. Since this is what they are getting from a simple curl I can't see how it is Terraform related. I also suggested generating a service account. I think in the end this is going to need attention from the Harvester folks.
This is almost definitely an opportunity to improve the documentation either way. Once a solution is found I will make sure to get the proper team involved in getting the docs updated.
c
I think the answer is just “no, you need to add the namespace yourself”. It appears that there is no other way to pass the ns to the harvester CCM, it is just hardcoded to read it from the kubeconfig.
b
@few-appointment-23216, you originally created this issue to address this problem, right? You closed it due to not being Terraform related, but if there is more that we can do to help please feel free to open a new one with more details.
f
Hello @bumpy-tomato-36167! Yes, that was me and I still think it is not a Terraform provider issue. From @creamy-pencil-82913's feedback it looks more like an issue with the Harvester CCM. What do you guys think?
c
I think you could perhaps make the case for an enhancement request in allowing the namespace to be configured via CLI arg or env var instead of only via kubeconfig, but I think as far as harvester team is concerned it is working as designed - the kubeconfig IS their config.