# rke2
f
Hi @here I’m experiencing provisioning issues with ***RKE2 clusters*** on the following distributions:
• openSUSE Leap 15.6
• openSUSE Leap 16
• openSUSE Leap Micro 6
Clusters are being provisioned via Rancher on Harvester.

***Observed Behavior***
1. Slow cluster availability. Machines are created successfully, but clusters take an unusually long time to become ready:
• >1 hour when using Calico, Cilium, or Canal
• 30–40 minutes when using Flannel
2. Recurring update loop. Once available, some clusters enter an ***Updating*** state every day for 1–2 hours. During this time, the clusters are completely ***inaccessible***.

***Expected Behavior***
1. Clusters should become available in a reasonable timeframe after node creation.
2. Clusters should remain stable and not fall into a daily updating/inaccessible cycle.
c
... have you looked at the logs and pod states to see what's actually going on during these time periods? Sounds like under-resourced nodes, tbh. What is the CPU/memory for these nodes? How much disk, and of what type?
Also, there are 2.5k people in this channel, why would you try to @here that many people?
f
Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown, waiting for probes: etcd, kube-apiserver, kube-controller-manager
I'm sorry for the '@here', but I'm used to Discord
image.png
the nodes have 31 GB of disk free
```yaml
namespace: default
resourceVersion: '31587430'
uid: c6683a4d-2023-4bbe-a3fc-19d3d8e785f6
reason: InvalidDiskCapacity
reportingComponent: kubelet
```
these are straightforward, just click Create, fill in the form, and next-next-next using the default values
c
Go log into the node and look at the actual rke2 service logs, kubelet log, control plane pod logs, etc. Node conditions tell you basically nothing.
👍 1
Also, 2 cores and 4 GB is barely enough to run a control plane node. You really should have more than that. Do you want to run any workloads or anything at all?
👍 1
f
Thanks, I will provide this as soon as I'm at my PC
f
Hello @creamy-pencil-82913. When viewing the logs, we noticed that Calico isn't starting. Upon further investigation, we found that the calico-kube-controller was failing due to a taint on the master node:
```
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
```
however, we are not setting this taint during the deployment via TFR2P. Any ideas why this happens?
c
… did you disable the built-in cloud controller manager? or does it perhaps not have enough resources to run?
that taint will remain until a ccm runs and initializes the node
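For reference, a quick way to check for the taint and the CCM state from a node with kubectl access (a sketch; the pod name below is a placeholder to be copied from the `get pods` output):
```sh
# List each node's taints to see whether the uninitialized taint is still present
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Check whether the Harvester CCM pod is running, then inspect its logs
kubectl -n kube-system get pods | grep -i cloud-provider
kubectl -n kube-system logs <harvester-cloud-provider-pod>
```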
f
@creamy-pencil-82913 the harvester cluster has plenty of resources available, with the only bottleneck being HDD drives. we're using the TFR2P examples for deploying the RKE2 child cluster using Harvester as infra and cloud provider:
```hcl
machine_selector_config {
  config = jsonencode({
    cloud-provider-config : local_file.harvester-kube-config.content
    cloud-provider-name : "harvester"
  })
}
```
c
if you have plenty of resources then why are you only giving your nodes 2 cores?
🤣 1
f
our Rancher version is 2.12 and Harvester is 1.5.1 and we're using version 8 of the provider
that was our initial testing with the code
c
ok, if you’re using the harvester cloud provider then the harvester CCM should be running to initialize the nodes. if it’s not getting deployed, or is failing to run, then see if the logs suggest why.
f
we destroyed that cluster and deployed a new one with 8 cores
c
you need a CCM one way or another, to clear the uninitialized taint
f
so, using the above code results in a cluster configuration that uses Harvester as the cloud provider in the Rancher GUI. I am afraid I don't fully understand your point.
c
like I said, go look on the nodes to see why the harvester cloud provider isn’t functional
is the harvester ccm pod running? are there errors in its logs?
this would not be related to whatever instability you were seeing earlier though. I’m having a hard time really tracking what’s going on here - you said you deleted the clusters that were having the initial problems with instability, and now you’ve got a new cluster with a totally different problem?
f
Let me make it more clear: we provision a new cluster, it takes 1–3 hours to come up, and then we face different issues
the clusters become inaccessible
c
The problems are probably related?
Figure out what it’s doing for those hours that it is taking to come up. Whatever is making it take that long, is probably the same thing that makes it unstable later.
You are running your Harvester cluster on SSD or NVMe with 10Gb Ethernet between nodes, right? No rotational storage?
🤐 1
f
uhm... rotational storage 😞
c
that is not supported even for dev environments
that is way too slow to be running longhorn on, no wonder everything takes forever and is unstable
👍 2
f
the error we see in the harvester-cloud-provider pod is:
```
Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "test-cluster-allroles-fgqkm-lbsvp" not found
```
c
everything is going to run very slowly, and likely be unstable, if you don’t meet the basic system requirements
f
yes, we're planning to upgrade the hardware in the future, but right now it's not within our means
and you are correct in saying that it takes forever, but we can deploy a cluster manually using the GUI and it is generally ready in less than 30 minutes.
f
when we deploy the VMs manually and install RKE2 manually, everything is up in ±30 min
c
on capable hardware it should be up within minutes. As part of dev work I regularly spin up multi-node clusters (3 etcd, 2 cp, 1 worker) all running on a single physical host with 64 GB of RAM, 16 cores, and a single enterprise NVMe drive - and they are all running within 3-4 minutes.
f
ok, but does this translate to the issue we are facing with the CCM?
```
tofu-test-allroles-fgqkm-lbsvp:~ # /var/lib/rancher/rke2/bin/kubectl logs harvester-cloud-provider-78d55bc78d-psfz4 -n kube-system
I0922 20:35:31.885816       1 serving.go:348] Generated self-signed cert in-memory
W0922 20:35:31.885910       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
W0922 20:35:32.246871       1 main.go:84] detected a cluster without a ClusterID.  A ClusterID will be required in the future.  Please tag your cluster to avoid any future issues
I0922 20:35:32.246900       1 controllermanager.go:152] Version: v0.0.0-master+$Format:%H$
I0922 20:35:32.248191       1 secure_serving.go:213] Serving securely on [::]:10258
I0922 20:35:32.248297       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0922 20:35:32.248504       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0922 20:35:55.023176       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0922 20:35:55.023385       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="tofu-test-allroles-fgqkm-lbsvp_d6cf7c8f-962f-4284-a1bc-27d502ec0875 became leader"
time="2025-09-22T20:35:55Z" level=info msg="start watching virtual machine instance" controller=harvester-cloudprovider-resync-topology namespace=default
W0922 20:35:55.080327       1 core.go:111] --configure-cloud-routes is set, but cloud provider does not support routes. Will not configure cloud provider routes.
W0922 20:35:55.080338       1 controllermanager.go:299] Skipping "route"
I0922 20:35:55.080613       1 controllermanager.go:311] Started "cloud-node"
I0922 20:35:55.080779       1 controllermanager.go:311] Started "cloud-node-lifecycle"
I0922 20:35:55.080828       1 node_controller.go:157] Sending events to api server.
I0922 20:35:55.080884       1 node_controller.go:166] Waiting for informer caches to sync
I0922 20:35:55.080940       1 node_lifecycle_controller.go:113] Sending events to api server
I0922 20:35:55.081035       1 controllermanager.go:311] Started "service"
I0922 20:35:55.081167       1 controller.go:227] Starting service controller
I0922 20:35:55.081248       1 shared_informer.go:270] Waiting for caches to sync for service
I0922 20:35:55.181595       1 shared_informer.go:277] Caches are synced for service
E0922 20:35:55.360526       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
time="2025-09-22T20:35:55Z" level=info msg="Starting kubevirt.io/v1, Kind=VirtualMachineInstance controller"
time="2025-09-22T20:35:55Z" level=info msg="Starting /v1, Kind=Service controller"
time="2025-09-22T20:35:55Z" level=info msg="Starting /v1, Kind=Node controller"
E0922 20:37:59.542014       1 leaderelection.go:367] Failed to update lock: Put "https://10.43.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0922 20:40:55.393928       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:45:55.447995       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:50:55.478969       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 20:55:55.512430       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:00:55.545766       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:05:55.577520       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
E0922 21:10:55.613498       1 node_controller.go:258] Error getting instance metadata for node addresses: virtualmachines.kubevirt.io "tofu-test-allroles-fgqkm-lbsvp" not found
```
c
I don’t know. Everything is going to run slow when you don’t meet requirements. That is very likely to include creation of resources in the harvester cluster that need to exist before the downstream clusters can finish initializing.
f
VMs are up in 3 min and everything is provisioned, including the disks in Longhorn
c
you might look on the harvester cluster and see what is preventing the virtualmachines.kubevirt.io resources from being created. I suspect it is just running very slowly and/or crashing due to poor datastore performance.
f
the VMs come up in 5 minutes, we are able to SSH into them
let us check the harvester cluster
f
the cluster is finally provisioned, but we had to manually patch each node using the following:
```sh
kubectl patch node tofu-test-allroles-fgqkm-xxxxx -p '{"spec":{"providerID":"harvester://harvester-public/tofu-test-allroles-fgqkm-xxxxx"}}'
```
once this was done, the harvester-cloud-provider was able to complete whatever was pending and finally all nodes joined the cluster.
my question now is: why does it need the `providerID` field? shouldn't this be computed from the values that are used in machine_config_v2?
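For completeness, a sketch of checking which nodes still lack a providerID before applying the patch above (the node name is a placeholder; harvester-public is the namespace from the example):
```sh
# Nodes without a providerID show <none>; those are the ones that need patching
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDERID:.spec.providerID'

# Patch each remaining node (same form as the command above)
kubectl patch node <node-name> \
  -p '{"spec":{"providerID":"harvester://harvester-public/<node-name>"}}'
```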
c
that would be a better question for #C01GKHKAG0K. I am not a harvester dev.
f
I disagree; we're talking about the rancher2 Terraform provider, which does not interface with Harvester on its own, but uses the built-in RKE2.
on the other hand, it makes sense to ask this question in the terraform provider channel 🙂
f
on the other hand, we don't face these provisioning issues with Ubuntu images
c
Rancher is not RKE2 either. If you have questions about Harvester stuff, ask in the Harvester channel.
👍 1
f
Thanks a lot for your support @creamy-pencil-82913, we will reply back here if we face any issues with rke2
f
Hello @creamy-pencil-82913! The issue we are facing here is that the namespace is not included in the kubeconfig file we generate using the `.../<cluster name>?action=generateKubeconfig` endpoint. Because of this, the harvester-cloud-provider pod searches for but fails to find the child cluster nodes, so Calico, the rke2 agent, and a bunch of other pods are not able to initialize/run properly.

Is there a way to include the namespace in the kubeconfig file? I tried both the Terraform http provider to generate the file and curl manually; both times I supplied the namespace in the request body, along with the service account. In neither case does the resulting kubeconfig contain the namespace.

P.s. The service account has the relevant cluster role binding.

Regards, Ronald
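For context, an illustrative curl form of that request (it mirrors the Terraform call shown later in the thread; the URL, cluster ID, and token are placeholders):
```sh
# POST to Rancher's generateKubeconfig action, supplying the namespace in the body
curl -s -X POST "https://rancher.domain.com/v3/clusters/c-j9pl8?action=generateKubeconfig" \
  -H "Authorization: Bearer ${RANCHER_TOKEN}" \
  -H "Accept: application/json" \
  -d '{"clusterRoleName":"harvesterhci.io:cloudprovider","namespace":"harvester-public","serviceAccountName":"tofu-test"}'
```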
c
... the namespace of what? Where in the kubeconfig would you put a namespace?
f
In `contexts[0].context`
f
hi @creamy-pencil-82913, the idea is that when we provision new clusters and a new kubeconfig is generated, the namespace is missing
maybe you can suggest where we should open an issue
rancher / harvester / here, or some GitHub repo, idk
c
That'd be harvester...
All I see is the path setting in rancher, can you show what's actually missing from the contents?
f
@few-appointment-23216
f
Hello @creamy-pencil-82913 and sorry for the late reply. In the kubeconfig file I generate using the following Terraform code:
```hcl
data "http" "kubeconfig" {
  url    = "https://rancher.domain.com/v3/clusters/c-j9pl8?action=generateKubeconfig"
  method = "POST"
  request_headers = {
    Authorization = "Bearer ${var.token}"
    Accept        = "application/json"
  }
  request_body = jsonencode({
    "clusterRoleName"    = "harvesterhci.io:cloudprovider"
    "namespace"          = "harvester-public"
    "serviceAccountName" = "tofu-test"
  })
}

resource "local_file" "harvester-kube-config" {
  filename = "${path.module}/tofu-test-kubeconfig"
  content  = jsondecode(data.http.kubeconfig.response_body).config
}
```
which results in the following content:
```yaml
apiVersion: v1
kind: Config
clusters:
- name: "core"
  cluster:
    server: "https://rancher.domain.com/k8s/clusters/c-j9pl8"

users:
- name: "core"
  user:
    token: "kubeconfig-user-1ff2gx566s:hf9vftgtq5<REDACTED>d2tglp8m8b8vq26"

contexts:
- name: "core"
  context:
    user: "core"
    cluster: "core"

current-context: "core"
```
However, the Harvester CCM requires the namespace, which I added manually under `contexts[0].context.namespace` with the value `harvester-public`. Once I did this, the Harvester CCM pod (harvester-cloud-provider) was able to locate the VM and properly initialize it. So, my question is this: is there a way to include the namespace information in the generated kubeconfig? I also tried all of the above using `curl` and the result was the same (missing namespace information).
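For reference, that manual fix can also be scripted instead of editing the file by hand; a minimal sketch, assuming the generated file and the `core` context shown above:
```sh
# Write the namespace into the existing "core" context of the generated kubeconfig
kubectl config set-context core \
  --namespace=harvester-public \
  --kubeconfig=./tofu-test-kubeconfig
```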
c
I don’t think that’s the expected way to provide a kubeconfig for the harvester CCM, why are you manually doing that with terraform?
f
That's per the provider's documentation/example of how to deploy a child RKE2 cluster using Rancher and Harvester as cloud provider.
c
that seems backwards, this is generating a kubeconfig for the downstream cluster but it’ll go through rancher to get there - so despite it talking to the cluster the CCM is running in, it’s got to round-trip through the rancher manager cluster to do so. Why not just use a serviceaccount?
Or is it talking to the rancher local cluster? I’ll see if I can get one of the Harvester folks to take a look
👍 1
f
I am, notice the `serviceAccountName = tofu-test` in the request body
c
It does not look to me like rancher even has a namespace field in the struct it uses to generate the kubeconfig. so no, I don’t think this is something that you’re going to be able to get from Rancher. https://github.com/rancher/rancher/blob/main/pkg/kubeconfig/kubeconfig.go#L56-L60
I will note that it says:
```sh
curl -sfL https://raw.githubusercontent.com/harvester/cloud-provider-harvester/master/deploy/generate_addon.sh | bash -s <serviceaccount name> <namespace>
```
> You must specify the namespace in which the guest cluster will be created.
To me that says it is expected that you have to manually specify the namespace.
But also, this does not use TF or the rancher API, only kubectl, which seems like a much more reasonable approach than what you had above.
f
the idea behind using Terraform was to have it manage the entire lifecycle of the cluster, as well as the resources/files that are required to deploy it
the example they are providing there uses `curl` with some env vars
image.png
c
Ok well I would probably follow the harvester docs instead of the opentofu docs
cc @bumpy-tomato-36167 in case this is something you're familiar with
b
Yeah, this was already brought up in the #C07M052K9D0 channel. Since this is what they are getting from a simple curl I can't see how it is Terraform related. I also suggested generating a service account. I think in the end this is going to need attention from the Harvester folks.
This is almost definitely an opportunity to improve the documentation either way. Once a solution is found I will make sure to get the proper team involved in getting the docs updated.
c
I think the answer is just “no, you need to add the namespace yourself”. It appears that there is no other way to pass the ns to the harvester CCM, it is just hardcoded to read it from the kubeconfig.
b
@few-appointment-23216, you originally created this issue to address this problem, right? You closed it due to not being Terraform related, but if there is more that we can do to help please feel free to open a new one with more details.
f
Hello @bumpy-tomato-36167! Yes, that was me and I still think it is not a Terraform provider issue. From @creamy-pencil-82913's feedback it looks more like an issue with the Harvester CCM. What do you guys think?
c
I think you could perhaps make the case for an enhancement request in allowing the namespace to be configured via CLI arg or env var instead of only via kubeconfig, but I think as far as harvester team is concerned it is working as designed - the kubeconfig IS their config.