# rke2
a
c
Have you looked at the logs on the node to see why it’s not coming up? Just saying that the probes are failing doesn’t give anyone anything to work off of. Log in and see what’s wrong.
a
It mentions the kube certs directory, etc. not being found.
It does not even create the folder?
c
the directories are created when rke2 starts. Is it being installed successfully? Is the rke2-server service failing to start?
a
the /var/lib/rancher/rke folders are being set up, that's about it.
I'll need to see if the service is started...
Here is latest provision log: [INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
c
Check the rancher-system-agent logs to see if there are errors running the rke2 installer. Check the rke2-server logs to see if there are errors starting.
Also ensure that you have sufficient resources for all the pods to run. How many CPU cores and how much memory does this node have?
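For reference, a minimal sketch of pulling those logs on the node, assuming the standard systemd unit names from an rke2 / rancher-system-agent install:
journalctl -u rancher-system-agent --no-pager | tail -n 100
journalctl -u rke2-server --no-pager | tail -n 100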
a
Here is the service log:

systemctl status rke2-server.service
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
   Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2024-11-04 19:19:46 UTC; 2min 6s ago
     Docs: https://github.com/rancher/rke2#readme
 Main PID: 2138 (rke2)
    Tasks: 71
   Memory: 1.7G
   CGroup: /system.slice/rke2-server.service
           ├─2138 /usr/local/bin/rke2 server
           ├─2380 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
           ├─2583 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --cloud-provider=aws --cloud-config= --address=0.0.0.0 --anonymous-auth=false --authen>
           ├─2634 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id deba8d1639daad240ac8a2bc0ff22a40d191cbcef4de89d738354e99a1eafa39 -address /run/k3s/con>
           ├─2638 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3f06d3c42baa8eadfbc4a5f0921939288cd6287c3c3f988dfecd3f1127d4f9bb -address /run/k3s/con>
           └─2813 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id 6e01c045a32cc346b9b9c0d6e5f9dde4ec38231ad1e89503ace4b9c6c0e36321 -address /run/k3s/con>

Nov 04 19:21:22 ip-*****-195-49 rke2[2138]: time="2024-11-04T19:21:22Z" level=info msg="Defragmenting etcd database"
Nov 04 19:21:27 ip-*****-195-49 rke2[2138]: {"level":"warn","ts":"2024-11-04T19:21:27.611701Z","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed>
Nov 04 19:21:27 ip-**-195-49 rke2[2138]: time="2024-11-04T19:21:27Z" level=info msg="Failed to test data store connection: failed to report and disarm etcd alarms: etcd alarm list failed: context deadline >
Nov 04 19:21:29 ip-*****-195-49 rke2[2138]: time="2024-11-04T19:21:29Z" level=info msg="Pod for etcd is synced"
Nov 04 19:21:32 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:32Z" level=info msg="Defragmenting etcd database"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="etcd data store connection OK"
Nov 04 19:21:40 ip-*******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="Saving cluster bootstrap data to datastore"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="Waiting for API server to become available"
Nov 04 19:21:41 ip-*******-195-49 rke2[2138]: time="2024-11-04T19:21:41Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
The node is an m5.large.
Keep in mind, I have everything on the same node at this point for testing... (etcd, control-plane, worker)
c
are you able to kubectl describe node? The apiserver is up, so you should.
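For reference, a minimal sketch of running that against the local rke2 kubeconfig (paths are rke2 defaults; the bundled kubectl location is an assumption based on a standard install):
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl describe node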
a
I exported the kubeconfig to point to the rke2.yaml path
But, it still says connection to server 127.0.0.1:6443 was refused
c
ok, so apiserver is not running. Check the apiserver pod logs under /var/log/pods
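For reference, a rough way to find those logs, assuming the usual /var/log/pods layout (the directory names include the pod UID and will differ per node):
ls /var/log/pods/ | grep kube-apiserver
tail -n 100 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log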
a
I'll get the log here in a sec...
Starting with Kubernetes 1.23, you must deactivate the CSIMigrationAWS feature gate to use the in-tree AWS cloud provider. You can do this by setting feature-gates=CSIMigrationAWS=false as an additional argument for the cluster's Kubelet, Controller Manager, API Server and Scheduler in the advanced cluster configuration.
2024-11-04T19:37:53.45531168Z stderr F Error: invalid argument "CSIMigrationAWS=false" for "--feature-gates" flag: unrecognized feature gate: CSIMigrationAWS
looks like it's complaining about it??
I took out the configs it told me to put in...
Provision logs now say:
[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for probes: kube-controller-manager
in the kube-controller-manager pod logs.....
1 leaderelection.go:260] successfully acquired lease kube-system/kube-controller-manager
2024-11-04T19:50:42.984948375Z stderr F I1104 19:50:42.981921       1 event.go:307] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip-*****-195-49_bba5b434-1da8-4b2a-bf29-44b8c30abeec became leader"
2024-11-04T19:50:42.994929167Z stderr F E1104 19:50:42.994202       1 controllermanager.go:235] "Error building controller context" err="cloud provider could not be initialized: unknown cloud provider \"aws\""
c
You’re not running any of the affected versions. Don’t do any of that stuff. The AWS cloud provider is no longer built into Kubernetes and that feature gate is long removed.
a
So, don't do ANY of that stuff on the Rancher page?
c
Starting with Kubernetes 1.23 you would need to set that feature gate. Starting with 1.27 the cloud provider is completely removed from Kubernetes, so you can’t use it at all (nor can you enable that feature gate); you need to deploy the out-of-tree cloud provider as described in the first link.
a
I started Kube 1.28
so, just to be straight... I DO NOT do this step: "2. Rancher managed RKE2/K3s clusters don't support configuring providerID. However, the engine will set the node name correctly if the following configuration is set on the provisioning cluster object:"
or you can try using the UI if you prefer, as described further down the page.
a
Yes, I am trying to do the "out of tree" specified on the page.....
In Kubernetes 1.27 and later, you must use an out-of-tree AWS cloud provider.
c
yes. So don’t set cloud provider name or that csi migration feature gate.
a
So... you're saying... DO NOT do the stuff at the start of that page, only do the stuff at the bottom?
Lemme try that...
c
yes. those are instructions for older versions of Kubernetes. If you are on 1.27 or newer, set cloud provider in the Rancher UI to “External”, and then deploy the aws cloud provider helm chart.
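A hedged sketch of that second part with plain helm (the chart name and repo URL match the manifest quoted later in this thread; the repo alias is illustrative):
helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm repo update
helm upgrade --install aws-cloud-controller-manager aws-cloud-controller-manager/aws-cloud-controller-manager --namespace kube-system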
a
Yep, I've tried that option as well...
I have that on another cluster...
But, not sure it's set up all good just yet.
But, you're saying, when setting up a cluster from scratch, I need to do the following: Set this for the IMDSv2:
Set the Cloud Provider to "External"?
Will I also have to do the section for setting the overrides for Etcd/control-plane/worker still?
the section starting with "Override on Etcd:"
c
no, you don’t need to override anything
a
OK....building cluster now.
lets see how this one goes!
AND... Thank you soooo much in advance!
c
note that you will need to deploy the cloud provider before it’ll finish coming up. The CNI won’t deploy until the cloud provider is available.
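For context, a quick way to see that state from the node; the taint below is the standard one applied when the kubelet runs with an external cloud provider:
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml describe node | grep -A3 Taints
# expect node.cloudprovider.kubernetes.io/uninitialized until the cloud controller manager initializes the node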
a
You mean this step >> "3. Specify the aws-cloud-controller-manager Helm chart as an additional manifest to install:"
I'll need to do that using the .yaml editor, can't do it through the GUI... right?
I manually put that in the cluster while editing the .yaml file for it...
Now it says "[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-2mr4t-8dj7t: waiting for cluster agent to connect"
I put in all the settings verbatim from the install page...
Was there something that I needed to change from exactly what was stated on the page?
'''
spec:
  rkeConfig:
    additionalManifest: |-
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
'''
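If the helm-install job for that chart fails, a hedged way to pull its logs (assuming the helm-controller names the Job after the chart, as it normally does):
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml -n kube-system logs job/helm-install-aws-cloud-controller-manager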
I see this in the a
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=helm pod=helm-install-aws-cloud-controller-manager-wnvdx_kube-system(dcf32f16-71f5-41d1-9c9e-987aebefd4f9)\"" pod="kube-system/helm-install-aws-cloud-controller-manager-wnvdx" podUID="dcf32f16-71f5-41d1-9c9e-987aebefd4f9"
I'm trying this...
Undo what I just did with the cluster yaml file.
Went onto the ECW server and did a helm install.
restarting node now
in the aws cloud provider logs... I'm getting this: error syncing 'ip--195-20': failed to get provider ID for node ip-*-195-20 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
That is the IP of the only server I have in the cluster at this time
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get po -A
NAMESPACE       NAME                                                     READY   STATUS      RESTARTS        AGE
cattle-system   cattle-cluster-agent-67646db497-xnfcl                    0/1     Pending     0               29m
kube-system     aws-cloud-controller-manager-7p6vk                       1/1     Running     2 (8m29s ago)   13m
kube-system     etcd-ip-*****-195-20                                     1/1     Running     1               29m
kube-system     helm-install-rke2-coredns-46thb                          0/1     Completed   0               29m
kube-system     helm-install-rke2-flannel-xlsbd                          0/1     Completed   0               29m
kube-system     helm-install-rke2-ingress-nginx-srz2x                    0/1     Pending     0               29m
kube-system     helm-install-rke2-metrics-server-q75vt                   0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-controller-crd-4dpht          0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-controller-d8sdt              0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-validation-webhook-d68c8      0/1     Pending     0               29m
kube-system     kube-apiserver-ip-******-195-20                          1/1     Running     2               29m
kube-system     kube-controller-manager-ip-*******-195-20                1/1     Running     1 (8m50s ago)   8m46s
kube-system     kube-flannel-ds-km2pr                                    1/1     Running     1 (9m39s ago)   29m
kube-system     kube-proxy-ip-********-195-20                            1/1     Running     1 (9m39s ago)   29m
kube-system     kube-scheduler-ip-********-195-20                        1/1     Running     0               8m45s
kube-system     rke2-coredns-rke2-coredns-7875c9c6b7-x5ltn               0/1     Pending     0               29m
kube-system     rke2-coredns-rke2-coredns-autoscaler-564964dcd5-7cx85    0/1     Pending     0               29m
c
did you tag all the instances correctly? And is the hostname set correctly? There is configuration that needs to be done; that is covered in the cloud provider docs.
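For reference, a rough sketch of checking the instance identity and hostname from the node itself (IMDSv2 token flow assumed); the tags are easiest to confirm in the EC2 console:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id
hostnamectl status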
a
yes, they are tagged
I have to run for the day.....Are you around tomorrow?
Question.....
If Rancher is spinning up these nodes... it is spinning them up using the FQDN (DNS) from AWS... (i.e. ip-XXX.XXX-195-20.us-gov-west-1.compute.internal)
However, the hostname of the server is set to only (ip-XXX.XXX-195-20)
I am able to ping either name internally to that VPC????
I have also tried the same steps above using another RHEL8 AMI.......Same results as well
c
I believe the AWS cloud provider wants the hostname set to the FQDN
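A minimal sketch of setting that by hand, assuming IMDSv2 and that local-hostname returns the private DNS name:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
sudo hostnamectl set-hostname "$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-hostname)"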
a
is there a way possible for Rancher to do that when it spins up the cluster?
pushing up another cluster... that seemed to make it happy...
The only thing I don't like was that I had to redo the hostname as soon as the server came up, then restart it...
If I could get Rancher or AWS to automatically set the hostname to the DNS, I'd be in better shape... Any thoughts????
c
yeah I’m not sure about that :/
a
me either....lol
I'm looking it up now.
c
you could perhaps do it with cloud-init? userdata script?
a
maybe??? Although Rancher is just pulling from an AMI.
I wish there was a way from there....
@creamy-pencil-82913 Thanks for all your help......we finally found the .py script that was setting the IP.....We added in the FQDN to the script and are now able to spin up a cluster.
However, now when I add another worker into the cluster using the Rancher UI......I can see this in the logs of the server that I am trying to tie in as a worker into the cluster:
"/var/log/pods/cattle-system_apply-system-agent-upgrader-on-ip-100-103-XXX-XXX-with-a3b-828xx_bb243443-ff88-46fa-8d89-b2a1328989e5/upgrade"
2024-11-19T19:57:13.09523291Z stderr F + CATTLE_AGENT_VAR_DIR=/var/lib/rancher/agent
2024-11-19T19:57:13.095275555Z stderr F + TMPDIRBASE=/var/lib/rancher/agent/tmp
2024-11-19T19:57:13.095280668Z stderr F + mkdir -p /host/var/lib/rancher/agent/tmp
2024-11-19T19:57:13.105346926Z stderr F ++ chroot /host /bin/sh -c 'mktemp -d -p /var/lib/rancher/agent/tmp'
2024-11-19T19:57:13.109409085Z stderr F + TMPDIR=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.109431242Z stderr F + trap cleanup EXIT
2024-11-19T19:57:13.109436222Z stderr F + trap exit INT HUP TERM
2024-11-19T19:57:13.109439592Z stderr F + cp /opt/rancher-system-agent-suc/install.sh /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.119573919Z stderr F + cp /opt/rancher-system-agent-suc/rancher-system-agent /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.15984102Z stderr F + cp /opt/rancher-system-agent-suc/system-agent-uninstall.sh /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:13.16706152Z stderr F + chmod +x /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/install.sh
2024-11-19T19:57:13.173777677Z stderr F + chmod +x /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:13.181222466Z stderr F + '[' -n ip-100-103-XXX-XXX.us-gov-west-1.compute.internal ']'
2024-11-19T19:57:13.181242622Z stderr F + NODE_FILE=/host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:13.181247745Z stderr F + kubectl get node ip-100-103-XXX-XXX.us-gov-west-1.compute.internal -o yaml
2024-11-19T19:57:43.242817822Z stderr F E1119 19:57:43.242605    2510 memcache.go:265] couldn't get current server API group list: Get "https://10.43.0.1:443/api?timeout=32s": dial tcp 10.43.0.1:443 i/o timeout
2024-11-19T19:57:43.271375217Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.27183632Z stderr F + grep -q 'node-role.kubernetes.io/etcd: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.273184789Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.273387301Z stderr F + grep -q 'node-role.kubernetes.io/controlplane: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.274756508Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.274776474Z stderr F + grep -q 'node-role.kubernetes.io/control-plane: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.275922039Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.275971961Z stderr F + grep -q 'node-role.kubernetes.io/worker: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.277135711Z stderr F + export CATTLE_AGENT_BINARY_LOCAL=true
2024-11-19T19:57:43.277151918Z stderr F + CATTLE_AGENT_BINARY_LOCAL=true
2024-11-19T19:57:43.277210019Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-11-19T19:57:43.277215701Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-11-19T19:57:43.27725702Z stderr F + export CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent
2024-11-19T19:57:43.277264394Z stderr F + CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent
2024-11-19T19:57:43.277403223Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:43.277410887Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:43.277660443Z stderr F + '[' -s /host/etc/systemd/system/rancher-system-agent.env ']'
2024-11-19T19:57:43.277718413Z stderr F + chroot /host /var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/install.sh
2024-11-19T19:57:43.289930996Z stderr F [FATAL] You must select at least one role.
2024-11-19T19:57:43.290328053Z stderr F + cleanup
2024-11-19T19:57:43.290388352Z stderr F + rm -rf /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
[FATAL] You must select at least one role.
I have also upgraded to Rancher 2.10, and still get the same error as well......
Looks like it will not allow any workers to join the cluster unless there is a node that has all roles assigned (CEW). I tried a cluster with (separating out CE from W):
1 Control Plane and ETCD + 1 Worker = did not work (waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler)
Then I tried a cluster with 1 Control Plane and ETCD and Worker = works as normal
c
Yes? You can't have an agent (worker) without etcd and apiserver (control-plane). What would it connect to?
a
I was trying to keep the ETCD/ControlPlane server on 1 node, and workers on another... it's how we did the RKE1 env.
Is there maybe a known issue with having Trellix/McAfee on the Linux servers? I'm using RHEL8.
c
ok so what’s the problem? Make an etcd+controlplane node (server), then join a worker (agent). You just can’t do the worker first because it has nothing to connect to without a server.
a
When I spin up a new cluster.....I need to have a server with ALL roles.
Then I tried to split off the roles: 1 node with etcd+controlplane, 1 node with worker
Then, I removed the node with ALL roles
After I did that, I could no longer attach any workers to the new cluster
c
just spin up a new cluster and add one etcd+controlplane, and one worker
a
yes.....That solution did not work
c
in what way does it not work
a
The worker that was attached latest to the cluster complains about this in the logs: [FATAL] You must select at least one role.
I have 100X made sure the WORKER role is the only one selected in the pool
c
What is the exact command that’s being run to install the rancher-system-agent? You should be able to find it in the cloud-init logs.
a
crap....already deleted server......
lemme see if I can fire up another.
c
it should have --etcd, --controlplane, and/or --worker at the end, depending on which roles you’ve requested
a
ok, thanks... I'll look when the server comes up...
c
something like:
curl -fL https://RANCHER/system-agent-install.sh | sudo sh -s - --server https://RANCHER --label 'cattle.io/os=linux' --token TOKEN --etcd --controlplane --worker
a
where are the "normal" spots for the cloud-init logs again?
c
/var/log/cloud-init-* ?
a
I went through all the logs.....
I found this in the cloud-init-output.log
2024-11-20 20:35:35,243 - util.py[WARNING]: Failed loading yaml blob. Yaml load allows (<class 'dict'>,) root types, but got str instead
Cloud-init v. 23.4-7.el8_10.8 running 'modules:config' at Wed, 20 Nov 2024 20:37:06 +0000. Up 155.39 seconds.
[INFO] --no-roles flag passed, unsetting all other requested roles
[INFO] CA strict verification is set to true
[INFO] Using default agent configuration directory /etc/rancher/agent
[INFO] Using default agent var directory /var/lib/rancher/agent
[INFO] Successfully downloaded CA certificate
[INFO] Value from https://dev.rancher/cacerts is an x509 certificate
[INFO] Successfully tested Rancher connection
[INFO] Downloading rancher-system-agent binary from https://dev.rancher/assets/rancher-system-agent-amd64
[INFO] Successfully downloaded the rancher-system-agent binary.
[INFO] Downloading rancher-system-agent-uninstall.sh script from https://dev.rancher/assets/system-agent-uninstall.sh
[INFO] Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO] Generating Cattle ID
[INFO] Successfully downloaded Rancher connection information
[INFO] systemd: Creating service file
[INFO] Creating environment file /etc/systemd/system/rancher-system-agent.env
[INFO] Enabling rancher-system-agent.service
Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service → /etc/systemd/system/rancher-system-agent.service.
[INFO] Starting/restarting rancher-system-agent.service
Cloud-init v. 23.4-7.el8_10.8 running 'modules:final' at Wed, 20 Nov 2024 20:37:09 +0000. Up 158.43 seconds.
c
[INFO] --no-roles flag passed, unsetting all other requested roles
sure sounds like you don’t have any roles selected
what version of rancher are you using?
a
just upgraded to v2.10
had same issues in 2.9.3
c
is that the first pool or the second?
a
This is my "test" pool with a running cluster.
I have these pools
c
do you or do you not have another pool with etcd+control-plane in it
a
Pool ECW = etcd/controlplane/worker
Pool W = 1 worker
Pool TEST = trying to get a worker installed there.
I "did" get 1 worker in there earlier today... But... I had to have McAfee/Trellix services turned OFF before the node came online.
Have you heard of complications with that software lately? I never had issues with RKE1... But, I gather it's a new ballgame????
c
that does not seem related to the issue here where it’s showing that you don’t have any roles selected
a
Please let me know if you need to see more? AND thanks once again for your time!!!!!
c
if you go into Machines, and pick the node there, what do you see in the labels?
find the one with rke-machine-pool-name: cluster-test and confirm that it has worker-role: true
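The same check can also be done with kubectl against the Rancher management cluster, assuming the fleet-default namespace used elsewhere in this thread:
kubectl -n fleet-default get machines --show-labels | grep cluster-test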
a
I'll send... cleaning it up so it's not so large
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  creationTimestamp: '2024-11-20T20:34:01Z'
  finalizers:
    - machine.cluster.x-k8s.io
  generation: 2
  labels:
    cattle.io/os: linux
    cluster.x-k8s.io/cluster-name: dev
    cluster.x-k8s.io/deployment-name: dev-cluster-test
    cluster.x-k8s.io/set-name: dev-cluster-test-ppb7r
    machine-template-hash: 215087338-ppb7r
    rke.cattle.io/cluster-name: dev
    rke.cattle.io/machine-id: ed655b07ba64b0aae1eed2e2ae99bfeb5a9ea685ea050c804da2d5684ea0283
    rke.cattle.io/rke-machine-pool-name: cluster-test
    rke.cattle.io/worker-role: 'true'
  managedFields:
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
      manager: capi-machineset
      operation: Apply
      time: '2024-11-20T20:34:01Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: manager
      operation: Update
      time: '2024-11-20T20:34:01Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
      manager: manager
      operation: Update
      subresource: status
      time: '2024-11-20T20:34:12Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: rancher
      operation: Update
      time: '2024-11-20T20:37:13Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: rancher
      operation: Update
      subresource: status
      time: '2024-11-20T20:54:27Z'
  name: dev-cluster-test-ppb7r-xkbdr
  namespace: fleet-default
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: dev-cluster-test-ppb7r
      uid: 84c14b4f-e5a6-47fa-a5d0-25ec0aa66d42
  resourceVersion: '12610520'
  uid: f0790777-7ba2-4b75-b407-ad0a2ce65490
spec:
  bootstrap:
    configRef:
      apiVersion: rke.cattle.io/v1
      kind: RKEBootstrap
      name: dev-cluster-test-ppb7r-xkbdr
      namespace: fleet-default
      uid: 5bba66d5-2294-4233-8feb-cfb254ea1670
    dataSecretName: dev-cluster-test-ppb7r-xkbdr-machine-bootstrap
  clusterName: dev
  infrastructureRef:
    apiVersion: rke-machine.cattle.io/v1
    kind: Amazonec2Machine
    name: dev-cluster-test-ppb7r-xkbdr
    namespace: fleet-default
    uid: e701112b-84cf-4b36-a69f-f54c403189b9
  nodeDeletionTimeout: 10s
status:
  bootstrapReady: true
  conditions:
    - lastTransitionTime: '2024-11-20T20:34:02Z'
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-20T20:34:01Z'
      status: 'True'
      type: BootstrapReady
    - lastTransitionTime: '2024-11-20T20:34:12Z'
      status: 'True'
      type: InfrastructureReady
    - lastTransitionTime: '2024-11-20T20:34:01Z'
      reason: WaitingForNodeRef
      severity: Info
      status: 'False'
      type: NodeHealthy
    - lastTransitionTime: '2024-11-20T20:34:13Z'
      status: 'True'
      type: PlanApplied
    - lastTransitionTime: '2024-11-20T20:54:27Z'
      status: 'True'
      type: Reconciled
  infrastructureReady: false
  lastUpdated: '2024-11-20T20:34:01Z'
  observedGeneration: 2
  phase: Provisioning
Basically......you wanted to see this:
c
you gotta figure out where "[INFO] --no-roles flag passed, unsetting all other requested roles" is coming from. Do you have something that’s injecting extra config into your clusters?
Are you doing this via the Rancher UI, or via tf, or something else?
a
This is done at the Rancher UI
the only strange thing I mentioned before is... when I selected another AMI that had all McAfee services OFF, the server went through OK. I then ONLY changed the AMI in the pool to one that had McAfee services ON... and here we are.
That is the only difference that I can tell... All I did was change the AMI, nothing else.
c
I suspect there is something else different about that AMI
take a look at the cloud-init userdata that’s coming through on the two VMs, and compare them
a
ok.
you mean the logs?
c
no, the userdata
that is where the script to install rancher-system-agent is passed in
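A hedged sketch for comparing the rendered userdata on the two nodes (cloud-init default paths; the output file names are illustrative):
sudo cloud-init query userdata > /tmp/userdata-$(hostname).txt
# or: sudo cat /var/lib/cloud/instance/user-data.txt
diff userdata-server1.txt userdata-server2.txt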
a
sorry... could you point to where that is?
I've not looked at that one before...
this path?
"/var/lib/cloud/instance/user-data.txt"
c
yeah that should do it
what's in there
a
there is that file... and also one ending with the same name, but ".i" at the end
does it matter which file?
SERVER1 (Not working)

#cloud-config
hostname: dev-cluster-test-ppb7r-xkbdr
runcmd:
  - sh /usr/local/custom_script/install.sh
write_files:
  - content: XXXXXXXX
    encoding: gzip+b64
    path: /usr/local/custom_script/install.sh
    permissions: "0644"

SERVER2 (Working)

#cloud-config
hostname: dev-cluster-test1-bmvcl-hntck
runcmd:
  - sh /usr/local/custom_script/install.sh
write_files:
  - content: XXXXXXXXXXXX
    encoding: gzip+b64
    path: /usr/local/custom_script/install.sh
    permissions: "0644"
The file that ends in the .i looks about the same
The file "/usr/local/custom_script/install.sh" is exactly the same on both servers (besides the token of course)
c
What is the config in those files as far as roles?
you should see the command in there to curl and run the install script, what does that look like
a
you mean in the "install.sh" file? Or the user-data
Maybe this section?
retrieve_connection_info() {
    if [ "${CATTLE_REMOTE_ENABLED}" = "true" ]; then
        UMASK=$(umask)
        umask 0177
        i=1
        while [ "${i}" -ne "${RETRYCOUNT}" ]; do
            noproxy=""
            if [ "$(in_no_proxy ${CATTLE_AGENT_BINARY_URL})" = "0" ]; then
                noproxy="--noproxy '*'"
            fi
            RESPONSE=$(curl $noproxy --connect-timeout 60 --max-time 60 --write-out "%{http_code}\n" ${CURL_CAFLAG} ${CURL_LOG} -H "Authorization: Bearer ${CATTLE_TOKEN}" -H "X-Cattle-Id: ${CATTLE_ID}" -H "X-Cattle-Role-Etcd: ${CATTLE_ROLE_ETCD}" -H "X-Cattle-Role-Control-Plane: ${CATTLE_ROLE_CONTROLPLANE}" -H "X-Cattle-Role-Worker: ${CATTLE_ROLE_WORKER}" -H "X-Cattle-Node-Name: ${CATTLE_NODE_NAME}" -H "X-Cattle-Address: ${CATTLE_ADDRESS}" -H "X-Cattle-Internal-Address: ${CATTLE_INTERNAL_ADDRESS}" -H "X-Cattle-Labels: ${CATTLE_LABELS}" -H "X-Cattle-Taints: ${CATTLE_TAINTS}" "${CATTLE_SERVER}"/v3/connect/agent -o ${CATTLE_AGENT_VAR_DIR}/rancher2_connection_info.json)
            case "${RESPONSE}" in
                200)
                    info "Successfully downloaded Rancher connection information"
                    umask "${UMASK}"
                    return 0
                    ;;
                *)
                    i=$((i + 1))
                    error "$RESPONSE received while downloading Rancher connection information. Sleeping for 5 seconds and trying again"
                    sleep 5
                    continue
                    ;;
            esac
        done
        error "Failed to download Rancher connection information in ${i} attempts"
        umask "${UMASK}"
        # Clean up invalid rancher2_connection_info.json file
        rm -f ${CATTLE_AGENT_VAR_DIR}/rancher2_connection_info.json
        return 1
here is that json file.......
cat rancher2_connection_info.json
{
  "kubeConfig": "apiVersion: v1\nclusters:\n- cluster:\n    certificate-authority-data: XXXXXXXXXXX\n    server: https://dev.rancher\n  name: agent\ncontexts:\n- context:\n    cluster: agent\n    user: agent\n  name: agent\ncurrent-context: agent\nkind: Config\npreferences: {}\nusers:\n- name: agent\n  user:\n    token: XXXXXXXXXXXXXXXXXX\n",
  "namespace": "fleet-default",
  "secretName": "dev-cluster-test-ppb7r-xkbdr-machine-plan"
}
I took out TOKEN to shorten
c
that connection info file is just a kubeconfig
a
ok, anything in particular to look for?
both files look the same?
c
did you look for no-roles in there?
a
if [ "${CATTLE_ROLE_NONE}" = "true" ]; then
    info "--no-roles flag passed, unsetting all other requested roles"
    CATTLE_ROLE_CONTROLPLANE=false
    CATTLE_ROLE_ETCD=false
    CATTLE_ROLE_WORKER=false
...
"--no-roles")
    info "Role requested: none"
    CATTLE_ROLE_NONE=true
    shift 1
c
do you see CATTLE_ROLE_NONE being set in there anywhere? Or somewhere else in this host’s global environment?
a
being set in the "install.sh" file? Or "user-data"?
I don't see it "hard coded" in either.
c
its coming from somewhere, since that’s what the script is printing when it runs
somewhere in an env file under /etc/ perhaps?
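A rough sketch of hunting for it (the paths are guesses based on what has come up in this thread):
sudo grep -rn "CATTLE_ROLE_NONE\|no-roles" /etc/ /usr/local/custom_script/ /var/lib/cloud/instance/ 2>/dev/null
sudo cat /etc/systemd/system/rancher-system-agent.env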
a
lemme grep
nothing so far.....
I'll keep looking
It's not coming up with anything in particular.
I'll try firing up another worker with the AMI that has McAfee OFF...
I'll start the McAfee services, then make a new AMI from it.
I'll try that route, that way there is nothing different besides the McAfee services????
this may continue to tomorrow....I need to cook some dinner for fam
thanks again for help so far!!!!