# rke2
a
c
Have you looked at the logs on the node to see why it’s not coming up? Just saying that the probes are failing doesn’t give anyone anything to work off of. Log in and see what’s wrong.
a
It mentions the kube certs directory, etc. not being found.
It does not even create the folder?
c
the directories are created when rke2 starts. Is it being installed successfully? Is the rke2-server service failing to start?
a
the /var/lib/rancher/rke folders are being set up, that's about it.
I'll need to see if the service is started...
Here is latest provision log: [INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler
c
Check the rancher-system-agent logs to see if there are errors running the rke2 installer. Check the rke2-server logs to see if there are errors starting.
Also ensure that you have sufficient resources for all the pods to run. How many CPU cores and how much memory does this node have?
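For reference, a minimal sketch of pulling those logs on the node, assuming the standard systemd unit names from an rke2 / rancher-system-agent install:
journalctl -u rancher-system-agent --no-pager | tail -n 100
journalctl -u rke2-server --no-pager | tail -n 100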
a
Here is the service log:

systemctl status rke2-server.service
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
   Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2024-11-04 19:19:46 UTC; 2min 6s ago
     Docs: https://github.com/rancher/rke2#readme
 Main PID: 2138 (rke2)
    Tasks: 71
   Memory: 1.7G
   CGroup: /system.slice/rke2-server.service
           ├─2138 /usr/local/bin/rke2 server
           ├─2380 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
           ├─2583 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --cloud-provider=aws --cloud-config= --address=0.0.0.0 --anonymous-auth=false --authen>
           ├─2634 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id deba8d1639daad240ac8a2bc0ff22a40d191cbcef4de89d738354e99a1eafa39 -address /run/k3s/con>
           ├─2638 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3f06d3c42baa8eadfbc4a5f0921939288cd6287c3c3f988dfecd3f1127d4f9bb -address /run/k3s/con>
           └─2813 /var/lib/rancher/rke2/data/v1.28.14-rke2r1-00f05d9dc660/bin/containerd-shim-runc-v2 -namespace k8s.io -id 6e01c045a32cc346b9b9c0d6e5f9dde4ec38231ad1e89503ace4b9c6c0e36321 -address /run/k3s/con>

Nov 04 19:21:22 ip-*****-195-49 rke2[2138]: time="2024-11-04T19:21:22Z" level=info msg="Defragmenting etcd database"
Nov 04 19:21:27 ip-*****-195-49 rke2[2138]: {"level":"warn","ts":"2024-11-04T19:21:27.611701Z","logger":"etcd-client","caller":"v3@v3.5.13-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed>
Nov 04 19:21:27 ip-**-195-49 rke2[2138]: time="2024-11-04T19:21:27Z" level=info msg="Failed to test data store connection: failed to report and disarm etcd alarms: etcd alarm list failed: context deadline >
Nov 04 19:21:29 ip-*****-195-49 rke2[2138]: time="2024-11-04T19:21:29Z" level=info msg="Pod for etcd is synced"
Nov 04 19:21:32 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:32Z" level=info msg="Defragmenting etcd database"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="etcd data store connection OK"
Nov 04 19:21:40 ip-*******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="Saving cluster bootstrap data to datastore"
Nov 04 19:21:40 ip-******-195-49 rke2[2138]: time="2024-11-04T19:21:40Z" level=info msg="Waiting for API server to become available"
Nov 04 19:21:41 ip-*******-195-49 rke2[2138]: time="2024-11-04T19:21:41Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
The node is an m5.large.
Keep in mind, I have everything on the same node at this point for testing... (etcd, control-plane, worker)
c
are you able to kubectl describe node? The apiserver is up, so you should.
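For reference, a minimal sketch of running that against the local rke2 kubeconfig (paths are rke2 defaults; the bundled kubectl location is an assumption based on a standard install):
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl describe node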
a
I exported the kubeconfig to point to the rke2.yaml path
But, it still says connection to server 127.0.0.1:6443 was refused
c
ok, so apiserver is not running. Check the apiserver pod logs under /var/log/pods
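For reference, a rough way to find those logs, assuming the usual /var/log/pods layout (the directory names include the pod UID and will differ per node):
ls /var/log/pods/ | grep kube-apiserver
tail -n 100 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log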
a
I'll get the log here in a sec...
Starting with Kubernetes 1.23, you must deactivate the CSIMigrationAWS feature gate to use the in-tree AWS cloud provider. You can do this by setting feature-gates=CSIMigrationAWS=false as an additional argument for the cluster's Kubelet, Controller Manager, API Server and Scheduler in the advanced cluster configuration.
2024-11-04T19:37:53.45531168Z stderr F Error: invalid argument "CSIMigrationAWS=false" for "--feature-gates" flag: unrecognized feature gate: CSIMigrationAWS
looks like it's complaining about it??
I took out the configs it told me to put in...
Provision logs now say:
[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for cluster agent to connect
[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-drrqc-bqhxs: waiting for probes: kube-controller-manager
in the kube-controller-manager pod logs.....
1 leaderelection.go:260] successfully acquired lease kube-system/kube-controller-manager
2024-11-04T19:50:42.984948375Z stderr F I1104 19:50:42.981921       1 event.go:307] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="ip-*****-195-49_bba5b434-1da8-4b2a-bf29-44b8c30abeec became leader"
2024-11-04T19:50:42.994929167Z stderr F E1104 19:50:42.994202       1 controllermanager.go:235] "Error building controller context" err="cloud provider could not be initialized: unknown cloud provider \"aws\""
c
You’re not running any of the affected versions. Don’t do any of that stuff. The AWS cloud provider is no longer built into Kubernetes and that feature gate is long removed.
a
So, don't do ANY of that stuff on the Rancher page?
c
Starting with Kubernetes 1.23 you would need to set that feature gate. Starting with 1.27 the cloud provider is completely removed from Kubernetes, so you can’t use it at all (nor can you enable that feature gate); you need to deploy the out-of-tree cloud provider as described in the first link.
a
I started Kube 1.28
so, just to be straight... I DO NOT do this step: "2. Rancher managed RKE2/K3s clusters don't support configuring providerID. However, the engine will set the node name correctly if the following configuration is set on the provisioning cluster object:"
or you can try using the UI if you prefer, as described further down the page.
a
Yes, I am trying to do the "out of tree" specified on the page.....
In Kubernetes 1.27 and later, you must use an out-of-tree AWS cloud provider.
c
yes. So don’t set cloud provider name or that csi migration feature gate.
a
So... you're saying... DO NOT do the stuff at the start of that page, only do the stuff at the bottom?
Lemme try that...
c
yes. those are instructions for older versions of Kubernetes. If you are on 1.27 or newer, set cloud provider in the Rancher UI to “External”, and then deploy the aws cloud provider helm chart.
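A hedged sketch of that second part with plain helm (the chart name and repo URL match the manifest quoted later in this thread; the repo alias is illustrative):
helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm repo update
helm upgrade --install aws-cloud-controller-manager aws-cloud-controller-manager/aws-cloud-controller-manager --namespace kube-system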
a
Yep, I've tried that option as well...
I have that on another cluster...
But, not sure it's set up all good just yet.
But, you're saying, when setting up a cluster from scratch, I need to do the following: Set this for the IMDSv2:
Set the Cloud Provider to "External"?
Will I also have to do the section for setting the overrides for Etcd/control-plane/worker still?
the section starting with "Override on Etcd:"
c
no, you don’t need to override anything
a
OK....building cluster now.
lets see how this one goes!
AND... Thank you soooo much in advance!
c
note that you will need to deploy the cloud provider before it’ll finish coming up. The CNI won’t deploy until the cloud provider is available.
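For context, a quick way to see that state from the node; the taint below is the standard one applied when the kubelet runs with an external cloud provider:
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml describe node | grep -A3 Taints
# expect node.cloudprovider.kubernetes.io/uninitialized until the cloud controller manager initializes the node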
a
You mean this step >> "3. Specify the aws-cloud-controller-manager Helm chart as an additional manifest to install:"
I'll need to do that using the .yaml editor, can't do it through the GUI... right?
I manually put that in the cluster while editing the .yaml file for it...
Now it says "[INFO ] configuring bootstrap node(s) dev8-cluster-ecw-2mr4t-8dj7t: waiting for cluster agent to connect"
I put in all the settings verbatim from the install page...
Was there something that I needed to change from exactly what was stated on the page?
'''
spec:
  rkeConfig:
    additionalManifest: |-
      apiVersion: helm.cattle.io/v1
      kind: HelmChart
      metadata:
        name: aws-cloud-controller-manager
        namespace: kube-system
      spec:
        chart: aws-cloud-controller-manager
        repo: https://kubernetes.github.io/cloud-provider-aws
        targetNamespace: kube-system
        bootstrap: true
        valuesContent: |-
          hostNetworking: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: "true"
          args:
            - --configure-cloud-routes=false
            - --v=5
            - --cloud-provider=aws
'''
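If the helm-install job for that chart fails, a hedged way to pull its logs (assuming the helm-controller names the Job after the chart, as it normally does):
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml -n kube-system logs job/helm-install-aws-cloud-controller-manager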
I see this in the a
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=helm pod=helm-install-aws-cloud-controller-manager-wnvdx_kube-system(dcf32f16-71f5-41d1-9c9e-987aebefd4f9)\"" pod="kube-system/helm-install-aws-cloud-controller-manager-wnvdx" podUID="dcf32f16-71f5-41d1-9c9e-987aebefd4f9"
I'm trying this...
Undo what I just did with the cluster yaml file.
Went onto the ECW server and did a helm install.
restarting node now
in the aws cloud provider logs... I'm getting this: error syncing 'ip--195-20': failed to get provider ID for node ip-*-195-20 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
That is the IP of the only server I have in the cluster at this time
kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get po -A
NAMESPACE       NAME                                                     READY   STATUS      RESTARTS        AGE
cattle-system   cattle-cluster-agent-67646db497-xnfcl                    0/1     Pending     0               29m
kube-system     aws-cloud-controller-manager-7p6vk                       1/1     Running     2 (8m29s ago)   13m
kube-system     etcd-ip-*****-195-20                                     1/1     Running     1               29m
kube-system     helm-install-rke2-coredns-46thb                          0/1     Completed   0               29m
kube-system     helm-install-rke2-flannel-xlsbd                          0/1     Completed   0               29m
kube-system     helm-install-rke2-ingress-nginx-srz2x                    0/1     Pending     0               29m
kube-system     helm-install-rke2-metrics-server-q75vt                   0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-controller-crd-4dpht          0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-controller-d8sdt              0/1     Pending     0               29m
kube-system     helm-install-rke2-snapshot-validation-webhook-d68c8      0/1     Pending     0               29m
kube-system     kube-apiserver-ip-******-195-20                          1/1     Running     2               29m
kube-system     kube-controller-manager-ip-*******-195-20                1/1     Running     1 (8m50s ago)   8m46s
kube-system     kube-flannel-ds-km2pr                                    1/1     Running     1 (9m39s ago)   29m
kube-system     kube-proxy-ip-********-195-20                            1/1     Running     1 (9m39s ago)   29m
kube-system     kube-scheduler-ip-********-195-20                        1/1     Running     0               8m45s
kube-system     rke2-coredns-rke2-coredns-7875c9c6b7-x5ltn               0/1     Pending     0               29m
kube-system     rke2-coredns-rke2-coredns-autoscaler-564964dcd5-7cx85    0/1     Pending     0               29m
c
did you tag all the instances correctly? And is the hostname set correctly? There is configuration that needs to be done; that is covered in the cloud provider docs.
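For reference, a rough sketch of checking the instance identity and hostname from the node itself (IMDSv2 token flow assumed); the tags are easiest to confirm in the EC2 console:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id
hostnamectl status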
a
yes, they are tagged
I have to run for the day.....Are you around tomorrow?
Question.....
If Rancher is spinning up these nodes... it is spinning them up using the FQDN (DNS) from AWS... (i.e. ip-XXX.XXX-195-20.us-gov-west-1.compute.internal)
However, the hostname of the server is set to only (ip-XXX.XXX-195-20)
I am able to ping either name internally to that VPC????
I have also tried the same steps above using another RHEL8 AMI.......Same results as well
c
I believe the AWS cloud provider wants the hostname set to the FQDN
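A minimal sketch of setting that by hand, assuming IMDSv2 and that local-hostname returns the private DNS name:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
sudo hostnamectl set-hostname "$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/local-hostname)"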
a
is there a way possible for Rancher to do that when it spins up the cluster?
pushing up another cluster... that seemed to make it happy...
The only thing I don't like was that I had to redo the hostname as soon as the server came up, then restart it...
If I could get Rancher or AWS to automatically set the hostname to the DNS, I'd be in better shape... Any thoughts????
c
yeah I’m not sure about that :/
a
me either....lol
I'm looking it up now.
c
you could perhaps do it with cloud-init? userdata script?
a
maybe??? Although Rancher is just pulling from an AMI.
I wish there was a way from there....
@creamy-pencil-82913 Thanks for all your help......we finally found the .py script that was setting the IP.....We added in the FQDN to the script and are now able to spin up a cluster.
However, now when I add another worker into the cluster using the Rancher UI......I can see this in the logs of the server that I am trying to tie in as a worker into the cluster:
"/var/log/pods/cattle-system_apply-system-agent-upgrader-on-ip-100-103-XXX-XXX-with-a3b-828xx_bb243443-ff88-46fa-8d89-b2a1328989e5/upgrade"
2024-11-19T19:57:13.09523291Z stderr F + CATTLE_AGENT_VAR_DIR=/var/lib/rancher/agent
2024-11-19T19:57:13.095275555Z stderr F + TMPDIRBASE=/var/lib/rancher/agent/tmp
2024-11-19T19:57:13.095280668Z stderr F + mkdir -p /host/var/lib/rancher/agent/tmp
2024-11-19T19:57:13.105346926Z stderr F ++ chroot /host /bin/sh -c 'mktemp -d -p /var/lib/rancher/agent/tmp'
2024-11-19T19:57:13.109409085Z stderr F + TMPDIR=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.109431242Z stderr F + trap cleanup EXIT
2024-11-19T19:57:13.109436222Z stderr F + trap exit INT HUP TERM
2024-11-19T19:57:13.109439592Z stderr F + cp /opt/rancher-system-agent-suc/install.sh /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.119573919Z stderr F + cp /opt/rancher-system-agent-suc/rancher-system-agent /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
2024-11-19T19:57:13.15984102Z stderr F + cp /opt/rancher-system-agent-suc/system-agent-uninstall.sh /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:13.16706152Z stderr F + chmod +x /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/install.sh
2024-11-19T19:57:13.173777677Z stderr F + chmod +x /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:13.181222466Z stderr F + '[' -n ip-100-103-XXX-XXX.us-gov-west-1.compute.internal ']'
2024-11-19T19:57:13.181242622Z stderr F + NODE_FILE=/host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:13.181247745Z stderr F + kubectl get node ip-100-103-XXX-XXX.us-gov-west-1.compute.internal -o yaml
2024-11-19T19:57:43.242817822Z stderr F E1119 19:57:43.242605    2510 memcache.go:265] couldn't get current server API group list: Get "https://10.43.0.1:443/api?timeout=32s": dial tcp 10.43.0.1:443 i/o timeout
2024-11-19T19:57:43.271375217Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.27183632Z stderr F + grep -q 'node-role.kubernetes.io/etcd: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.273184789Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.273387301Z stderr F + grep -q 'node-role.kubernetes.io/controlplane: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.274756508Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.274776474Z stderr F + grep -q 'node-role.kubernetes.io/control-plane: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.275922039Z stderr F + '[' -z '' ']'
2024-11-19T19:57:43.275971961Z stderr F + grep -q 'node-role.kubernetes.io/worker: "true"' /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/node.yaml
2024-11-19T19:57:43.277135711Z stderr F + export CATTLE_AGENT_BINARY_LOCAL=true
2024-11-19T19:57:43.277151918Z stderr F + CATTLE_AGENT_BINARY_LOCAL=true
2024-11-19T19:57:43.277210019Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-11-19T19:57:43.277215701Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL=true
2024-11-19T19:57:43.27725702Z stderr F + export CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent
2024-11-19T19:57:43.277264394Z stderr F + CATTLE_AGENT_BINARY_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent
2024-11-19T19:57:43.277403223Z stderr F + export CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:43.277410887Z stderr F + CATTLE_AGENT_UNINSTALL_LOCAL_LOCATION=/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/rancher-system-agent-uninstall.sh
2024-11-19T19:57:43.277660443Z stderr F + '[' -s /host/etc/systemd/system/rancher-system-agent.env ']'
2024-11-19T19:57:43.277718413Z stderr F + chroot /host /var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ/install.sh
2024-11-19T19:57:43.289930996Z stderr F [FATAL] You must select at least one role.
2024-11-19T19:57:43.290328053Z stderr F + cleanup
2024-11-19T19:57:43.290388352Z stderr F + rm -rf /host/var/lib/rancher/agent/tmp/tmp.joYhjF4zdQ
[FATAL] You must select at least one role.
I have also upgraded to Rancher 2.10, and still get the same error as well......
Looks like it will not allow any workers to join the cluster unless there is a node that has all roles assigned (CEW). I tried a cluster with (separating out CE from W):
1 Control Plane and ETCD + 1 Worker = did not work (waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler)
Then I tried a cluster with 1 Control Plane and ETCD and Worker = works as normal
c
Yes? You can't have an agent (worker) without etcd and apiserver (control-plane). What would it connect to?
a
I was trying to keep the ETCD/ControlPlane server on 1 node, and workers on another... it's how we did the RKE1 env.
Is there maybe a known issue with having Trellix/McAfee on the Linux servers? I'm using RHEL8.
c
ok so what’s the problem? Make an etcd+controlplane node (server), then join a worker (agent). You just can’t do the worker first because it has nothing to connect to without a server.
a
When I spin up a new cluster.....I need to have a server with ALL roles.
Then I tried to split off the roles: 1 node with etcd+controlplane, 1 node with worker
Then, I removed the node with ALL roles
After I did that, I could no longer attach any workers to the new cluster
c
just spin up a new cluster and add one etcd+controlplane, and one worker
a
yes.....That solution did not work
c
in what way does it not work
a
The worker that was attached latest to the cluster complains about this in the logs: [FATAL] You must select at least one role.
I have 100X made sure the WORKER role is the only one selected in the pool
c
What is the exact command that’s being run to install the rancher-system-agent? You should be able to find it in the cloud-init logs.
a
crap....already deleted server......
lemme see if I can fire up another.
c
it should have --etcd, --controlplane, and/or --worker at the end, depending on which roles you’ve requested
a
ok, thanks... I'll look when the server comes up...
c
something like:
curl -fL https://RANCHER/system-agent-install.sh | sudo sh -s - --server https://RANCHER --label 'cattle.io/os=linux' --token TOKEN --etcd --controlplane --worker
a
where are the "normal" spots for the cloud-init logs again?
c
/var/log/cloud-init-* ?
a
I went through all the logs.....
I found this in the cloud-init-output.log
2024-11-20 20:35:35,243 - util.py[WARNING]: Failed loading yaml blob. Yaml load allows (<class 'dict'>,) root types, but got str instead
Cloud-init v. 23.4-7.el8_10.8 running 'modules:config' at Wed, 20 Nov 2024 20:37:06 +0000. Up 155.39 seconds.
[INFO] --no-roles flag passed, unsetting all other requested roles
[INFO] CA strict verification is set to true
[INFO] Using default agent configuration directory /etc/rancher/agent
[INFO] Using default agent var directory /var/lib/rancher/agent
[INFO] Successfully downloaded CA certificate
[INFO] Value from https://dev.rancher/cacerts is an x509 certificate
[INFO] Successfully tested Rancher connection
[INFO] Downloading rancher-system-agent binary from https://dev.rancher/assets/rancher-system-agent-amd64
[INFO] Successfully downloaded the rancher-system-agent binary.
[INFO] Downloading rancher-system-agent-uninstall.sh script from https://dev.rancher/assets/system-agent-uninstall.sh
[INFO] Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO] Generating Cattle ID
[INFO] Successfully downloaded Rancher connection information
[INFO] systemd: Creating service file
[INFO] Creating environment file /etc/systemd/system/rancher-system-agent.env
[INFO] Enabling rancher-system-agent.service
Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service → /etc/systemd/system/rancher-system-agent.service.
[INFO] Starting/restarting rancher-system-agent.service
Cloud-init v. 23.4-7.el8_10.8 running 'modules:final' at Wed, 20 Nov 2024 20:37:09 +0000. Up 158.43 seconds.
c
[INFO] --no-roles flag passed, unsetting all other requested roles
sure sounds like you don’t have any roles selected
what version of rancher are you using?
a
just upgraded to v2.10
had same issues in 2.9.3
c
is that the first pool or the second?
a
This is my "test" pool with a running cluster.
I have these pools
c
do you or do you not have another pool with etcd+control-plane in it
a
Pool ECW = etcd/controlplane/worker
Pool W = 1 worker
Pool TEST = trying to get a worker installed there.
I "did" get 1 worker in there earlier today... But... I had to have McAfee/Trellix services turned OFF before the node came online.
Have you heard of complications with that software lately? I never had issues with RKE1... But, I gather it's a new ballgame????
c
that does not seem related to the issue here where it’s showing that you don’t have any roles selected
a
Please let me know if you need to see more? AND thanks once again for your time!!!!!
c
if you go into Machines, and pick the node there, what do you see in the labels?
find the one with rke-machine-pool-name: cluster-test and confirm that it has worker-role: true
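The same check can also be done with kubectl against the Rancher management cluster, assuming the fleet-default namespace used elsewhere in this thread:
kubectl -n fleet-default get machines --show-labels | grep cluster-test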
a
I'll send... cleaning it up so it's not so large
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  creationTimestamp: '2024-11-20T20:34:01Z'
  finalizers:
    - machine.cluster.x-k8s.io
  generation: 2
  labels:
    cattle.io/os: linux
    cluster.x-k8s.io/cluster-name: dev
    cluster.x-k8s.io/deployment-name: dev-cluster-test
    cluster.x-k8s.io/set-name: dev-cluster-test-ppb7r
    machine-template-hash: 215087338-ppb7r
    rke.cattle.io/cluster-name: dev
    rke.cattle.io/machine-id: ed655b07ba64b0aae1eed2e2ae99bfeb5a9ea685ea050c804da2d5684ea0283
    rke.cattle.io/rke-machine-pool-name: cluster-test
    rke.cattle.io/worker-role: 'true'
  managedFields:
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
      manager: capi-machineset
      operation: Apply
      time: '2024-11-20T20:34:01Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: manager
      operation: Update
      time: '2024-11-20T20:34:01Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
      manager: manager
      operation: Update
      subresource: status
      time: '2024-11-20T20:34:12Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: rancher
      operation: Update
      time: '2024-11-20T20:37:13Z'
    - apiVersion: cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
      manager: rancher
      operation: Update
      subresource: status
      time: '2024-11-20T20:54:27Z'
  name: dev-cluster-test-ppb7r-xkbdr
  namespace: fleet-default
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: dev-cluster-test-ppb7r
      uid: 84c14b4f-e5a6-47fa-a5d0-25ec0aa66d42
  resourceVersion: '12610520'
  uid: f0790777-7ba2-4b75-b407-ad0a2ce65490
spec:
  bootstrap:
    configRef:
      apiVersion: rke.cattle.io/v1
      kind: RKEBootstrap
      name: dev-cluster-test-ppb7r-xkbdr
      namespace: fleet-default
      uid: 5bba66d5-2294-4233-8feb-cfb254ea1670
    dataSecretName: dev-cluster-test-ppb7r-xkbdr-machine-bootstrap
  clusterName: dev
  infrastructureRef:
    apiVersion: rke-machine.cattle.io/v1
    kind: Amazonec2Machine
    name: dev-cluster-test-ppb7r-xkbdr
    namespace: fleet-default
    uid: e701112b-84cf-4b36-a69f-f54c403189b9
  nodeDeletionTimeout: 10s
status:
  bootstrapReady: true
  conditions:
    - lastTransitionTime: '2024-11-20T20:34:02Z'
      status: 'True'
      type: Ready
    - lastTransitionTime: '2024-11-20T20:34:01Z'
      status: 'True'
      type: BootstrapReady
    - lastTransitionTime: '2024-11-20T20:34:12Z'
      status: 'True'
      type: InfrastructureReady
    - lastTransitionTime: '2024-11-20T20:34:01Z'
      reason: WaitingForNodeRef
      severity: Info
      status: 'False'
      type: NodeHealthy
    - lastTransitionTime: '2024-11-20T20:34:13Z'
      status: 'True'
      type: PlanApplied
    - lastTransitionTime: '2024-11-20T20:54:27Z'
      status: 'True'
      type: Reconciled
  infrastructureReady: false
  lastUpdated: '2024-11-20T20:34:01Z'
  observedGeneration: 2
  phase: Provisioning
Basically......you wanted to see this:
c
you gotta figure out where "[INFO] --no-roles flag passed, unsetting all other requested roles" is coming from. Do you have something that’s injecting extra config into your clusters?
Are you doing this via the Rancher UI, or via tf, or something else?
a
This is done at the Rancher UI
the only strange thing I mentioned before is... when I selected another AMI that had all McAfee services OFF, the server went through OK. I then ONLY changed the AMI in the pool to one that had McAfee services ON... and here we are.
That is the only difference that I can tell... All I did was change the AMI, nothing else.
c
I suspect there is something else different about that AMI
take a look at the cloud-init userdata that’s coming through on the two VMs, and compare them
a
ok.
you mean the logs?
c
no, the userdata
that is where the script to install rancher-system-agent is passed in
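A hedged sketch for comparing the rendered userdata on the two nodes (cloud-init default paths; the output file names are illustrative):
sudo cloud-init query userdata > /tmp/userdata-$(hostname).txt
# or: sudo cat /var/lib/cloud/instance/user-data.txt
diff userdata-server1.txt userdata-server2.txt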
a
sorry... could you point to where that is?
I've not looked at that one before...
this path?
"/var/lib/cloud/instance/user-data.txt"
c
yeah that should do it
what's in there
a
there is that file... and also one ending with the same name, but ".i" at the end
does it matter which file?
SERVER1 (Not working)

#cloud-config
hostname: dev-cluster-test-ppb7r-xkbdr
runcmd:
  - sh /usr/local/custom_script/install.sh
write_files:
  - content: XXXXXXXX
    encoding: gzip+b64
    path: /usr/local/custom_script/install.sh
    permissions: "0644"

SERVER2 (Working)

#cloud-config
hostname: dev-cluster-test1-bmvcl-hntck
runcmd:
  - sh /usr/local/custom_script/install.sh
write_files:
  - content: XXXXXXXXXXXX
    encoding: gzip+b64
    path: /usr/local/custom_script/install.sh
    permissions: "0644"
The file that ends in the .i looks about the same
The file "/usr/local/custom_script/install.sh" is exactly the same on both servers (besides the token of course)
c
What is the config in those files as far as roles?
you should see the command in there to curl and run the install script, what does that look like
a
you mean in the "install.sh" file? Or the user-data
Maybe this section?
retrieve_connection_info() {
    if [ "${CATTLE_REMOTE_ENABLED}" = "true" ]; then
        UMASK=$(umask)
        umask 0177
        i=1
        while [ "${i}" -ne "${RETRYCOUNT}" ]; do
            noproxy=""
            if [ "$(in_no_proxy ${CATTLE_AGENT_BINARY_URL})" = "0" ]; then
                noproxy="--noproxy '*'"
            fi
            RESPONSE=$(curl $noproxy --connect-timeout 60 --max-time 60 --write-out "%{http_code}\n" ${CURL_CAFLAG} ${CURL_LOG} -H "Authorization: Bearer ${CATTLE_TOKEN}" -H "X-Cattle-Id: ${CATTLE_ID}" -H "X-Cattle-Role-Etcd: ${CATTLE_ROLE_ETCD}" -H "X-Cattle-Role-Control-Plane: ${CATTLE_ROLE_CONTROLPLANE}" -H "X-Cattle-Role-Worker: ${CATTLE_ROLE_WORKER}" -H "X-Cattle-Node-Name: ${CATTLE_NODE_NAME}" -H "X-Cattle-Address: ${CATTLE_ADDRESS}" -H "X-Cattle-Internal-Address: ${CATTLE_INTERNAL_ADDRESS}" -H "X-Cattle-Labels: ${CATTLE_LABELS}" -H "X-Cattle-Taints: ${CATTLE_TAINTS}" "${CATTLE_SERVER}"/v3/connect/agent -o ${CATTLE_AGENT_VAR_DIR}/rancher2_connection_info.json)
            case "${RESPONSE}" in
                200)
                    info "Successfully downloaded Rancher connection information"
                    umask "${UMASK}"
                    return 0
                    ;;
                *)
                    i=$((i + 1))
                    error "$RESPONSE received while downloading Rancher connection information. Sleeping for 5 seconds and trying again"
                    sleep 5
                    continue
                    ;;
            esac
        done
        error "Failed to download Rancher connection information in ${i} attempts"
        umask "${UMASK}"
        # Clean up invalid rancher2_connection_info.json file
        rm -f ${CATTLE_AGENT_VAR_DIR}/rancher2_connection_info.json
        return 1
here is that json file.......
cat rancher2_connection_info.json
{
  "kubeConfig": "apiVersion: v1\nclusters:\n- cluster:\n    certificate-authority-data: XXXXXXXXXXX\n    server: https://dev.rancher\n  name: agent\ncontexts:\n- context:\n    cluster: agent\n    user: agent\n  name: agent\ncurrent-context: agent\nkind: Config\npreferences: {}\nusers:\n- name: agent\n  user:\n    token: XXXXXXXXXXXXXXXXXX\n",
  "namespace": "fleet-default",
  "secretName": "dev-cluster-test-ppb7r-xkbdr-machine-plan"
}
I took out TOKEN to shorten
c
that connection info file is just a kubeconfig
a
ok, anything in particular to look for?
both files look the same?
c
did you look for no-roles in there?
a
if [ "${CATTLE_ROLE_NONE}" = "true" ]; then
    info "--no-roles flag passed, unsetting all other requested roles"
    CATTLE_ROLE_CONTROLPLANE=false
    CATTLE_ROLE_ETCD=false
    CATTLE_ROLE_WORKER=false
...
"--no-roles")
    info "Role requested: none"
    CATTLE_ROLE_NONE=true
    shift 1
c
do you see CATTLE_ROLE_NONE being set in there anywhere? Or somewhere else in this host’s global environment?
a
being set in the "install.sh" file? Or "user-data"?
I don't see it "hard coded" in either.
c
its coming from somewhere, since that’s what the script is printing when it runs
somewhere in an env file under /etc/ perhaps?
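A rough sketch of hunting for it (the paths are guesses based on what has come up in this thread):
sudo grep -rn "CATTLE_ROLE_NONE\|no-roles" /etc/ /usr/local/custom_script/ /var/lib/cloud/instance/ 2>/dev/null
sudo cat /etc/systemd/system/rancher-system-agent.env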
a
lemme grep
nothing so far.....
I'll keep looking
It's not coming up with anything in particular.
I'll try firing up another worker with the AMI that has McAfee OFF...
I'll start the McAfee services, then make a new AMI from it.
I'll try that route, that way there is nothing different besides the McAfee services????
this may continue to tomorrow....I need to cook some dinner for fam
thanks again for help so far!!!!