Powered by Linen
rke2
  • f

    future-monitor-61871

    10/06/2022, 6:53 PM
    We're on 1.23.7 and looking to upgrade to 1.25.2. We've got the CIS 1.6 flag turned on and, having read the changes in the CIS benchmarks for 1.24+, I'm wondering if anyone has done a similar upgrade via the automated upgrade controller. Is that recommended/supported for hardened clusters?
  • a

    able-engineer-22050

    10/07/2022, 2:44 PM
    Hi, I have RKE2 v1.22.9 with rke2-canal. Since the operating system supports ipset 7.15, I'm facing the issue of it not being compatible with the ipset in hardened-calico. According to the Tigera documentation, the latest image supports ipset 7.11, which is still incompatible with 7.15. As I'm not using any of the extra security features of hardened-calico (not even sure why I went with this in the first place), I would like to replace it. I've done a CNI replacement in a different cluster (EKS, replaced the stock CNI with Weave), but I'm not sure what the procedure is here (if it is possible at all). The Tigera documentation mentions a possible migration path from Canal to Calico, but does that apply here? The CNI was installed as part of the RKE2 installation. Does replacing the CNI break anything in the cluster?
  • c

    careful-piano-35019

    10/10/2022, 10:21 AM
    Hi @most-hairdresser-42454 😄 glad to see you around. It's been some time.
  • m

    most-hairdresser-42454

    10/10/2022, 10:23 AM
    Hi @careful-piano-35019, it's been a long time 😄
  • b

    broad-farmer-70498

    10/10/2022, 9:56 PM
    I'm trying to restore a node that is a control plane node (I rebuilt the node and copied all the data/etc back into place), but it's not really starting up. The logs keep saying, effectively, that etcd hasn't started, but I'm not really sure where to look for further debugging. Any tips?
  • h

    hallowed-energy-68622

    10/11/2022, 9:21 AM
    Hi @creamy-pencil-82913, it seems that with this fixed issue, https://github.com/k3s-io/helm-controller/issues/89 , we can use a self-signed CA in the HelmChart custom resource, but I do not see any field for providing the custom CA used to fetch the chart in the field definitions here: https://docs.k3s.io/helm#helmchart-field-definitions
  • a

    able-engineer-22050

    10/11/2022, 2:19 PM
    Hi, I have an RKE2 cluster with restored etcd. After the cluster reset, master2 came online but it didn't connect back to Rancher. The rancher-system-agent logs show:
    Oct 11 14:09:38 worker2 rancher-system-agent[1210]: time="2022-10-11T14:09:38Z" level=error msg="[K8s] Received secret that was nil"
    Oct 11 14:09:43 worker2 rancher-system-agent[1210]: time="2022-10-11T14:09:43Z" level=error msg="[K8s] Received secret that was nil"
    It appears that the Rancher agent does not have the proper token to register the cluster with Rancher. How do I get rancher-system-agent to reconnect?
  • r

    rough-ocean-41843

    10/11/2022, 2:29 PM
    When setting up an nginx or haproxy LB, do I set that up outside of the Kubernetes environment? For example, as a Docker service on the main Rancher Docker server? Or in tandem?
  • r

    rough-ocean-41843

    10/11/2022, 7:56 PM
    So I got nginx working in RKE2, but how do I find out whether it was set up using NodePort, ClusterIP, or LoadBalancer without looking at the config?
  • s

    stale-painting-80203

    10/12/2022, 5:30 PM
    I created a Rancher HA setup with RKE2 installed on 3 nodes behind a LB. I have observed that if I reboot one of the nodes, it never rejoins the cluster after rebooting. This is not the case if I use RKE instead of RKE2. In our infrastructure it is possible that one or more nodes could fail. Is this expected behavior with RKE2?
  • a

    ancient-air-32350

    10/13/2022, 5:41 PM
    Is it possible to set
    kube-proxy-replacement
    to strict on Rancher-launched RKE2 clusters with Cilium? If yes, could you please tell me how? Thanks
  • h

    hundreds-hairdresser-46043

    10/14/2022, 2:00 PM
    General question about air-gap installs: there are 2 image archives I always download, rke2-images-calico.linux-amd64.tar.gz and rke2.linux-amd64.tar.gz. My aim is to make sure that NO images are downloaded during the initial build of the cluster. Any hints or guides to the other images? Is there any way I can see what was downloaded other than tailing the log during install?
  • r

    rich-crowd-36987

    10/14/2022, 3:40 PM
    Recently moved some master nodes into different AWS AZs and am now having cert issues:
    Oct 14 15:09:01 k8worker05 rke2: time="2022-10-14T15:09:01Z" level=info msg="Connecting to proxy" url="wss://10.149.5.62:9345/v1-rke2/connect"
    Oct 14 15:09:01 k8worker05 rke2: time="2022-10-14T15:09:01Z" level=error msg="Failed to connect to proxy" error="x509: certificate is valid for 10.149.4.146, 10.149.4.32, 10.149.4.77, 10.43.0.1, 127.0.0.1, not 10.149.5.62"
    Oct 14 15:09:01 k8worker05 rke2: time="2022-10-14T15:09:01Z" level=error msg="Remotedialer proxy error" error="x509: certificate is valid for 10.149.4.146, 10.149.4.32, 10.149.4.77, 10.43.0.1, 127.0.0.1, not 10.149.5.62"
    Obviously
    10.149.5.62
    is the new IP and doesn't match what the cert is advertising. I'm stumped however about how the cert is being generated. The
    /etc/rancher/rke2/config.yaml
    file doesn't have any IP references... There are IPs in
    /var/lib/rancher/rke2/server/tls/dynamic-cert.json
    though these appear to be the result of some process. Any idea how to regenerate these certs?
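The mismatch described in this message can be checked mechanically from the log line itself. A minimal sketch, assuming the error has the `certificate is valid for ..., not ...` shape shown above; `cert_covers_ip` is a hypothetical helper name for illustration and is not part of RKE2.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given IP appears in the
# "certificate is valid for ..." list of an x509 error log line.
cert_covers_ip() {
    # $1: full log line, $2: IP address to look for
    # Keep only the comma-separated list between "valid for" and ", not".
    sans="$(printf '%s\n' "$1" | sed -n 's/.*certificate is valid for \(.*\), not .*/\1/p')"
    # One entry per line, then an exact fixed-string match on the IP.
    printf '%s\n' "$sans" | tr -d ' ' | tr ',' '\n' | grep -Fxq "$2"
}

# Example usage (journalctl invocation is an assumption about the unit name):
#   line="$(journalctl -u rke2-agent | grep 'certificate is valid for' | tail -n 1)"
#   cert_covers_ip "$line" 10.149.5.62 || echo "new IP is not in the serving cert"
```

This only reads what the agent already logged; it does not regenerate or modify any certificate.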
  • f

    flat-notebook-92639

    10/14/2022, 6:55 PM
    Hi, I have a question about the generated rke2.yaml kubeconfig (/etc/rancher/rke2/rke2.yaml). Do you know if it's possible to override the certificate-authority-data? It seems rke2-server uses server/tls/server-ca.crt as the certificate-authority-data, am I right? Is it possible to change the path?
  • m

    magnificent-vr-88571

    10/15/2022, 3:13 AM
    @creamy-pencil-82913, I have logged https://github.com/rancher/rke2/issues/3462. Once I hit the error, I thought of trying to run crictl on the nodes, like this:
    root@server:/home/ubuntu# crictl pull --creds "AWS:eyJwYXlsb2FkIjoieXRSVW5JMzkwRlVneitXNnpPNnJGOGRqYU9yZ0tRbEFIdkF0aGprMjlNTU1JWWdQd095QlJsQ01FUmRCWFVjZlZNNkEyRTdYS3ByeVRwRjhPNWlneStEdEtmcXdrR2tkMnlwM3RNUnFNNG8zOW1xdUsrSlVOemVWWDFUbGEwR1RqdjkyMmtXMWNsVUZuVnJxOEUzM3VubG9wdm5HbVp0a3o2YVdVSGNzM20reDEvbTl1K2dLZTk1ZnhaTnIrdU43SmRyNlBod0Z1TXBMUnNxUzZoZC9rYy9xMmwxbDJRNXk0Nm9scDNtNG9uc29pdjRid1JBMVpIaEdvMDhSS1lac" 1234.dkr.ecr.us-west-2.amazonaws.com/mlflow-run:latest
    This ended in the following error:
    WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock]. As the default settings are now deprecated, you should set the endpoint instead.
    ERRO[0002] connect endpoint 'unix:///var/run/dockershim.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
    ERRO[0004] connect endpoint 'unix:///run/containerd/containerd.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
    FATA[0006] connect: connect endpoint 'unix:///run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded
    To overcome the above error, I added an
    /etc/crictl.yaml
    file with the content below:
    runtime-endpoint: unix:///run/k3s/containerd/containerd.sock
    image-endpoint: unix:///run/k3s/containerd/containerd.sock
    timeout: 10
    After creating the above file, images were pulled successfully from AWS ECR on pod creation.
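The file described in this message could be written with a small helper. This is a sketch only: `write_crictl_config` and its two arguments are hypothetical names for illustration; the socket path and timeout are the values quoted in the message, and writing to `/etc/crictl.yaml` requires root.

```shell
#!/bin/sh
# Hypothetical helper: write a crictl config that points at the
# containerd socket used by RKE2 (values taken from the message above).
write_crictl_config() {
    target="${1:-/etc/crictl.yaml}"
    socket="${2:-unix:///run/k3s/containerd/containerd.sock}"
    cat > "$target" <<EOF
runtime-endpoint: $socket
image-endpoint: $socket
timeout: 10
EOF
}

# Example usage (as root):
#   write_crictl_config
#   crictl ps
```

Passing a different first argument writes the same content elsewhere, which is handy for reviewing the file before it replaces anything under `/etc`.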
  • l

    loud-receptionist-98355

    10/17/2022, 7:42 AM
    Greetings, everyone. Maybe someone knows how to solve this: I have an RKE2 cluster stuck in Updating, with the error message "Init node not found". Looking at this cluster's CRDs, I found that it is stuck trying to restore a nonexistent etcd snapshot to an (already) nonexistent node. As a result, I cannot do anything with the cluster (like adding more master nodes). How do I get it un-stuck?
  • l

    loud-receptionist-98355

    10/17/2022, 2:18 PM
    nevermind, solved it
  • a

    alert-grass-67931

    10/17/2022, 3:51 PM
    Hello all! I recently set up a custom cluster that is being provisioned on OpenStack. The provisioning status is stuck in "waiting for agent to checkin and apply initial plan" because the agent is in a restart loop with status=11/SEGV. I can't seem to break the cycle and debug further. Any guidance or tips on how to proceed, or where else to look for further information?
    rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
    rancher-system-agent.service: Failed with result 'signal'.
    Environment: new install, Ubuntu 22.04 LTS, Rancher v2.6.8, OpenStack, RKE2. Firewalls are open ANY<>ANY on each side to avoid any issues.
  • r

    rich-crowd-36987

    10/17/2022, 4:42 PM
    Follow on question to my attempts to move nodes to different AWS AZs... Tried doing this with another stack and am having less luck! The original cluster had 3 master nodes. Currently 1 of those 3 is functional. On the first master node, if I remove the
    /var/lib/rancher/rke2
    directory and relaunch
    rke2 server
    , it appears to create an entirely new cluster (as the process starts successfully, but
    kubectl get nodes
    only returns itself.) On the second master node, after removing the dir and relaunching the service, it is just showing this loop in the logs:
    Oct 17 16:37:09 k8mst02.espc-nostromo.nos-amc.io rke2[32216]: time="2022-10-17T16:37:09Z" level=info msg="Failed to test data store connection: this server has not yet been promoted from learner to voting member"
    Oct 17 16:37:10 k8mst02.espc-nostromo.nos-amc.io rke2[32216]: time="2022-10-17T16:37:10Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
  • s

    sparse-fireman-14239

    10/18/2022, 11:40 AM
    I have an RKE2 1.24.6 cluster. The other day I added
    etcd-snapshot-schedule-cron: " */6 * * *"
    etcd-snapshot-retention: 56
    to config.yaml and restarted the master node (control-plane,etcd,master), but even after a few days it's still using the default. Can this not be changed after cluster install?
  • n

    nice-answer-21943

    10/18/2022, 2:32 PM
    Hey, I have an RKE2 v1.21 cluster with the Calico CNI. I want to migrate it to Cilium with WireGuard. Is there any way to do this?
  • n

    nice-answer-21943

    10/18/2022, 2:49 PM
    1.23 not 21*
  • n

    numerous-country-20400

    10/18/2022, 8:58 PM
    Hello. I'm currently running RKE2 1.24.4 and have recently run into several Helm charts failing to install due to seccomp issues. Basically, I get told that the container seccomp settings cannot be applied:
    forbidden seccomp may not be set
    For example, the new cert-manager 1.10 introduces a new default security context ( https://artifacthub.io/packages/helm/cert-manager/cert-manager/1.10.0#default-security-contexts ), so I cannot install it on my RKE2 cluster. The same goes for bitnami-wordpress starting with 15.2.0, which also introduces RuntimeDefault as its default seccomp profile. Is there anything missing in my RKE2 configuration, or am I missing the point entirely?
  • m

    millions-australia-75015

    10/19/2022, 11:29 AM
    Hello Team! I am trying an air-gap install of an RKE2 cluster on AlmaLinux 8 VMs. My environment has no upstream DNS server and I left pretty much all of RKE2's config at the defaults. I have an issue with CoreDNS though: none of my pods can resolve the names of the services I create (could not resolve host: <service name>). Basically, CoreDNS seems to be in a failed state. My coredns pods are in CrashLoopBackOff state and the logs say "Plugin/forward: no nameserver found". In this configuration I haven't changed the Corefile yet, and thus I have "forward . /etc/resolv.conf" in the file. Every server node in my cluster has an empty resolv.conf since I have no parent/upstream DNS server. I've tried adding "nameserver 8.8.8.8" to "/etc/resolv.conf"; after I deleted the coredns pods so they would get recreated, they went into a running state, but the logs are full of errors contacting the 8.8.8.8 server (obviously) and pods still can't resolve service names. I've also tried removing the forward plugin from the Corefile; the coredns pods then run correctly with no errors in the logs, but all of my pods keep hitting the same name resolution error. I've launched a busybox pod to help me debug, and every nslookup command gives me a connection refused on the ClusterIP of coredns.
  • g

    gentle-petabyte-40055

    10/19/2022, 4:24 PM
    Hello, I am having an issue with RKE2. I have an RKE1 cluster up on Rancher using the vSphere provisioner, but when I try to set up an RKE2 cluster it gets stuck on connecting to the machines. Is there a special VM image I need to use for RKE2, or am I doing something wrong?
  • g

    gentle-petabyte-40055

    10/19/2022, 4:41 PM
    [Disconnected] Cluster agent is not connected
  • c

    cool-pillow-1781

    10/19/2022, 7:48 PM
    I'm working on setting up an RKE2 cluster and I have an HA control plane tainted so it doesn't run workloads. Is there an argument to be made to go a step further and have etcd run separately from the other control plane services? Is there a best-practice RKE2 deployment page I missed? Thanks in advance for the help.
  • c

    cool-pillow-1781

    10/19/2022, 7:50 PM
    Only reason I could think of would be if I wanted to scale etcd far larger than I wanted the control plane.
  • g

    gentle-petabyte-40055

    10/20/2022, 12:19 AM
    Hey all. What's the best way to move PVCs and PVs from an RKE1 cluster to a new RKE2 cluster?
  • s

    sparse-fireman-14239

    10/20/2022, 8:26 AM
    Once again I have a question about how to properly shut down an RKE2 master node 😄
    1. Cordon node
    2. Drain node
    3. systemctl stop rke2-server
    Step 3 ends with the service being faulted, and I still have etcd, the API server, and everything else running. From an RKE2 perspective, what is the correct way to shut down the relevant services? Thinking primarily of etcd I guess, but ideally I'd shut down the other core services first and etcd last.
c

creamy-pencil-82913

10/20/2022, 8:34 AM
It will always show failed when you stop it. It's a low-priority bug that it doesn't exit with 0 when you signal it to shut down.
s

sparse-fireman-14239

10/20/2022, 8:35 AM
Thanks @creamy-pencil-82913, that's good to know 🙂 Any tips on stopping etcd? Do I just stop it with crictl, or should I enter the container and issue an etcdctl command?
c

creamy-pencil-82913

10/20/2022, 8:35 AM
If you just shut down the node normally, systemd will go through the normal process of signalling things
The killall script also works but is a little brutal
s

sparse-fireman-14239

10/20/2022, 8:36 AM
Ok thanks @creamy-pencil-82913 🙂 Yeah I've looked into using that but as you're saying it's a little brutal.
c

creamy-pencil-82913

10/20/2022, 8:38 AM
Right now we intentionally leave things running when the main rke2 process is stopped. If this is causing problems in your environment, feel free to open an issue. Making it wait and stop all the control-plane components before exiting would be non-trivial but I could see the reason for wanting it.
s

sparse-fireman-14239

10/20/2022, 8:39 AM
Yeah I've read a few GH issues regarding stopping rke2-server and I get your points. For me it's not an issue but logically, if the rke2-server unit starts something, it should also shut it down.
Oh and it'd be lovely if this behavior was documented instead of everyone not understanding why it's faulted and why services are not stopped 🙂
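The sequence discussed in this thread can be sketched as a small script. This is an illustration under assumptions, not an official procedure: the function names and the `DRY_RUN` and `KILLALL` variables are hypothetical, the drain flags are a common choice rather than something stated in the thread, and `rke2-killall.sh` is assumed to be on `PATH`.

```shell
#!/bin/sh
# Hypothetical wrapper: print commands when DRY_RUN=1, otherwise run them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: take an RKE2 server node out of service.
stop_rke2_server_node() {
    node="$1"
    run kubectl cordon "$node"
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    # Per the thread, the unit is reported as failed after a stop; that is
    # a known cosmetic issue, so a non-zero status is tolerated here.
    run systemctl stop rke2-server || true
    # etcd and the other control-plane containers keep running after the
    # unit stops. A normal node shutdown signals them; the killall script
    # is the blunt alternative and is therefore opt-in.
    if [ "${KILLALL:-0}" = "1" ]; then
        run rke2-killall.sh
    fi
}

# Example usage (prints the commands without running them):
#   DRY_RUN=1 stop_rke2_server_node my-server-node
```

Keeping the killall step opt-in mirrors the thread's point that it works but is "a little brutal"; for a planned reboot, stopping the unit and then shutting the node down normally is the gentler path.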