ambitious-plastic-3551
11/16/2022, 5:44 PM
sparse-fireman-14239
11/17/2022, 8:14 PM
node-taint:
- "CriticalAddonsOnly=true:NoExecute"
However, kubelet is started with what I assume is the correct argument:
--register-with-taints=CriticalAddonsOnly=true:NoExecute
Adding the taint with kubectl works fine.
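For reference, a minimal sketch of the registration-time taint in the RKE2 config and a way to check whether it actually landed (the node name is a placeholder; registration-time taints generally only take effect when a node first joins, so an already-registered node may need the taint added with kubectl instead):

# /etc/rancher/rke2/config.yaml on the node that should carry the taint
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"

# after restarting rke2-server / rke2-agent, verify the taint was registered
$ kubectl get node <node-name> -o jsonpath='{.spec.taints}'
$ kubectl describe node <node-name> | grep -A3 Taints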
sparse-dusk-81900
11/18/2022, 9:29 AM
sparse-dusk-81900
11/18/2022, 9:31 AM
I upgraded Rancher to v2.6.9 and the custom RKE2 clusters to v1.24.7. However, both clusters are now in an “Updating” state, as each has a single node/machine with the status waiting for plan to be applied. What’s the best way to troubleshoot this? I’ve already checked the rancher-system-agent.service on the VMs in question but didn’t find anything suspicious. Also, cluster operations such as a manually triggered cert rotation after the upgrade to v1.24.7 run successfully, even on the affected nodes/machines. Because of that, it just looks like a stale status from a previous sync is keeping the whole clusters in the “Updating” state.
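A few places that are often worth checking for a machine stuck on waiting for plan to be applied (a sketch; the fleet-default namespace and resource names assume a standard Rancher v2.6 provisioning setup, and <cluster-name> is a placeholder):

# on the affected node: watch the system agent apply (or fail to apply) its plan
$ journalctl -u rancher-system-agent -f

# on the Rancher management (local) cluster: compare the machine objects with their plan secrets
$ kubectl -n fleet-default get machines.cluster.x-k8s.io
$ kubectl -n fleet-default get secrets | grep machine-plan
$ kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> -o yaml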
sparse-fireman-14239
11/18/2022, 11:13 AM
early-engineer-43393
11/29/2022, 11:18 AM
We are seeing waiting: waiting for viable init node. Has anyone seen this before? We are not even sure how to troubleshoot it, as we have no VM spun up to investigate and no other output in the logs. Thanks.
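When no VM comes up at all, the conditions on the provisioning objects in the Rancher management cluster are usually the only extra signal (a sketch, assuming the default fleet-default namespace; <cluster-name> is a placeholder):

$ kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> -o yaml
$ kubectl -n fleet-default get machines.cluster.x-k8s.io,machinedeployments.cluster.x-k8s.io
# provisioning errors also tend to show up in the Rancher pod logs
$ kubectl -n cattle-system logs -l app=rancher --tail=200 | grep -i provision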
witty-engineer-12406
11/30/2022, 11:52 AM
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dummy-cr
rules:
  - nonResourceURLs: ["/healthz", "/readyz", "/livez"]
    verbs: ["get"]
  - apiGroups:
      - ""
    resources: ["pods", "pods/exec"]
    verbs: ["get", "delete", "create", "exec", "list"]
  - apiGroups:
      - ""
    resources: ["configmaps"]
    verbs: ["create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dummy-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dummy-cr
subjects:
  - kind: ServiceAccount
    name: dummy-sa
    namespace: dummy-demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dummy-sa
  namespace: dummy-demo
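For anyone reusing this, a quick way to sanity-check what the binding grants once applied (a sketch; the manifest filename is an assumption):

$ kubectl create namespace dummy-demo
$ kubectl apply -f dummy-rbac.yaml
$ kubectl auth can-i list pods --as=system:serviceaccount:dummy-demo:dummy-sa           # expect: yes
$ kubectl auth can-i create configmaps --as=system:serviceaccount:dummy-demo:dummy-sa   # expect: yes
$ kubectl auth can-i get configmaps --as=system:serviceaccount:dummy-demo:dummy-sa      # expect: no, only create/delete were granted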
square-policeman-85866
11/30/2022, 3:29 PM
square-policeman-85866
12/01/2022, 9:17 AM
numerous-nail-55802
12/02/2022, 11:37 AM
gentle-petabyte-40055
12/03/2022, 5:53 AM
gifted-eye-43916
12/05/2022, 2:32 PM
worried-plastic-58654
12/07/2022, 9:07 PM
To install RKE2 and Rancher on AWS EC2, which OS is recommended: openSUSE Leap 15.4, Ubuntu, CentOS, or another?
worried-plastic-58654
12/08/2022, 3:21 PM
boundless-eye-27124
12/09/2022, 12:43 AM
able-engineer-22050
12/09/2022, 10:49 AM
able-engineer-22050
12/09/2022, 10:51 AM
refined-scientist-20236
12/09/2022, 11:10 AM
hundreds-evening-84071
12/12/2022, 9:04 PM
best-microphone-20624
12/13/2022, 9:08 PM
boundless-eye-27124
12/14/2022, 2:41 AM
I'm getting a forbidden sysctl: "net.ipv4.tcp_rmem" not allowlisted error. I've already patched the PSP but am still getting the error.
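Besides the PSP, the kubelet on the relevant nodes also has to allowlist the sysctl, and the pod has to request it. A minimal sketch of both on RKE2 (the pod spec and the tcp_rmem values are illustrative):

# /etc/rancher/rke2/config.yaml on the nodes that should allow the sysctl
kubelet-arg:
  - "allowed-unsafe-sysctls=net.ipv4.tcp_rmem,net.ipv4.tcp_wmem"

# pod requesting the sysctl via its security context
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.tcp_rmem
        value: "4096 87380 6291456"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]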
square-policeman-85866
12/14/2022, 10:05 AM
ambitious-plastic-3551
12/14/2022, 8:23 PM
ambitious-plastic-3551
12/14/2022, 8:23 PM
ambitious-plastic-3551
12/14/2022, 9:34 PM
agreeable-art-61329
12/14/2022, 11:46 PM
I'm getting connection refused on port 9345 of the VIP. Any thoughts?
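Port 9345 is the RKE2 supervisor/registration port and is only served by server nodes, so it can help to test a server directly and then through the VIP (a sketch; addresses are placeholders):

# on a server node behind the VIP: is the supervisor listening?
$ ss -tlnp | grep 9345

# test a server node directly, then the VIP
$ curl -vk https://<server-ip>:9345/v1-rke2/readyz
$ curl -vk https://<vip>:9345/v1-rke2/readyz
# refused on the VIP but not on the node usually points at the load balancer / VIP configuration rather than RKE2 itself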
silly-jordan-81965
12/15/2022, 12:08 PM
lemon-ability-39482
12/19/2022, 9:57 AM
I copied the RKE2 data directory with cp -ar /var/lib/rancher/rke2 /mnt/data/ to preserve all attributes, then modified /etc/rancher/rke2/config.yaml and added the line data-dir: /mnt/data/rke2. This seems to work on agent/worker nodes. On server nodes, however, it looks like the necessary Kubernetes containers can't start. In the rke2-server log, I keep getting the message Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error, while /mnt/data/rke2/agent/containerd/containerd.log looks like this:
time="2022-12-19T09:32:29.103553429+01:00" level=info msg="CreateContainer within sandbox \"478658988f888b30063a9127fb124abd38385967b796e15016675930bbb6cf88\" for container &ContainerMetadata{Name:cloud-controller-manager,Attempt:23,}"
time="2022-12-19T09:32:29.154564201+01:00" level=info msg="CreateContainer within sandbox \"478658988f888b30063a9127fb124abd38385967b796e15016675930bbb6cf88\" for &ContainerMetadata{Name:cloud-controller-manager,Attempt:23,} returns container id \"4a3fe66d16a18ac6397dafc147a68b5bbe9bda1d7d4f7f7ce5e7f95e3a49b84b\""
time="2022-12-19T09:32:29.154953048+01:00" level=info msg="StartContainer for \"4a3fe66d16a18ac6397dafc147a68b5bbe9bda1d7d4f7f7ce5e7f95e3a49b84b\""
time="2022-12-19T09:32:29.276767961+01:00" level=info msg="StartContainer for \"4a3fe66d16a18ac6397dafc147a68b5bbe9bda1d7d4f7f7ce5e7f95e3a49b84b\" returns successfully"
time="2022-12-19T09:32:29.691126028+01:00" level=info msg="shim disconnected" id=4a3fe66d16a18ac6397dafc147a68b5bbe9bda1d7d4f7f7ce5e7f95e3a49b84b
time="2022-12-19T09:32:29.691184071+01:00" level=warning msg="cleaning up after shim disconnected" id=4a3fe66d16a18ac6397dafc147a68b5bbe9bda1d7d4f7f7ce5e7f95e3a49b84b namespace=<http://k8s.io|k8s.io>
time="2022-12-19T09:32:29.691196163+01:00" level=info msg="cleaning up dead shim"
time="2022-12-19T09:32:29.708945825+01:00" level=warning msg="cleanup warnings time=\"2022-12-19T09:32:29+01:00\" level=info msg=\"starting signal loop\" namespace=<http://k8s.io|k8s.io> pid=3497353 runtime=io.containerd.runc.v2\n"
time="2022-12-19T09:32:30.049467561+01:00" level=info msg="RemoveContainer for \"9e40fb319e648b964c90cc77975c6cf7400aac36e53eb6354f738ad31995ce3c\""
time="2022-12-19T09:32:30.056618827+01:00" level=info msg="RemoveContainer for \"9e40fb319e648b964c90cc77975c6cf7400aac36e53eb6354f738ad31995ce3c\" returns successfully"
There are similar messages for kube-apiserver, etcd and kube-controller-manager.
If I remove the data-dir line from my config, it all works again.
Am I doing something wrong here? Some help would be much appreciated.
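A sketch of the same move with a couple of checks that are easy to miss even when the copy itself looks fine (the semanage/restorecon line only applies to SELinux-enforcing hosts, and the paths mirror the ones above):

$ systemctl stop rke2-server                      # stop before copying so etcd/containerd state stays consistent
$ cp -ar /var/lib/rancher/rke2 /mnt/data/         # preserves ownership, modes and timestamps
$ findmnt -no OPTIONS --target /mnt/data          # noexec/nosuid/nodev mount options here can keep containers from starting
$ getenforce                                      # if Enforcing, the new path may need matching SELinux labels:
# semanage fcontext -a -e /var/lib/rancher/rke2 /mnt/data/rke2 && restorecon -R /mnt/data/rke2
$ grep data-dir /etc/rancher/rke2/config.yaml     # data-dir: /mnt/data/rke2
$ systemctl start rke2-server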
creamy-pencil-82913
12/19/2022, 11:30 AM
flat-notebook-92639
12/19/2022, 5:35 PM
I upgraded RKE2 from v1.23.12 to version v1.24.8 and I am facing an issue with the crictl commands. When I load a "big" (around 5 GB) container image locally (/var/lib/rancher/rke2/bin/ctr --address=/var/run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/my-image.tar), I cannot remove it with crictl without getting an error that I did not have with RKE2 v1.23.12.
After some investigation, it seems that crictl v1.24.0 (https://github.com/rancher/rke2/blob/v1.24.8+rke2r1/Dockerfile#L137) may be the source of the problem, because I downloaded version v1.23.0 and the crictl rmi command works well there. Commands and outputs below.
With crictl v1.24.0 and RKE2 v1.24.8:
$ /var/lib/rancher/rke2/bin/crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock inspecti my-registry:30005/my-project/test/test:1.0.0
{
  "status": {
    "id": "sha256:[...]",
    "repoTags": [
      "my-registry:30005/my-project/test/test:1.0.0"
    ],
    "repoDigests": [],
    "size": "5357042852",
    "uid": null,
    "username": "test",
    "spec": null,
    ...
}
$ /var/lib/rancher/rke2/bin/crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock rmi my-registry:30005/my-project/test/test:1.0.0
E1219 17:18:29.677479 127034 remote_image.go:266] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = failed to delete image reference \"sha256:[...]\" for \"sha256:[...]\": context deadline exceeded: unknown" image="my-registry:30005/my-project/test/test:1.0.0"
ERRO[0002] error of removing image "my-registry:30005/my-project/test/test:1.0.0": rpc error: code = Unknown desc = failed to delete image reference "sha256:[...]" for "sha256:[...]": context deadline exceeded: unknown
FATA[0002] unable to remove the image(s)
With crictl v1.23.0 and RKE2 v1.24.8:
$ ./crictl --version
crictl version v1.23.0
$ /var/lib/rancher/rke2/bin/crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock inspecti my-registry:30005/my-project/test/test:1.0.0
{
  "status": {
    "id": "sha256:[...]",
    "repoTags": [
      "my-registry:30005/my-project/test/test:1.0.0"
    ],
    "repoDigests": [],
    "size": "5357042852",
    "uid": null,
    "username": "test",
    "spec": null,
    ...
}
$ ./crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock rmi my-registry:30005/my-project/test/test:1.0.0
Deleted: my-registry:30005/my-project/test/test:1.0.0
Have you heard of a similar issue?
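One thing that might be worth ruling out: crictl takes a global --timeout flag (the default is only a couple of seconds), and deleting a ~5 GB image can plausibly take longer than that, which would surface as exactly this kind of context deadline exceeded error. A sketch reusing the same image reference as above:

$ /var/lib/rancher/rke2/bin/crictl --runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --timeout 120s rmi my-registry:30005/my-project/test/test:1.0.0
# as a cross-check, removing it through ctr directly (same socket and namespace as the import above)
$ /var/lib/rancher/rke2/bin/ctr --address=/var/run/k3s/containerd/containerd.sock -n k8s.io images rm my-registry:30005/my-project/test/test:1.0.0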