# rke2
c
Can you post the full output? It looks like it's stuck in a state that the helm job pod doesn't handle; I'm curious what that is.
you could try doing
kubectl delete helmchart -n kube-system rke2-coredns
That should trigger an uninstall of the chart. Then restart rke2 on one of the servers and it should put it back.
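Roughly this, as a sketch (assuming rke2-server is managed by systemd on that node):
# delete the HelmChart custom resource; the helm controller will uninstall the release
kubectl delete helmchart -n kube-system rke2-coredns
# restart the server so it re-applies the bundled manifest from
# /var/lib/rancher/rke2/server/manifests and recreates the HelmChart
systemctl restart rke2-server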
b
k, I'll give that a try shortly
if coredns isn't running, are we sure the uninstall will work? i.e. the controller wouldn't be able to resolve kubernetes.svc or whatever and therefore wouldn't do anything..
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
	CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ [[ -f /config/ca-file.pem ]]
+ [[ -n '' ]]
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-coredns.tgz.base64
+ CHART_PATH=/tmp/rke2-coredns.tgz
+ [[ ! -f /chart/rke2-coredns.tgz.base64 ]]
+ base64 -d /chart/rke2-coredns.tgz.base64
+ CHART=/tmp/rke2-coredns.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-coredns.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-coredns$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=1.9.3,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-10_HelmChartConfig.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ 1.9.3 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 rke2-coredns /tmp/rke2-coredns.tgz --values /config/values-10_HelmChartConfig.yaml
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
that's the full log
c
ah, it got stuck uninstalling somehow
you can try deleting it to trigger a retry of the delete, or use the helm cli to do it manually
or worst case scenario delete the helm secret
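for example, a sketch of both options (release and secret names per the chart above; assumes the helm CLI / kubectl are pointed at the cluster):
# option 1: finish the stuck uninstall manually with the helm CLI
helm uninstall rke2-coredns -n kube-system
# option 2 (worst case): delete the helm release secrets so the next install starts clean
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-coredns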
b
I was planning on that last option
unless you want me to try something else first..
just in the middle of something else so will be a few minutes
c
nah, I’d probably just nuke the secret
b
ok, seems to have done the trick FYI
I'll keep an eye on it for a bit
it seems to be in some sort of loop… constantly running helm-install-rke2-coredns-pg5k2
like, it's successfully deployed, but that job keeps running and completing, and I'm not sure why it keeps going non-stop
c
Do you have different content in the rke2-coredns manifest on different nodes?
what do you see if you do
kubectl describe addon -n kube-system rke2-coredns
?
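and if you want to compare the manifest across server nodes, something like this on each one (a sketch, assuming the default rke2 data dir):
# if the hashes differ between servers, each one will keep re-applying its own version
sha256sum /var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml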
b
Copy code
kubectl describe addon -n kube-system rke2-coredns
Name:         rke2-coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>
API Version:  k3s.cattle.io/v1
Kind:         Addon
Metadata:
  Creation Timestamp:  2023-05-12T20:33:39Z
  Generation:          2
  Resource Version:    986083903
  UID:                 e04c254f-eef8-465a-9989-31267358160f
Spec:
  Checksum:  46e8acef2cd2a7d6cd576f90325a27836d4f8a0e1360243bc419064a12e35078
  Source:    /var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml
Status:
  Gvks:
    Group:    helm.cattle.io
    Kind:     HelmChart
    Version:  v1
Events:       <none>
c
hmm no events from it being redeployed
the pod logs show it succeeding, but the job keeps re-running?
b
it does complete
but keeps running every minute or 2
for example I'm at revision 21 already lol
after wiping those secrets
c
something is changing it then
is the job being updated? or is the job just re-running the pod because it's failing?
b
pod is not failing
jobs appear to be deleted and then recreated
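roughly how I'm seeing that, for what it's worth (a sketch):
# watch the install job disappear and come back
kubectl get jobs -n kube-system -w | grep helm-install-rke2-coredns
# the creationTimestamp resets each time the job is recreated
kubectl get job -n kube-system helm-install-rke2-coredns -o jsonpath='{.metadata.creationTimestamp}{"\n"}'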
c
what about
kubectl describe helmchart -n kube-system rke2-coredns
have you deployed a HelmChartConfig to customize the coredns config?
b
no
so this is a cluster migrated from rke1 -> rke2; it appears the metrics-server is showing the same behavior
sh.helm.release.v1.rke2-metrics-server.v2965     helm.sh/release.v1                    1      4m19s
sh.helm.release.v1.rke2-metrics-server.v2966     helm.sh/release.v1                    1      3m19s
sh.helm.release.v1.rke2-metrics-server.v2967     helm.sh/release.v1                    1      2m20s
sh.helm.release.v1.rke2-metrics-server.v2968     helm.sh/release.v1                    1      80s
sh.helm.release.v1.rke2-metrics-server.v2969     helm.sh/release.v1                    1      19s
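(that listing came from something along these lines, sorted oldest first; may not be the exact command I ran:)
kubectl get secrets -n kube-system -l owner=helm,name=rke2-metrics-server --sort-by=.metadata.creationTimestamp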
we converted it on Friday so it's been a few days...
c
oh
yeah, rke1->rke2 migrations are still highly experimental and not supported
we really recommend people stand up new clusters and migrate workloads over; we are not currently planning on moving forward with direct conversion support
b
based on the speed of progress on the tool I'm guessing it's likely to never be supported 😉
regardless, we've done several others and I haven't noticed this issue
c
yeah we took a shot at it, but given there is no good way to roll back if you run into problems, our support org didn’t want to be on the hook for a bunch of potential outages caused by a tool that doesn’t have a back-out option.
b
understood completely, no complaints here, we have meticulous notes/details about how to do it for our setup and we're walking through them slowly
c
if you look at the rke2-server logs there should be something in there describing why it’s updating the helm job
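e.g. something like this on a server node (a sketch, assuming systemd):
# follow the rke2 server logs and filter for the coredns chart
journalctl -u rke2-server -f | grep -i rke2-coredns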
b
I'm just glad something exists to migrate, migrating workloads would be a larger nightmare for us 😞
I'll check out the logs and see what we can discover, gimme a few and I'll report back
May 16 17:53:04 na01lkubrchd03 rke2[2402]: I0516 17:53:04.287517    2402 event.go:294] "Event occurred" object="kube-system/rke2-coredns" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-coredns"
May 16 17:53:04 na01lkubrchd03 rke2[2402]: time="2023-05-16T17:53:04-04:00" level=error msg="error syncing 'kube-system/rke2-coredns': handler helm-controller-chart-registration: DesiredSet - Replace Wait batch/v1, Kind=Job kube-system/helm-install-rke2-coredns for helm-controller-chart-registration kube-system/rke2-coredns, requeuing"
ok, sorry to bother, it was a manual deployment of the k3s-helm-operator fighting with the rke2 built-in stuff
apparently that step in the meticulous notes got forgotten on this cluster 😞
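in case it helps anyone else, roughly how we spotted it (a sketch; the name/namespace of the standalone deployment depends on how it was installed):
# look for a second helm controller fighting with the one built into rke2
kubectl get deployments -A | grep -iE 'helm-(controller|operator)'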
c
yall were running the k3s helm-controller on RKE1!?
b
yeah 😄
we use it for some internal deployments (not cluster services, but actual workloads)