# rke2
f
oh, I think I understand: we just need to grab this container and add it to a plan every time we want to update
g
Yep exactly! Or you can upgrade based on a channel, but doing that you’d want an automated solution to pull the rke2-upgrade containers into your private registry as well; otherwise the upgrade will of course fail because the image can't be pulled
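(For reference, a hedged sketch of what a channel-based server Plan might look like, assuming the public RKE2 stable channel URL; in an airgap the rke2-upgrade images resolved from the channel would still need to be mirrored into the private registry first:)
Copy code
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  # resolve the target version from a release channel instead of pinning one
  channel: https://update.rke2.io/v1-release/channels/stable
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
  upgrade:
    image: rancher/rke2-upgrade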
f
got it, thanks. Do you know if nodes need to reboot after being "upgraded"? We're running in an Azure VM scale set - essentially the equivalent of an AWS autoscaling group. Just trying to weigh our options on what's least painful: maintaining some kind of ansible/bash script to drain + kill VMs, or implementing this
g
Nope, just the rke2 process needs to be restarted (which, if you are using automated upgrade, it will handle all that for you).
f
How about any complications with SELinux? Or we should be good to go with the rke2-selinux package?
g
should be good to go with the rke2-selinux package
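(Side note, a minimal sketch: with the rke2-selinux policy package installed on the host, SELinux support is toggled in the rke2 config rather than in the upgrade plans. The path and key below are the standard rke2 ones:)
Copy code
# /etc/rancher/rke2/config.yaml
selinux: true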
f
alright, I'll give it a shot. appreciate it.
👍 1
@gray-lawyer-73831 Should I be pulling these images from docker-hub or are you guys hosting somewhere else? It doesn't look like the latest release is here: https://hub.docker.com/r/rancher/rke2-upgrade/tags?page=1 nvm, looks like it's just a different naming convention: v1.23.6-rc4+rke2r1 vs v1.23.6-rc4-rke2r1
f
Kind of a stupid side question.. what's the difference between v1.21, 1.22, 1.23?
correlation to k8s version, I'm guessing?
yeah, that was a dumb question heh
g
correlation to k8s version, I’m guessing?
Yep! 😄
f
Hey - so I got this thing deployed in my airgap and all my images mirrored... plans deployed etc. but it doesn't look like it's doing anything?
I see this env var that I kustomized in -
SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE
and I'm using a private registry.. I'm wondering if I need to put any imagePullSecrets somewhere?
g
what’s your current rke2 version, and what version or channel do you have in your plan?
f
Copy code
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: rke2-system-upgrade-controller
  namespace: bigbang
spec:
  interval: 1m
  sourceRef:
    kind: GitRepository
    name: rke2-system-upgrade-controller-repo
  path: .
  prune: true
  images:
  - name: rancher/system-upgrade-controller
    newName: private.registry.internal/rancher/system-upgrade-controller
    newTag: v0.9.1
  patches:
    - patch: |-
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: default-controller-env
        data:
          SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE: private.registry.internal/rancher/kubectl:v1.23.6
      target:
        kind: ConfigMap
    - patch: |-
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: system-upgrade-controller
          namespace: system-upgrade
        spec:
          template:
            spec:
              imagePullSecrets:
                - name: private-registry 
      target:
        kind: Deployment
I have the images airgapped with those tags.. and it's mirrored correctly, etc, the deployment is deployed
my plan:
my plan is also kustomized in -
Copy code
patches:
  - target:
      kind: Plan
    patch: |-
      apiVersion: upgrade.cattle.io/v1
      kind: Plan
      metadata:
        name: whatever
      spec:
        version: v1.22.9-rke2r1
I'm just using the default and patching in the versions
g
so, from the initial thing you posted, do you have a namespace
system-upgrade
and do you see the
system-upgrade-controller
deployed in that namespace?
f
yeah
g
and what’s the current rke2 version running in the cluster?
f
v1.22.6+rke2r1
I may need to update the image paths in my plans now that I'm looking at it..
👍 1
I'm guessing I'll need image pull secrets too?
g
Yeah so you have the prereqs, next is just making sure the Plan is up to snuff
f
Copy code
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade, operator: Exists}
      - {key: rke2-upgrade, operator: NotIn, values: ["disabled", "false"]}
      # When using k8s version 1.19 or older, swap control-plane with master
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
#  drain:
#    force: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.23.1+rke2r2
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade, operator: Exists}
      - {key: rke2-upgrade, operator: NotIn, values: ["disabled", "false"]}
      # When using k8s version 1.19 or older, swap control-plane with master
      - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade
  cordon: true
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.23.1+rke2r2
I have this; I'm just kustomizing over it
g
You might yeah, but if all the images necessary for v1.22.9 are present in your private registry then you might not
f
well, uhh
is there anywhere in here I can fit in some secrets?
( if needed)
not totally sure what a plan does tbh
g
I don’t think there’s anywhere to put in secrets actually
f
fudge
g
but that shouldn’t block anything
the Plans tell system-upgrade-controller what to do
f
alright, cool - I wasn't sure if it was trying to do some other pod standup or something
how about this other rancher/kubectl container?
g
When it works, it will create Jobs (which will create pods) to upgrade
in my plans, I don’t mess with the kubectl version at all, but you should be able to adjust it as you’ve done. I think that is used in the jobs that get deployed
but my knowledge in this area is a little bit fuzzy
here are known working plans though (not necessarily for airgap, but should be the same):
Copy code
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  version: v1.22.8-rke2r1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/master, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
  #drain:
  #  force: true
  upgrade:
    image: rancher/rke2-upgrade
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-agent
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 2
  version: v1.22.8-rke2r1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/master, operator: NotIn, values: ["true"]}
  serviceAccountName: system-upgrade
  prepare:
    image: rancher/rke2-upgrade
    args: ["prepare", "rke2-server"]
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
f
so I kustomized our image path over this ->
Copy code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-controller-env
  namespace: system-upgrade
data:
  SYSTEM_UPGRADE_CONTROLLER_DEBUG: "false"
  SYSTEM_UPGRADE_CONTROLLER_THREADS: "2"
  SYSTEM_UPGRADE_JOB_ACTIVE_DEADLINE_SECONDS: "900"
  SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT: "99"
  SYSTEM_UPGRADE_JOB_IMAGE_PULL_POLICY: "Always"
  SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE: "rancher/kubectl:v1.21.9"
  SYSTEM_UPGRADE_JOB_PRIVILEGED: "true"
  SYSTEM_UPGRADE_JOB_TTL_SECONDS_AFTER_FINISH: "900"
  SYSTEM_UPGRADE_PLAN_POLLING_INTERVAL: "15m"
just all of our images usually need a secret to be pulled
g
Your best bet is checking throughout https://github.com/rancher/system-upgrade-controller to see if there is some imagepullsecret support
f
ahh so that's like a uh, host volume secret mount or something
g
airgap of course always makes things more complicated, and to be honest, it’s probably safer in airgap to do what I call a “manual upgrade”
because there could be cases where you don’t have the images necessary for an upgrade, and then these automated upgrades try to pull anyway, and end up breaking your cluster
f
haven't had time to learn go yet heh
g
yeah that configmap controls the system-upgrade-controller deployment, which is the brains (the controller) behind how it applies the plans to upgrade your cluster
f
I THINK.. I need a secret in here somewhere so something can spin up a pod with this container?
KubectlImage
- this guy. yeah I'm not seeing anywhere to add
imagePullSecrets
to whatever pod spec
g
It’s possible that doesn’t exist and would need an enhancement to system-upgrade-controller in general. It wouldn’t hurt if you want to create an issue in that repo with the details of what you think you’d need, and if it does exist, someone with more knowledge of this than me can respond and hopefully point you in the right direction. Or if it doesn’t, it can get added and supported eventually 🙂
f
cool - thanks for poking around for me. I'll try to put one in. Just didn't know what I was looking at/for tbh :^)
g
Sorry I’m not much help here! It’s fun debugging this stuff though. Thank you!
f
it's weird, I think everything is deployed.. it's just not doing anything lol
from the rancher/system-upgrade-controller:v0.9.1 pod:
Copy code
system-upgrade-controller-5bd59b74fc-hnqn6 W0505 17:56:39.022311       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
system-upgrade-controller-5bd59b74fc-hnqn6 time="2022-05-05T17:56:39Z" level=info msg="Applying CRD plans.upgrade.cattle.io"
system-upgrade-controller-5bd59b74fc-hnqn6 time="2022-05-05T17:56:40Z" level=info msg="Starting /v1, Kind=Node controller"
system-upgrade-controller-5bd59b74fc-hnqn6 time="2022-05-05T17:56:40Z" level=info msg="Starting /v1, Kind=Secret controller"
system-upgrade-controller-5bd59b74fc-hnqn6 time="2022-05-05T17:56:40Z" level=info msg="Starting batch/v1, Kind=Job controller"
system-upgrade-controller-5bd59b74fc-hnqn6 time="2022-05-05T17:56:40Z" level=info msg="Starting upgrade.cattle.io/v1, Kind=Plan controller"
g
I’d try to minimize your plans, maybe only give it a server plan for now. When those are applied and it is a new version and working, you should see jobs/pods created
specifically minimize the nodeSelector in there
f
okay - I'll try pulling off those two rke2-upgrade labels? what would normally put them there though?
Copy code
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: rke2-upgrade, operator: Exists}
      - {key: rke2-upgrade, operator: NotIn, values: ["disabled", "false"]}
      # When using k8s version 1.19 or older, swap control-plane with master
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
#  drain:
#    force: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.23.1+rke2r2
g
Copy code
# Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
#  drain:
#    force: true
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.23.1-rke2r2
Try that
I think I see the issue
the
version
is confusing with these Plans
it should be a dash instead of a plus
f
🤪
hey I'm getting some errors from OPA gatekeeper blocking some stuff, that's progress
🎉 1
time to go create some new exceptions..weee
g
😄
f
I see a job
🦜 1
yeah in the job container spec it's referencing my airgapped image:
Copy code
- name: SYSTEM_UPGRADE_PLAN_LATEST_VERSION
  value: v1.22.9-rke2r1
image: private.registry/rancher/rke2-upgrade:v1.22.9-rke2r1
imagePullPolicy: Always
name: upgrade
it just looks like it's stuck maybe
g
you don’t see any pods created? yeah probably stuck from something.. maybe those secrets here too 🤔
f
yeah I look at the job and there's no pods in there
hold on... OPA gatekeeper was blocking privileged containers even through my wildcard exceptions...
Copy code
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  113s  default-scheduler  0/7 nodes are available: 3 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  52s   default-scheduler  0/7 nodes are available: 3 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
ahhh super close
🙏 1
I think those taints are part of our RKE2 deployment... I don't remember why..
alright, hopefully I can add a toleration to this thing
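(A hedged sketch of tolerating that taint from within the Plan itself, assuming the Plan CRD's tolerations field is available in the controller version in use:)
Copy code
spec:
  tolerations:
    # let the upgrade job pods schedule onto the tainted server nodes
    - key: CriticalAddonsOnly
      operator: Exists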
ahhh!!!!
I got past gatekeeper + taints and tolerations
Copy code
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  66s                default-scheduler  Successfully assigned system-upgrade/apply-server-plan-on-vm-gvzonecil2zackrke2server000002--1-6qvnr to vm-gvzonecil2zackrke2server000002
  Normal   Pulling    25s (x3 over 67s)  kubelet            Pulling image "private.registry/rancher/kubectl:v1.23.6"
  Warning  Failed     24s (x3 over 66s)  kubelet            Failed to pull image "private.registry/rancher/kubectl:v1.23.6": rpc error: code = Unknown desc = failed to pull and unpack image "private.registry/rancher/kubectl:v1.23.6": failed to resolve reference "private.registry/rancher/kubectl:v1.23.6": pulling from host zarf.c1.internal failed with status code [manifests v1.23.6]: 401 Unauthorized
  Warning  Failed     24s (x3 over 66s)  kubelet            Error: ErrImagePull
  Normal   BackOff    13s (x3 over 66s)  kubelet            Back-off pulling image "private.registry/rancher/kubectl:v1.23.6"
  Warning  Failed     13s (x3 over 66s)  kubelet            Error: ImagePullBackOff
I'm mirroring private.registry -> zarf.c1.internal
just need those imagePullSecrets
g
You might be able to edit the job directly and add those! Also, maybe putting the credentials in registries.yaml to avoid messing around with imagepullsecrets entirely
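(A registries.yaml sketch for that approach; the credentials are placeholders and the file lives at /etc/rancher/rke2/registries.yaml on every node:)
Copy code
mirrors:
  private.registry:
    endpoint:
      - "https://private.registry"
configs:
  private.registry:
    auth:
      username: registry-user   # placeholder
      password: registry-pass   # placeholder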
f
I added the credentials to registries.yaml on one of the nodes and it upgrades
only problem is it's a pain because the registry creds are randomly generated after the rke2 cluster is up and running heh
we need to figure out a way to generate the secrets and have our private registry use them..
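(One hedged option, assuming the generated credentials can be captured at install time: template them into a dockerconfigjson Secret named private-registry, which is what the kustomization patches later reference; the data value below is a placeholder:)
Copy code
apiVersion: v1
kind: Secret
metadata:
  name: private-registry
  namespace: system-upgrade
type: kubernetes.io/dockerconfigjson
data:
  # base64 of a docker config.json containing the registry auth (placeholder)
  .dockerconfigjson: eyJhdXRocyI6e319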
g
Ahh interesting setup! So you bring the cluster up using the tarball then I assume, and then run a private registry within the cluster itself?
f
using a tool called zarf which packages up all of our artifacts and hosts them in a docker registry and gitea server
so it's a fat tarball with everything that we can bring into an airgap
g
makes sense. So it’s much easier to just use imagepullsecrets for everything then and ensure to include that in the manifests
okay for one of the nodes that hasn’t upgraded yet, are you able to edit the job directly and add the imagepullsecrets?
or maybe just the pod
f
on the job? let me try
I can't change it on the pod because you can't change a pod spec for imagePullSecrets
Copy code
# pods "apply-server-plan-on-vm-gvzonecil2zackrke2server000000--1-h4pnm" was not valid:
# * spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
#   core.PodSpec{
#       ... // 11 identical fields
#       NodeName:         "vm-gvzonecil2zackrke2server000000",
#       SecurityContext:  &{HostNetwork: true, HostPID: true, HostIPC: true},
# -     ImagePullSecrets: []core.LocalObjectReference{{Name: "private-registry"}},
# +     ImagePullSecrets: nil,
#       Hostname:         "",
#       Subdomain:        "",
#       ... // 14 identical fields
#   }

🤦 1
g
right
f
and I can't change the job
and I can't have the plan put it in the job
g
darn, okay. I’m going to do some internal snooping around to see if there’s a way. Would you create an issue in system-upgrade-controller repo as well? I think this is something that we don’t currently have but clearly it could be nice to include
f
yeah I don't think it should be a heavy lift to add it to the job templating for the pod spec
💯 1
not that I'm a go developer
g
It shouldn’t be, but I can’t guarantee that it’ll get completed at all or at least anytime soon since there are always a lot of other priorities going on, but let’s see what we can do! 💪 Thank you for debugging on this too, and I’m glad we found something that works even if it’s a pain right now
hey
f
huh
interesting
can I kustomize that in?
I'm afk a bit - will try later for surs
g
I think you probably can, but I’ve never done that before so I’m not sure! I asked someone a lot smarter than me and he pointed me there 🙂
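(Judging from the kustomization that follows, the trick is attaching imagePullSecrets to the ServiceAccount the Plans use, so the upgrade pods inherit the pull secret; a minimal standalone sketch:)
Copy code
apiVersion: v1
kind: ServiceAccount
metadata:
  name: system-upgrade
  namespace: system-upgrade
imagePullSecrets:
  - name: private-registry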
f
damn, that worked
nice
g
beautiful
f
Copy code
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: rke2-system-upgrade-controller
  namespace: bigbang
spec:
  interval: 1m
  sourceRef:
    kind: GitRepository
    name: rke2-system-upgrade-controller-repo
  path: .
  prune: true
  images:
  - name: rancher/system-upgrade-controller
    newName: private.registry/rancher/system-upgrade-controller
    newTag: v0.9.1
  patches:
    - patch: |-
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: default-controller-env
        data:
          SYSTEM_UPGRADE_JOB_KUBECTL_IMAGE: private.registry/rancher/kubectl:v1.22.6
      target:
        kind: ConfigMap
    - patch: |-
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: system-upgrade-controller
          namespace: system-upgrade
        spec:
          template:
            spec:
              imagePullSecrets:
                - name: private-registry 
      target:
        kind: Deployment
    - patch: |-
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: system-upgrade
          namespace: system-upgrade
        imagePullSecrets:
        - name: private-registry
      target:
        kind: ServiceAccount
thanks for helping me out man
still going to keep the issue up, but this is totally workable compared to shoving it into registries.yaml for us
g
Yeah I’m glad we got something going! I’ll comment on the issue that doing this works too 🙂
f
so now that I got this thing working.. I'm getting a lot of errors on my worker nodes upgrading regarding not being able to evict pods.. so looking into that now
Copy code
drain evicting pod logging/logging-ek-es-master-0
drain evicting pod istio-system/passthrough-ingressgateway-7879ff64db-kh86m
drain evicting pod gatekeeper-system/gatekeeper-controller-manager-5bd878c895-4sbrp
drain error when evicting pods/"passthrough-ingressgateway-7879ff64db-kh86m" -n "istio-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
drain error when evicting pods/"gatekeeper-controller-manager-5bd878c895-4sbrp" -n "gatekeeper-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
drain error when evicting pods/"logging-ek-es-master-0" -n "logging" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
a bunch of my operators/helm charts had pod disruption budgets - so we are scaling up 🙂
💪 1
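(Illustrative sketch only, names and labels assumed: a PodDisruptionBudget whose minAvailable equals the current replica count never allows an eviction, so the drain retries forever; scaling the workload above minAvailable, or loosening the budget, lets the drain complete:)
Copy code
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gatekeeper-controller-manager
  namespace: gatekeeper-system
spec:
  minAvailable: 1          # with a single replica this blocks every eviction
  selector:
    matchLabels:
      control-plane: controller-manager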