# harvester
w
I think your upgrade got stuck at phase 4 - https://docs.harvesterhci.io/v1.2/upgrade/troubleshooting/#phase-4-upgrade-nodes says not to restart the upgrade in that case. I'd not do anything to your cluster until someone from the team spots this thread - they'll need a support bundle
w
I think when working through this list I ran:
kubectl get jobs -n harvester-system -l harvesterhci.io/upgradeComponent=node
and got no output, so I assumed the jobs had completed - however I may have made a mistake there - fingers crossed we can get this sorted 😕
I did download the logs before doing this and I've just generated a support bundle - ready to send it if needed
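(For anyone following along: a minimal sketch of how the upgrade state can be inspected at this point - these commands weren't run in the thread. Note an empty job list can also mean the jobs were cleaned up, not necessarily that they all completed.)
kubectl get upgrades.harvesterhci.io -n harvester-system
kubectl get jobs -n harvester-system -l harvesterhci.io/upgradeComponent=node -o wide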
p
deleting the upgrade is not a good idea because the upgrade controller still reconciles those two pending nodes
w
Is the simplest thing to decommission the nodes that were pending, reinstall them, and enrol them as new nodes? Currently only one node is cordoned and the other looks healthy - however I'm not sure it was upgraded… I've got RAM upgrades to do anyway, so I guess this is an opportunity to start doing these a node at a time. Or should I consider rolling back the upgrade - though even as I write that, it sounds like it has more potential to cause issues…
p
Is the simplest thing to decommission the nodes that were pending and reinstall, enrol them as new nodes?
No, it would make things worse because the nodes would be missing from Rancher's perspective
I don't have a good idea at this moment, since the upgrade has already been deleted...
w
well - it had completed on 2 nodes, so I was hoping that one by one we could remove the failed nodes completely - wipe them and add them back fresh
(we have 4 nodes total)
1 is cordoned currently - the other three are behaving as if they are upgraded
we can potentially build out a 5th machine to add, but we would need to order another CPU... we can't really go backwards currently as we've already got a bunch of essential VMs running (everything is backed up to offsite S3, but the issue is it would take about 4 days to copy the data back)
Thanks for your help so far btw. I guess Rancher needs to be told that that machine is no longer part of the party?
p
can you generate a support bundle first?
w
Sure
p
I can take a look at the cluster's status. No worries, most secrets are erased in the bundle.
You can DM me
w
I've DM'ed many thanks
b
@worried-state-78253 I have a scenario identical to yours. What was done to resolve it? Did you find any workaround to update the missing nodes, or did you have to reinstall them with version 1.2.1?
w
No workaround yet - it's still being looked at; we're trying to get a definitive way forward worked out. @bland-baker-70724 is our main guy this side, he may also have some thoughts. @prehistoric-balloon-31801 has been looking - we have a conversation going there, but he's very busy atm so we're waiting for his thoughts too.
@kian this is the thread
@bland-baker-70724 this
b
Hello!
p
@worried-state-78253 I checked the bundle. So basically you deleted the upgrade while n2 was being drained. Right now the built-in Rancher is still trying to evict pods from n2; we need to finish this first. Can you first check the log of the pod rancher-67c56bb6b4-pk5cq and see if it's still outputting this:
evicting pod longhorn-system/instance-manager-r-ee5ad69319e908adc763f7b6833839a4
error when evicting pods/"instance-manager-r-ee5ad69319e908adc763f7b6833839a4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
If yes, I think the reason is that you have a VM called g*****-r****r-d11 on n4, but its volume has only 1 replica and the replica lives on n2. Can you confirm this first?
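(A sketch of how both checks can be done, assuming the default layout where the embedded Rancher runs in cattle-system and the Longhorn objects live in longhorn-system; field names may vary slightly by Longhorn version:)
kubectl -n cattle-system logs rancher-67c56bb6b4-pk5cq --tail=50 | grep -i evict
kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas
kubectl -n longhorn-system get replicas.longhorn.io -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName,NODE:.spec.nodeID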
w
That makes total sense, I’ll be in in an hour and will confirm this - many thanks 🙏
b
Hi @prehistoric-balloon-31801, here are the logs right now for pod rancher-67c56bb6b4-pk5cq - I can confirm that is still what's happening. And yes, it looks like there is a replica on n2 for a fair few volumes. What would you recommend we do?
w
@kian that VM can be shut down; if the eviction then completes, happy days. If it still can't, then I suspect we'll need to clone the volume to a storage class with replicas (one of the NVMe-based ones) and replace the volume that has no replicas - that volume can then be deleted, and I'd guess the eviction can then complete…
p
For single-replica volumes, you can shut down the VM or increase the replica count temporarily (https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0#5-an-upgrade-is-stuck-in-the-pre-drained-state)
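(Roughly, the temporary replica bump described in that doc can also be done against the Longhorn Volume CR directly - the linked guide itself walks through the Longhorn UI route, and <volume-name> below is a placeholder:)
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> --type merge -p '{"spec":{"numberOfReplicas":2}}'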
b
Hi @prehistoric-balloon-31801 Everything seems happy and on the correct updated version, but n2 is still cordoned after following the instructions in the guide. Would it be fine to uncordon it?
p
If you manually uncordon it, it will be cordoned again by Rancher, so you will need to get the cluster out of the provisioning state. Let me help you step by step. Can you first share the output of the script ./drain-status.sh from here: https://github.com/bk201/misc/tree/main/rancher
Note my replies will be delayed, it's evening for me...
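(For reference, running the script boils down to something like the following, assuming kubectl and yq are installed and KUBECONFIG points at the Harvester cluster:)
git clone https://github.com/bk201/misc.git
cd misc/rancher
./drain-status.sh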
b
kian@Kians-MacBook-Air ~ % ./drain-status.sh
n1 (custom-3878476cabd0)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
n2 (custom-607ddde4fbb8)
rke-pre-drain: {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
harvester-pre-hook {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
rke-post-drain: {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
harvester-post-hook: null
n3 (custom-18b385c5ce5f)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
n4 (custom-4013ca3239b6)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
Hi this is the output, thank you. Hope your evening is going well!
p
Thanks! Can you then do
./post-drain.sh n2
b
kian@Kians-MacBook-Air ~ % ./post-drain.sh n2
+ NODE=n2
++ kubectl get node n2 -o yaml
++ yq -e e '.metadata.annotations."cluster.x-k8s.io/machine"' -
+ MACHINE=custom-607ddde4fbb8
++ kubectl get secret -n fleet-local custom-607ddde4fbb8-machine-plan -o yaml
++ yq -e e '.metadata.annotations."rke.cattle.io/post-drain"' -
+ data='{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
+ echo harvester.cattle.io/post-hook: ''\''{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'\'''
harvester.cattle.io/post-hook: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
+ kubectl annotate secret -n fleet-local custom-607ddde4fbb8-machine-plan 'harvesterhci.io/post-hook={"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
secret/custom-607ddde4fbb8-machine-plan annotated
kian@Kians-MacBook-Air ~ %
here's the output 🙂
no longer cordoned, looks healthy 🙂
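(A quick sanity check at this point, not run in the thread: a cordoned node shows up as Ready,SchedulingDisabled in the STATUS column, so n2 should now just show Ready.)
kubectl get nodes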
p
kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml
Can you share the output, so I can check the cluster status?
b
kian@Kians-MacBook-Air ~ % kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
 annotations:
  kubectl.kubernetes.io/last-applied-configuration: |
   {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.25.9+rke2r1","rkeConfig":{"controlPlaneConfig":{"disable":["rke2-snapshot-controller","rke2-snapshot-controller-crd","rke2-snapshot-validation-webhook"]}}}}
  objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURpJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
  objectset.rio.cattle.io/id: provisioning-cluster-create
  objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
  objectset.rio.cattle.io/owner-name: local
  objectset.rio.cattle.io/owner-namespace: ""
 creationTimestamp: "2023-10-27T14:18:54Z"
 finalizers:
 - wrangler.cattle.io/provisioning-cluster-remove
 - wrangler.cattle.io/rke-cluster-remove
 generation: 4
 labels:
  objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
  provider.cattle.io: harvester
 name: local
 namespace: fleet-local
 resourceVersion: "29740662"
 uid: 7aec5041-e2e1-4469-909e-314a62837976
spec:
 kubernetesVersion: v1.25.9+rke2r1
 localClusterAuthEndpoint: {}
 rkeConfig:
  chartValues: null
  machineGlobalConfig: null
  provisionGeneration: 1
  upgradeStrategy:
   controlPlaneDrainOptions:
    deleteEmptyDirData: false
    disableEviction: false
    enabled: false
    force: false
    gracePeriod: 0
    ignoreDaemonSets: null
    postDrainHooks: null
    preDrainHooks: null
    skipWaitForDeleteTimeoutSeconds: 0
    timeout: 0
   workerDrainOptions:
    deleteEmptyDirData: false
    disableEviction: false
    enabled: false
    force: false
    gracePeriod: 0
    ignoreDaemonSets: null
    postDrainHooks: null
    preDrainHooks: null
    skipWaitForDeleteTimeoutSeconds: 0
    timeout: 0
status:
 clientSecretName: local-kubeconfig
 clusterName: local
 conditions:
 - lastUpdateTime: "2023-10-27T14:20:36Z"
  message: marking control plane as initialized and ready
  reason: Waiting
  status: Unknown
  type: Ready
 - lastUpdateTime: "2023-10-27T14:18:54Z"
  status: "False"
  type: Reconciling
 - lastUpdateTime: "2023-10-27T14:18:54Z"
  status: "False"
  type: Stalled
 - lastUpdateTime: "2023-11-11T11:30:32Z"
  status: "True"
  type: Created
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  status: "True"
  type: RKECluster
 - status: Unknown
  type: DefaultProjectCreated
 - status: Unknown
  type: SystemProjectCreated
 - lastUpdateTime: "2023-10-27T14:19:09Z"
  status: "True"
  type: Connected
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  message: configuring worker node(s) custom-318894c86e3c,custom-3d5f2a9df91d,custom-5f62c2f5e52e
  reason: Waiting
  status: Unknown
  type: Updated
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  message: configuring worker node(s) custom-318894c86e3c,custom-3d5f2a9df91d,custom-5f62c2f5e52e
  reason: Waiting
  status: Unknown
  type: Provisioned
 observedGeneration: 4
 ready: true
kian@Kians-MacBook-Air ~ %
p
Looks good, but it looks like you have some new nodes to join, right?
Here is the situation of the cluster: basically, all the system services are upgraded to v1.2.1, but because the upgrade was deleted, the OS on n2 and n4 is still on 1.2.0.
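(One way to confirm which nodes are still on the old OS - the node's osImage field should report the Harvester OS version per node:)
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage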
b
I see, how should I go about updating n2 and n4?
p
You can manually upgrade them to the same version (https://docs.harvesterhci.io/v1.2/upgrade/index#prepare-an-air-gapped-upgrade). But please wait until all new nodes join - I don't know what version your new nodes are on.
w
@prehistoric-balloon-31801 I think we only have n1 / n2 / n3 / n4 on this cluster, there shouldn't be any other nodes trying to join. We did attempt to add nodes on 1.2.1 last week - @bland-baker-70724 that was the mini PCs - but they failed to join and we aborted the effort (didn't get any further than getting stuck on the install screen with them trying to connect). They are now part of a separate cluster of machines on which we're installing k3s / Rancher, which we will hook up later.