# harvester
w
I think your upgrade got stuck at phase 4 - https://docs.harvesterhci.io/v1.2/upgrade/troubleshooting/#phase-4-upgrade-nodes says not to restart the upgrade in that case. I'd not do anything to your cluster until someone from the team spots this thread - they'll need a support bundle
w
I think when working through this list I ran:
kubectl get jobs -n harvester-system -l harvesterhci.io/upgradeComponent=node
and got no output, so I assumed the jobs had completed - however I may have made a mistake there - fingers crossed we can get this sorted 😕
I did download the logs before doing this and I've just generated a support bundle - ready to send it if needed
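(For anyone following along: a minimal sketch of how the upgrade state can be inspected at this point - these commands weren't run in the thread. Note an empty job list can also mean the jobs were cleaned up, not necessarily that they all completed.)
kubectl get upgrades.harvesterhci.io -n harvester-system
kubectl get jobs -n harvester-system -l harvesterhci.io/upgradeComponent=node -o wide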
p
deleting the upgrade is not a good idea because the upgrade controller still reconciles those two pending nodes
w
Is the simplest thing to decommission the nodes that were pending, reinstall them, and enrol them as new nodes? Currently only one node is cordoned and the other looks healthy - however I'm not sure it was upgraded… I've got RAM upgrades to do anyway, so I guess this is an opportunity to start doing these a node at a time. Or should I consider rolling back the upgrade - though even as I write that, it sounds like it has more potential to cause issues…
p
Is the simplest thing to decommission the nodes that were pending and reinstall, enrol them as new nodes?
No, it would make things worse because the nodes would be missing from Rancher's perspective
I don't have a good idea at this moment, since the upgrade has already been deleted...
w
well - it had completed on 2 nodes, so I was hoping that one by one we could remove the failed nodes completely - wipe them and add them back fresh
(we have 4 nodes total)
1 is cordoned currently - the other three are behaving as if they are upgraded
we can potentially build out a 5th machine to add, but we would need to order another CPU... we can't really go backwards currently as we've already got a bunch of essential VMs running (everything is backed up to offsite S3, but the issue is it would take about 4 days to copy the data back)
Thanks for your help so far btw. I guess Rancher needs to be told that that machine is no longer part of the party?
p
can you generate a support bundle first?
w
Sure
p
I can take a look at the cluster's status. No worries, most secrets are erased in the bundle.
You can DM me
w
I've DM'ed many thanks
b
@worried-state-78253 I have a scenario identical to yours. What was done to resolve it? Did you find any workaround to update the missing nodes, or did you have to reinstall them with version 1.2.1?
w
No workaround yet - it's still being looked at; we're trying to get a definitive way forward worked out. @bland-baker-70724 is our main guy this side, he may also have some thoughts. @prehistoric-balloon-31801 has been looking - we have a conversation going there, but he's very busy atm so we're waiting for his thoughts too.
@kian this is the thread
@bland-baker-70724 this
b
Hello!
p
@worried-state-78253 I checked the bundle. So basically you deleted the upgrade while n2 was being drained. Right now the built-in Rancher is still trying to evict pods from n2; we need to finish this first. Can you first check the log of the pod rancher-67c56bb6b4-pk5cq and see if it's still outputting this:
evicting pod longhorn-system/instance-manager-r-ee5ad69319e908adc763f7b6833839a4
error when evicting pods/"instance-manager-r-ee5ad69319e908adc763f7b6833839a4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
If yes, I think the reason is that you have a VM called g*****-r****r-d11 on n4, but its volume has only 1 replica and the replica lives on n2. Can you confirm this first?
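(A sketch of how both checks can be done, assuming the default layout where the embedded Rancher runs in cattle-system and the Longhorn objects live in longhorn-system; field names may vary slightly by Longhorn version:)
kubectl -n cattle-system logs rancher-67c56bb6b4-pk5cq --tail=50 | grep -i evict
kubectl -n longhorn-system get volumes.longhorn.io -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas
kubectl -n longhorn-system get replicas.longhorn.io -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName,NODE:.spec.nodeID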
w
That makes total sense, I’ll be in in an hour and will confirm this - many thanks 🙏
b
Hi @prehistoric-balloon-31801, here are the logs right now for pod rancher-67c56bb6b4-pk5cq - I can confirm that is still what's happening. And yes, it looks like there is a replica on n2 for a fair few volumes. What would you recommend we do?
w
@kian that VM can be shut down; if the eviction then completes, happy days. If it still can't, then I suspect we'll need to clone the volume to a storage class with replicas (one of the NVMe-based ones) and replace the volume that has no replicas - that volume can then be deleted, and I'd guess the eviction can then complete…
p
For single-replica volumes, you can shut down the VM or increase the replica count temporarily (https://docs.harvesterhci.io/v1.2/upgrade/v1-1-2-to-v1-2-0#5-an-upgrade-is-stuck-in-the-pre-drained-state)
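(Roughly, the temporary replica bump described in that doc can also be done against the Longhorn Volume CR directly - the linked guide itself walks through the Longhorn UI route, and <volume-name> below is a placeholder:)
kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> --type merge -p '{"spec":{"numberOfReplicas":2}}'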
b
Hi @prehistoric-balloon-31801 Everything seems happy and on the correct updated version, but n2 is still cordoned after following the instructions in the guide. Would it be fine to uncordon it?
p
If you manually uncordon it, it will be cordoned again by Rancher, so you will need to get the cluster out of the provisioning state. Let me help you step by step. Can you first share the output of the script ./drain-status.sh from here: https://github.com/bk201/misc/tree/main/rancher
Note my replies will be delayed, it's evening for me...
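(For reference, running the script boils down to something like the following, assuming kubectl and yq are installed and KUBECONFIG points at the Harvester cluster:)
git clone https://github.com/bk201/misc.git
cd misc/rancher
./drain-status.sh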
b
kian@Kians-MacBook-Air ~ % ./drain-status.sh
n1 (custom-3878476cabd0)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
n2 (custom-607ddde4fbb8)
rke-pre-drain: {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
harvester-pre-hook {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
rke-post-drain: {"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}
harvester-post-hook: null
n3 (custom-18b385c5ce5f)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
n4 (custom-4013ca3239b6)
rke-pre-drain: null
harvester-pre-hook null
rke-post-drain: null
harvester-post-hook: null
Hi this is the output, thank you. Hope your evening is going well!
p
Thanks! Can you then do
./post-drain.sh n2
b
kian@Kians-MacBook-Air ~ % ./post-drain.sh n2
+ NODE=n2
++ kubectl get node n2 -o yaml
++ yq -e e '.metadata.annotations."cluster.x-k8s.io/machine"' -
+ MACHINE=custom-607ddde4fbb8
++ kubectl get secret -n fleet-local custom-607ddde4fbb8-machine-plan -o yaml
++ yq -e e '.metadata.annotations."rke.cattle.io/post-drain"' -
+ data='{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
+ echo harvester.cattle.io/post-hook: ''\''{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'\'''
harvester.cattle.io/post-hook: '{"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
+ kubectl annotate secret -n fleet-local custom-607ddde4fbb8-machine-plan 'harvesterhci.io/post-hook={"deleteEmptyDirData":true,"disableEviction":false,"enabled":true,"force":true,"gracePeriod":0,"ignoreDaemonSets":true,"ignoreErrors":false,"postDrainHooks":[{"annotation":"harvesterhci.io/post-hook"}],"preDrainHooks":[{"annotation":"harvesterhci.io/pre-hook"}],"skipWaitForDeleteTimeoutSeconds":0,"timeout":0}'
secret/custom-607ddde4fbb8-machine-plan annotated
kian@Kians-MacBook-Air ~ %
here's the output 🙂
no longer cordoned, looks healthy 🙂
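(A quick sanity check at this point, not run in the thread: a cordoned node shows up as Ready,SchedulingDisabled in the STATUS column, so n2 should now just show Ready.)
kubectl get nodes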
p
kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml
Can you share the output, so I can check the cluster status?
b
kian@Kians-MacBook-Air ~ % kubectl get clusters.provisioning.cattle.io local -n fleet-local -o yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
 annotations:
  kubectl.kubernetes.io/last-applied-configuration: |
   {"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","metadata":{"annotations":{},"name":"local","namespace":"fleet-local"},"spec":{"kubernetesVersion":"v1.25.9+rke2r1","rkeConfig":{"controlPlaneConfig":{"disable":["rke2-snapshot-controller","rke2-snapshot-controller-crd","rke2-snapshot-validation-webhook"]}}}}
  objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQzU7DMBCEXwXt2Slt079Y4oAQ4sCVF9jYS2Ow15G9CYfK746SVqJC4udo78xovjlBIEGLgqBPgMxRUFzkPD1j+0ZGMskiubgwKOJp4eKts6ChT3F02UV2fKyMH7JQqkwiFAL1ozV+MKXqOL6DhoCMRwrEciUYa3Xz7NjePZwj/8xiDAQafDTo/yXOPZrJAUXB3NdFfnGBsmDoQfPgvQKPLflfR+gwd6Bhu9ztt3XdUGNwc7Crdr9u6jW1y/pg91vb2LXdbHarA6jzYpbSVwho6DCNNIMWBd9Yrtu+eiKpzpeiIPdkpnbzx2Wq+0G6R7Z9dCygT2WSCcpwwciURpJPxJRmZtDLUj4DAAD//5CVWGcAAgAA
  objectset.rio.cattle.io/id: provisioning-cluster-create
  objectset.rio.cattle.io/owner-gvk: management.cattle.io/v3, Kind=Cluster
  objectset.rio.cattle.io/owner-name: local
  objectset.rio.cattle.io/owner-namespace: ""
 creationTimestamp: "2023-10-27T14:18:54Z"
 finalizers:
 - wrangler.cattle.io/provisioning-cluster-remove
 - wrangler.cattle.io/rke-cluster-remove
 generation: 4
 labels:
  objectset.rio.cattle.io/hash: 50675339e9ca48d1b72932eb038d75d9d2d44618
  provider.cattle.io: harvester
 name: local
 namespace: fleet-local
 resourceVersion: "29740662"
 uid: 7aec5041-e2e1-4469-909e-314a62837976
spec:
 kubernetesVersion: v1.25.9+rke2r1
 localClusterAuthEndpoint: {}
 rkeConfig:
  chartValues: null
  machineGlobalConfig: null
  provisionGeneration: 1
  upgradeStrategy:
   controlPlaneDrainOptions:
    deleteEmptyDirData: false
    disableEviction: false
    enabled: false
    force: false
    gracePeriod: 0
    ignoreDaemonSets: null
    postDrainHooks: null
    preDrainHooks: null
    skipWaitForDeleteTimeoutSeconds: 0
    timeout: 0
   workerDrainOptions:
    deleteEmptyDirData: false
    disableEviction: false
    enabled: false
    force: false
    gracePeriod: 0
    ignoreDaemonSets: null
    postDrainHooks: null
    preDrainHooks: null
    skipWaitForDeleteTimeoutSeconds: 0
    timeout: 0
status:
 clientSecretName: local-kubeconfig
 clusterName: local
 conditions:
 - lastUpdateTime: "2023-10-27T14:20:36Z"
  message: marking control plane as initialized and ready
  reason: Waiting
  status: Unknown
  type: Ready
 - lastUpdateTime: "2023-10-27T14:18:54Z"
  status: "False"
  type: Reconciling
 - lastUpdateTime: "2023-10-27T14:18:54Z"
  status: "False"
  type: Stalled
 - lastUpdateTime: "2023-11-11T11:30:32Z"
  status: "True"
  type: Created
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  status: "True"
  type: RKECluster
 - status: Unknown
  type: DefaultProjectCreated
 - status: Unknown
  type: SystemProjectCreated
 - lastUpdateTime: "2023-10-27T14:19:09Z"
  status: "True"
  type: Connected
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  message: configuring worker node(s) custom-318894c86e3c,custom-3d5f2a9df91d,custom-5f62c2f5e52e
  reason: Waiting
  status: Unknown
  type: Updated
 - lastUpdateTime: "2023-11-17T12:11:39Z"
  message: configuring worker node(s) custom-318894c86e3c,custom-3d5f2a9df91d,custom-5f62c2f5e52e
  reason: Waiting
  status: Unknown
  type: Provisioned
 observedGeneration: 4
 ready: true
kian@Kians-MacBook-Air ~ %
p
Looks good, but it looks like you have some new nodes to join, right?
Here is the situation of the cluster: basically, all the system services are upgraded to v1.2.1, but because the upgrade was deleted, the OS on n2 and n4 is still on 1.2.0.
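(One way to confirm which nodes are still on the old OS - the node's osImage field should report the Harvester OS version per node:)
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage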
b
I see, how should I go about updating n2 and n4?
p
You can manually upgrade them to the same version (https://docs.harvesterhci.io/v1.2/upgrade/index#prepare-an-air-gapped-upgrade). But please wait until all new nodes join - I don't know what version your new nodes are on.
w
@prehistoric-balloon-31801 I think we only have n1 / n2 / n3 / n4 on this cluster, there shouldn't be any other nodes trying to join. We did attempt to add nodes on 1.2.1 last week - @bland-baker-70724 that was the mini PCs - but they failed to join and we aborted the effort (didn't get any further than getting stuck on the install screen with them trying to connect). They are now part of a separate cluster of machines on which we're installing k3s / Rancher, which we will hook up later.