# harvester
w
A different experience for me - with my 2-node cluster the first node got stuck at Pre-drained but the second node was still happy (not the failed rke2-agent problem). After leaving it over the weekend I rebooted it, then the upgrade proceeded and both nodes are now on 1.3.2.
have you tried manually restarting the rke2-server process on the first node?
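For reference, a minimal sketch of what a manual restart plus log check looks like, assuming the standard unit name rke2-server.service used by Harvester's RKE2 install:
```
# restart the RKE2 server unit and check how it comes back up
sudo systemctl restart rke2-server.service
sudo systemctl status rke2-server.service --no-pager

# follow the journal for the unit to see why it fails, if it does
sudo journalctl -u rke2-server.service -f
```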
s
Interesting... looking carefully at the output of systemctl list-units, the first node (the one that's been "updated") does not have an rke2-server.service:
...
Copy code
lvm2-monitor.service                                                                          loaded active exited    Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
  rancher-system-agent.service                                                                  loaded active running   Rancher System Agent
  smartd.service                                                                                loaded active running   Self Monitoring and Reporting Technology (SMART) Daemon
  sshd.service                                                                                  loaded active running   OpenSSH Daemon
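A quicker way to look for the relevant units than reading the whole listing is to filter it; this sketch assumes the usual rke2-server / rke2-agent / rancher-system-agent unit names:
```
# show only rke2/rancher related units, including ones that are not running
systemctl list-units --all --type=service | grep -E 'rke2|rancher'

# also check unit files systemd knows about, even if they were never started
systemctl list-unit-files | grep -E 'rke2|rancher'
```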
w
so there's no /etc/systemd/system/rke2-server.service file?
s
good point - there is, but it's empty:
Copy code
rancher@harvester003:~> ls -la /etc/systemd/system/rke2-server.service
-rw-r--r-- 1 root root 0 Sep 23 19:12 /etc/systemd/system/rke2-server.service
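Since one unit file is zero length, it may be worth checking whether anything else under /etc/systemd/system was truncated the same way; a small sketch:
```
# list any zero-length files under /etc/systemd/system (and its drop-in dirs)
sudo find /etc/systemd/system -maxdepth 2 -type f -empty -ls
```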
w
Hmm, most odd. The contents of mine are as follows if you want to try recreating it, though I wonder what else has happened to your node.
Copy code
[Unit]
Description=Rancher Kubernetes Engine v2 (server)
Documentation=https://github.com/rancher/rke2#readme
Wants=network-online.target
After=network-online.target
Conflicts=rke2-agent.service

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/opt/rke2/lib/systemd/system/%N.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/opt/rke2/bin/rke2 server
ExecStopPost=-/bin/sh -c "systemd-cgls /system.slice/%n | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill"
EnvironmentFile=-/var/lib/rancher/rke2/system-agent-installer/rke2-sa.env
Since you have 3 nodes, your other two should also have rke2-server rather than rke2-agent, so the above file should exist on both of those.
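If recreating the file by hand, systemd also needs to re-read its configuration before the unit can be started; roughly, assuming the contents above are written to /etc/systemd/system/rke2-server.service:
```
# after restoring the unit file, reload systemd and bring the service up
sudo systemctl daemon-reload
sudo systemctl enable --now rke2-server.service
sudo journalctl -u rke2-server.service -f
```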
s
The other two nodes haven't been updated yet, so they may not be the same as your version - but thank you, I'll try that. What worries me is (a) why it has been truncated, and (b) what other damage has been done.
w
I trust /etc/systemd hasn't run out of space 😱
s
on the non-updated nodes I have:
Copy code
-rw-r--r--  1 root root 554 Jul  6 13:27 /etc/systemd/system/rancher-system-agent.service
-rw-r--r--  1 root root 868 Sep 23 12:33 /etc/systemd/system/rke2-agent.service
-rw-r--r--  1 root root 943 Sep 23 12:33 /etc/systemd/system/rke2-server.service
-rw-r--r--. 1 root root 317 Jun 14 02:00 /etc/systemd/system/rke2-shutdown.service
On the updated node I have:
Copy code
-rw-r--r--  1 root root 554 Jul  6 13:27 /etc/systemd/system/rancher-system-agent.service
-rw-r--r--  1 root root   0 Sep 23 19:12 /etc/systemd/system/rke2-agent.service
-rw-r--r--  1 root root   0 Sep 23 19:12 /etc/systemd/system/rke2-server.service
-rw-r--r--. 1 root root 317 Sep  6 07:01 /etc/systemd/system/rke2-shutdown.service
👀 1
So maybe the two rke2-[agent|server] service files have been updated (yesterday's date)
w
on my updated node I have:
Copy code
-rw-r--r--   1 root root  554 Jun 14 10:40 rancher-system-agent.service
-rw-r--r--   1 root root  554 Nov 22  2022 rancher-system-agent.service.ORIG
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rancher-system-agent.service.d
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rancherd.service.d
drwxr-xr-x   2 root root 4096 Sep 12  2023 reboot.target.requires
drwxr-xr-x.  2 root root 4096 Sep  3 17:53 remote-fs.target.wants
-rw-r--r--   1 root root  868 Sep 23 17:15 rke2-agent.service
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rke2-agent.service.d
-rw-r--r--   1 root root  943 Sep 23 17:15 rke2-server.service
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rke2-server.service.d
-rw-r--r--.  1 root root  317 Sep  6 07:01 rke2-shutdown.service
s
/etc seems to be on / and it has plenty of space:
Copy code
/dev/loop0     ext2      3.0G  1.3G  1.6G  46% /
w
try /etc/systemd/system - for me /etc/systemd is a separate mount
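findmnt is a handy way to see which filesystem actually backs a path, including bind mounts; a sketch:
```
# show the mount (and any bind source) that backs /etc/systemd/system
findmnt --target /etc/systemd/system

# report free space on whichever filesystem that resolves to
df -h /etc/systemd/system
```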
s
oh - I see. /etc/systemd is a bind mount from the /usr/local filesystem:
Copy code
rancher@harvester003:~> cat /etc/fstab | grep '/etc/systemd'
/usr/local/.state/etc-systemd.bind /etc/systemd none defaults,bind 0 0
Seems to be plenty of space there:
Copy code
rancher@harvester003:~> df -h /usr/local
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p5   98G   75G   19G  81% /usr/local
So, I made rke2-server.service, and when I start that service I find that /opt/rke2/bin/rke2 is also zero length. So it looks like a lot of useful files are zero length and this node is screwed.
Copy code
rancher@harvester003:~> ls -la /opt/rke2/bin/rke2
-rwxr-xr-x 1 root root 0 Aug  1 22:36 /opt/rke2/bin/rke2
Can I stop the upgrade, roll this node back and try again? Or, what are my other options?
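Before choosing a recovery path, it might help to gauge how widespread the truncation is. A rough sweep over the locations already mentioned in this thread (adjust paths as needed):
```
# look for zero-length files in the areas the upgrade touches
sudo find /opt/rke2 /etc/systemd/system /var/lib/rancher -xdev -type f -empty -ls 2>/dev/null
```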
b
For what it's worth, the precheck script checks to make sure that there's 30G of space available on the nodes. So while 19G seems like plenty, it's less than what's recommended.
Is this the first node that tried to update? Did any of the other nodes complete?
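Returning to the 30G figure mentioned above: a quick, hedged sketch of checking it per node, assuming the requirement applies to the filesystem holding /usr/local (the 30G threshold is quoted from the conversation, not verified here):
```
# available space, in whole GiB, on the partition backing /usr/local
avail_gb=$(df --output=avail -BG /usr/local | tail -n1 | tr -dc '0-9')
echo "available: ${avail_gb}G (precheck reportedly wants at least 30G)"
```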
s
Yes, this is the first node. No, the others have not updated - and they can't update until the first has completed.
b
What I would do... is probably:
• Delete the upgrade object (sorry, I don't remember exactly what the name is), which should cancel the upgrade.
• Remove node1 from the cluster.
• Re-install it with the version matching node2 and node3.
• Wait for a healthy cluster.
• Run the preupgrade check (I have a pending PR with new tests if you're feeling brave and want to try it out) and make sure everything passes.
• Kick off a new upgrade to 1.3.2.
👍 1
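A rough kubectl sketch of the first two bullets above, with caveats: the resource name and namespace are assumptions (Harvester upgrades are normally represented by an upgrades.harvesterhci.io object in the harvester-system namespace), and <upgrade-name> / <node-name> are placeholders - verify against your own cluster before deleting anything:
```
# find and delete the in-flight upgrade object (name/namespace assumed - verify first)
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system delete upgrades.harvesterhci.io <upgrade-name>

# then drain and remove the broken node from the cluster
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```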
p
@sticky-summer-13450 I assume the rebooting node you mention is harvester003? Can you run blkid on it and share the output with me? Thanks
cc @bland-farmer-13503 @red-king-19196 too
s
Copy code
rancher@harvester003:~> sudo blkid
/dev/nvme0n1p5: LABEL="COS_PERSISTENT" UUID="81aae9fc-59e2-4c73-ad92-8a024aeb3357" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="persistent" PARTUUID="0b109d3c-3948-4b1c-9aca-5e40f3979cab"
/dev/nvme0n1p3: LABEL="COS_STATE" UUID="a0a8656e-9c51-484f-91b0-e85ca3971428" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="state" PARTUUID="83bc5609-e7ed-4188-b0d1-63756fadc928"
/dev/nvme0n1p1: LABEL_FATBOOT="COS_GRUB" LABEL="COS_GRUB" UUID="0916-11CA" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="primary" PARTUUID="3f34ec9a-1356-4119-92a7-41e675e8a4ec"
/dev/nvme0n1p6: LABEL="HARV_LH_DEFAULT" UUID="77c46199-ed8c-40df-b96f-3b5966b47b9d" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="longhorn" PARTUUID="3ab004c5-ae89-4750-a873-c46e037228f8"
/dev/nvme0n1p4: LABEL="COS_RECOVERY" UUID="8a96ed32-696f-42e5-ad09-eb8ccee8603b" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="recovery" PARTUUID="16e7de20-73c4-496c-bd3b-7cd8e503a6e8"
/dev/nvme0n1p2: LABEL="COS_OEM" UUID="77340ad5-2d73-4830-98b1-952cc0b73fcc" BLOCK_SIZE="1024" TYPE="ext4" PARTLABEL="oem" PARTUUID="21aca11f-20f2-4cf9-b6fe-6c4c4b8dcb2c"
/dev/loop0: LABEL="COS_ACTIVE" UUID="c0627e2c-e94a-4cd5-b20b-b5acbed52f0b" BLOCK_SIZE="4096" TYPE="ext2"
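The /dev/loop0 line is the notable one: / is loop-mounted from the active OS image rather than from a regular partition. To see which file backs the loop device (and, assuming the usual Elemental-style layout Harvester uses, where the active/passive images live - the exact path is an assumption and may differ by version):
```
# show which backing file /dev/loop0 is attached to
losetup -l

# typical location of the active/passive OS images on the COS_STATE partition
# (path is an assumption; adjust if the state partition is mounted elsewhere)
ls -lh /run/initramfs/cos-state/cOS/
```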