# harvester
w
A different experience for me - with my 2-node cluster the first node got stuck at Pre-drained but the second node was still happy (not the failed rke2-agent problem). After leaving it over the weekend I rebooted it, then the upgrade proceeded and both nodes are now on 1.3.2.
have you tried manually restarting the rke2-server process on the first node?
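For reference, a minimal sketch of what a manual restart plus log check looks like, assuming the standard unit name rke2-server.service used by Harvester's RKE2 install:
```
# restart the RKE2 server unit and check how it comes back up
sudo systemctl restart rke2-server.service
sudo systemctl status rke2-server.service --no-pager

# follow the journal for the unit to see why it fails, if it does
sudo journalctl -u rke2-server.service -f
```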
s
Interesting... looking carefully at the output of systemctl list-units, the first node (the one that's been "updated") does not have an rke2-server.service:
...
Copy code
lvm2-monitor.service                                                                          loaded active exited    Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
  rancher-system-agent.service                                                                  loaded active running   Rancher System Agent
  smartd.service                                                                                loaded active running   Self Monitoring and Reporting Technology (SMART) Daemon
  sshd.service                                                                                  loaded active running   OpenSSH Daemon
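A quicker way to look for the relevant units than reading the whole listing is to filter it; this sketch assumes the usual rke2-server / rke2-agent / rancher-system-agent unit names:
```
# show only rke2/rancher related units, including ones that are not running
systemctl list-units --all --type=service | grep -E 'rke2|rancher'

# also check unit files systemd knows about, even if they were never started
systemctl list-unit-files | grep -E 'rke2|rancher'
```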
w
so there's no /etc/systemd/system/rke2-server.service file?
s
good point - there is, but it's empty:
Copy code
rancher@harvester003:~> ls -la /etc/systemd/system/rke2-server.service
-rw-r--r-- 1 root root 0 Sep 23 19:12 /etc/systemd/system/rke2-server.service
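Since one unit file is zero length, it may be worth checking whether anything else under /etc/systemd/system was truncated the same way; a small sketch:
```
# list any zero-length files under /etc/systemd/system (and its drop-in dirs)
sudo find /etc/systemd/system -maxdepth 2 -type f -empty -ls
```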
w
Hmm, most odd. The contents of mine are as follows if you want to try recreating it, though I wonder what else has happened to your node.
Copy code
[Unit]
Description=Rancher Kubernetes Engine v2 (server)
Documentation=https://github.com/rancher/rke2#readme
Wants=network-online.target
After=network-online.target
Conflicts=rke2-agent.service

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/opt/rke2/lib/systemd/system/%N.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/opt/rke2/bin/rke2 server
ExecStopPost=-/bin/sh -c "systemd-cgls /system.slice/%n | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill"
EnvironmentFile=-/var/lib/rancher/rke2/system-agent-installer/rke2-sa.env
Since you have 3 nodes, your other two should also have rke2-server rather than rke2-agent, so the above file should exist on both of those.
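If recreating the file by hand, systemd also needs to re-read its configuration before the unit can be started; roughly, assuming the contents above are written to /etc/systemd/system/rke2-server.service:
```
# after restoring the unit file, reload systemd and bring the service up
sudo systemctl daemon-reload
sudo systemctl enable --now rke2-server.service
sudo journalctl -u rke2-server.service -f
```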
s
The other two nodes haven't been updated yet, so they may not be the same as your version - but thank you, I'll try that. What worries me is (a) why it has been truncated, and (b) what other damage has been done.
w
I trust /etc/systemd hasn't run out of space 😱
s
on the non-updated nodes I have:
Copy code
-rw-r--r--  1 root root 554 Jul  6 13:27 /etc/systemd/system/rancher-system-agent.service
-rw-r--r--  1 root root 868 Sep 23 12:33 /etc/systemd/system/rke2-agent.service
-rw-r--r--  1 root root 943 Sep 23 12:33 /etc/systemd/system/rke2-server.service
-rw-r--r--. 1 root root 317 Jun 14 02:00 /etc/systemd/system/rke2-shutdown.service
On the updated node I have:
Copy code
-rw-r--r--  1 root root 554 Jul  6 13:27 /etc/systemd/system/rancher-system-agent.service
-rw-r--r--  1 root root   0 Sep 23 19:12 /etc/systemd/system/rke2-agent.service
-rw-r--r--  1 root root   0 Sep 23 19:12 /etc/systemd/system/rke2-server.service
-rw-r--r--. 1 root root 317 Sep  6 07:01 /etc/systemd/system/rke2-shutdown.service
👀 1
So maybe the two rke2-[agent|server] service files have been updated (yesterday's date)
w
on my updated node I have:
Copy code
-rw-r--r--   1 root root  554 Jun 14 10:40 rancher-system-agent.service
-rw-r--r--   1 root root  554 Nov 22  2022 rancher-system-agent.service.ORIG
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rancher-system-agent.service.d
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rancherd.service.d
drwxr-xr-x   2 root root 4096 Sep 12  2023 reboot.target.requires
drwxr-xr-x.  2 root root 4096 Sep  3 17:53 remote-fs.target.wants
-rw-r--r--   1 root root  868 Sep 23 17:15 rke2-agent.service
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rke2-agent.service.d
-rw-r--r--   1 root root  943 Sep 23 17:15 rke2-server.service
drwxr-xr-x.  2 root root 4096 Sep  6 07:01 rke2-server.service.d
-rw-r--r--.  1 root root  317 Sep  6 07:01 rke2-shutdown.service
s
/etc seems to be on / and it has plenty of space:
Copy code
/dev/loop0     ext2      3.0G  1.3G  1.6G  46% /
w
try /etc/systemd/system - for me /etc/systemd is a separate mount
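findmnt is a handy way to see which filesystem actually backs a path, including bind mounts; a sketch:
```
# show the mount (and any bind source) that backs /etc/systemd/system
findmnt --target /etc/systemd/system

# report free space on whichever filesystem that resolves to
df -h /etc/systemd/system
```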
s
oh - I see. /etc/systemd is a bind mount from the /usr/local filesystem:
Copy code
rancher@harvester003:~> cat /etc/fstab | grep '/etc/systemd'
/usr/local/.state/etc-systemd.bind /etc/systemd none defaults,bind 0 0
Seems to be plenty of space there:
Copy code
rancher@harvester003:~> df -h /usr/local
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p5   98G   75G   19G  81% /usr/local
So, I made rke2-server.service, and when I start that service I find that /opt/rke2/bin/rke2 is also zero length. So it looks like a lot of useful files are zero length and this node is screwed.
Copy code
rancher@harvester003:~> ls -la /opt/rke2/bin/rke2
-rwxr-xr-x 1 root root 0 Aug  1 22:36 /opt/rke2/bin/rke2
Can I stop the upgrade, roll this node back and try again? Or, what are my other options?
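Before choosing a recovery path, it might help to gauge how widespread the truncation is. A rough sweep over the locations already mentioned in this thread (adjust paths as needed):
```
# look for zero-length files in the areas the upgrade touches
sudo find /opt/rke2 /etc/systemd/system /var/lib/rancher -xdev -type f -empty -ls 2>/dev/null
```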
b
For what it's worth, the precheck script checks to make sure that there's 30G of space available on the nodes. So while 19G seems like plenty, it's less than what's recommended.
Is this the first node that tried to update? Did any of the other nodes complete?
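Returning to the 30G figure mentioned above: a quick, hedged sketch of checking it per node, assuming the requirement applies to the filesystem holding /usr/local (the 30G threshold is quoted from the conversation, not verified here):
```
# available space, in whole GiB, on the partition backing /usr/local
avail_gb=$(df --output=avail -BG /usr/local | tail -n1 | tr -dc '0-9')
echo "available: ${avail_gb}G (precheck reportedly wants at least 30G)"
```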
s
Yes, this is the first node. No, the others have not updated - and they can't update until the first has completed.
b
What I would do... is probably:
• Delete the upgrade object (sorry, I don't remember exactly what the name is), which should cancel the upgrade.
• Remove node1 from the cluster.
• Re-install it with the version matching node2 and node3.
• Wait for a healthy cluster.
• Run the preupgrade check (I have a pending PR with new tests if you're feeling brave and want to try it out) and make sure everything passes.
• Kick off a new upgrade to 1.3.2.
👍 1
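A rough kubectl sketch of the first two bullets above, with caveats: the resource name and namespace are assumptions (Harvester upgrades are normally represented by an upgrades.harvesterhci.io object in the harvester-system namespace), and <upgrade-name> / <node-name> are placeholders - verify against your own cluster before deleting anything:
```
# find and delete the in-flight upgrade object (name/namespace assumed - verify first)
kubectl -n harvester-system get upgrades.harvesterhci.io
kubectl -n harvester-system delete upgrades.harvesterhci.io <upgrade-name>

# then drain and remove the broken node from the cluster
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```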
p
@sticky-summer-13450 I assume the rebooting node you mention is harvester003? Can you run blkid on it and share the output with me? Thanks
cc @bland-farmer-13503 @red-king-19196 too
s
Copy code
rancher@harvester003:~> sudo blkid
/dev/nvme0n1p5: LABEL="COS_PERSISTENT" UUID="81aae9fc-59e2-4c73-ad92-8a024aeb3357" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="persistent" PARTUUID="0b109d3c-3948-4b1c-9aca-5e40f3979cab"
/dev/nvme0n1p3: LABEL="COS_STATE" UUID="a0a8656e-9c51-484f-91b0-e85ca3971428" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="state" PARTUUID="83bc5609-e7ed-4188-b0d1-63756fadc928"
/dev/nvme0n1p1: LABEL_FATBOOT="COS_GRUB" LABEL="COS_GRUB" UUID="0916-11CA" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="primary" PARTUUID="3f34ec9a-1356-4119-92a7-41e675e8a4ec"
/dev/nvme0n1p6: LABEL="HARV_LH_DEFAULT" UUID="77c46199-ed8c-40df-b96f-3b5966b47b9d" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="longhorn" PARTUUID="3ab004c5-ae89-4750-a873-c46e037228f8"
/dev/nvme0n1p4: LABEL="COS_RECOVERY" UUID="8a96ed32-696f-42e5-ad09-eb8ccee8603b" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="recovery" PARTUUID="16e7de20-73c4-496c-bd3b-7cd8e503a6e8"
/dev/nvme0n1p2: LABEL="COS_OEM" UUID="77340ad5-2d73-4830-98b1-952cc0b73fcc" BLOCK_SIZE="1024" TYPE="ext4" PARTLABEL="oem" PARTUUID="21aca11f-20f2-4cf9-b6fe-6c4c4b8dcb2c"
/dev/loop0: LABEL="COS_ACTIVE" UUID="c0627e2c-e94a-4cd5-b20b-b5acbed52f0b" BLOCK_SIZE="4096" TYPE="ext2"
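The /dev/loop0 line is the notable one: / is loop-mounted from the active OS image rather than from a regular partition. To see which file backs the loop device (and, assuming the usual Elemental-style layout Harvester uses, where the active/passive images live - the exact path is an assumption and may differ by version):
```
# show which backing file /dev/loop0 is attached to
losetup -l

# typical location of the active/passive OS images on the COS_STATE partition
# (path is an assumption; adjust if the state partition is mounted elsewhere)
ls -lh /run/initramfs/cos-state/cOS/
```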