This message was deleted Rancher Users #harvester

# harvester

adamant-kite-43734

03/25/2024, 3:30 PM

This message was deleted.

bored-painting-68221

03/25/2024, 3:49 PM

When you say "cannot get to the console" do you mean the Harvester dashboard or do you mean you don't have IPMI or physical access to the two servers that are down?

bulky-lion-74983

03/25/2024, 4:34 PM

Harvester dashboard. We have physical access to everything. It's just the 2 servers that aren't allowing dashboard access.

bored-painting-68221

03/25/2024, 4:37 PM

Alright. Make sure all the servers are powered on. Then, starting with one of the nodes that failed earlier, access its console and check what rke2-server is logging:

# journalctl -u rke2-server
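
The journal for a crash-looping unit can be long, so it may help to filter it down to the lines rke2 itself marks as fatal or error. A minimal sketch, assuming the `level=...` format that rke2 uses in its own log lines; the helper name `rke2_fatal_lines` is made up for illustration:

```shell
# Hypothetical helper: keep only rke2 log lines marked level=fatal or level=error.
# Reads journal text on stdin, so it works with journalctl or a saved log file.
rke2_fatal_lines() {
    grep -E 'level=(fatal|error)'
}

# Usage on a node (as root):
# journalctl -u rke2-server --no-pager -n 500 | rke2_fatal_lines
```

The systemd "Failed to start" lines are dropped on purpose, since they only repeat that the unit exited and not why.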

bulky-lion-74983

03/25/2024, 4:51 PM

The second one that is not ready has no entries in the output. The first one had a ton of them, and this sequence keeps repeating:

Mar 25 16:50:20 harvey-02 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
Mar 25 16:50:26 harvey-02 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 65566.
Mar 25 16:50:26 harvey-02 systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Mar 25 16:50:26 harvey-02 systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Mar 25 16:50:26 harvey-02 sh[26990]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 25 16:50:26 harvey-02 sh[26991]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory

Going to guess I have a file system problem on that one. But how do I get the first one with no entries to go ready?

bored-painting-68221

03/25/2024, 4:58 PM

let's focus on the one that is logging for now, since we need to get 2 of them back up for the cluster to restore quorum
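
For reference, etcd needs a majority of its members healthy to accept writes, which is where "2 of them" comes from if the cluster has three management (etcd) nodes. A small worked sketch; the function name is hypothetical:

```shell
# Majority needed for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

etcd_quorum 3   # prints 2
etcd_quorum 5   # prints 3
```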

bored-painting-68221

03/25/2024, 4:59 PM

Can you share the output of

# blkid

# lsblk

bulky-lion-74983

03/25/2024, 5:00 PM

Sure

bulky-lion-74983

03/25/2024, 5:00 PM

harvey-02:~ # blkid
/dev/sdf: UUID="ddd63fa5-3470-4239-bbc9-a1580354c041" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sdd: UUID="89750dea-5cce-4d1a-90bf-628d19a280cc" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sdb4: LABEL="COS_STATE" UUID="2a7e5f06-d90e-4640-9749-c1c4143866f5" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="state" PARTUUID="9b09c68b-b9a5-40c0-8c09-4cfd0c080bfd"
/dev/sdb2: LABEL="COS_OEM" UUID="232a8502-89c3-4147-a1d6-362c52a3d6e1" BLOCK_SIZE="1024" TYPE="ext4" PARTLABEL="oem" PARTUUID="3f63229d-c828-4082-b4ee-bd8be3b08de4"
/dev/sdb5: LABEL="COS_PERSISTENT" UUID="7c71a967-728b-4c39-b78e-d1667a4bbe68" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="persistent" PARTUUID="1d0f464e-51a2-4fbf-a1a1-f12f60ced4da"
/dev/sdb3: LABEL="COS_RECOVERY" UUID="34bd40a5-8933-4f37-afbf-c7d5c087f4cb" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="recovery" PARTUUID="0c2f8c99-6a2a-4670-a47e-d107d6ee1fb4"
/dev/sdb1: SEC_TYPE="msdos" LABEL_FATBOOT="COS_GRUB" LABEL="COS_GRUB" UUID="99C7-9AE6" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="efi" PARTUUID="6e382787-675d-47bc-a25f-f3d6d95b65c5"
/dev/loop0: LABEL="COS_ACTIVE" UUID="0e98d4b6-79ed-4edc-b676-342320650f84" BLOCK_SIZE="4096" TYPE="ext2"
/dev/sde: UUID="db1953e2-461d-4dee-94ee-37010bc5225b" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sdc: UUID="30ad4bcc-7d09-4d38-a38d-119df9c80ed3" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sda: LABEL="HARV_LH_DEFAULT" UUID="d83b5cd4-8f14-4f0d-ac2f-01e238078d6e" BLOCK_SIZE="4096" TYPE="ext4"

bulky-lion-74983

03/25/2024, 5:01 PM

harvey-02:~ # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0    7:0    0     3G  1 loop /
sda      8:0    0 465.3G  0 disk /var/lib/harvester/defaultdisk
sdb      8:16   0 558.4G  0 disk 
├─sdb1   8:17   0    64M  0 part 
├─sdb2   8:18   0    64M  0 part /oem
├─sdb3   8:19   0     4G  0 part 
├─sdb4   8:20   0     8G  0 part /run/initramfs/cos-state
└─sdb5   8:21   0 546.2G  0 part /var/lib/longhorn
                                 /var/crash
                                 /var/lib/third-party
                                 /var/lib/cni
                                 /var/lib/wicked
                                 /var/lib/kubelet
                                 /var/lib/rancher
                                 /var/log
                                 /usr/libexec
                                 /root
                                 /opt
                                 /home
                                 /etc/pki/trust/anchors
                                 /etc/cni
                                 /etc/iscsi
                                 /etc/ssh
                                 /etc/rancher
                                 /etc/systemd
                                 /usr/local
sdc      8:32   0 558.4G  0 disk 
sdd      8:48   0 465.3G  0 disk 
sde      8:64   0 465.3G  0 disk 
sdf      8:80   0 465.3G  0 disk 
sr0     11:0    1  1024M  0 rom

bulky-lion-74983

03/25/2024, 5:49 PM

I am going to give you the full error message cycle as maybe there is something else going on too.

Mar 25 16:50:14 harvey-02 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
Mar 25 16:50:19 harvey-02 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 65565.
Mar 25 16:50:19 harvey-02 systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Mar 25 16:50:19 harvey-02 systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Mar 25 16:50:19 harvey-02 sh[26931]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 25 16:50:19 harvey-02 sh[26936]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Mar 25 16:50:19 harvey-02 harv-update-rke2-server-url[26947]: + HARVESTER_CONFIG_FILE=/oem/harvester.config
Mar 25 16:50:19 harvey-02 harv-update-rke2-server-url[26947]: + RKE2_VIP_CONFIG_FILE=/etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
Mar 25 16:50:19 harvey-02 harv-update-rke2-server-url[26947]: + case $1 in
Mar 25 16:50:19 harvey-02 harv-update-rke2-server-url[26947]: + rm -f /etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="Unknown flag --apiVersion found in config.yaml, skipping\n"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="Unknown flag --kind found in config.yaml, skipping\n"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="Unknown flag --omitStages found in config.yaml, skipping\n"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="Unknown flag --omitStages found in config.yaml, skipping\n"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="Unknown flag --rules found in config.yaml, skipping\n"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=warning msg="not running in CIS mode"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=info msg="Applying Pod Security Admission Configuration"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=info msg="Starting rke2 v1.25.9+rke2r1 (842d05e64bcbf78552f1db0b32700b8faea403a0)"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=info msg="Managed etcd cluster bootstrap already complete and initialized"
Mar 25 16:50:19 harvey-02 rke2[26950]: time="2024-03-25T16:50:19Z" level=info msg="Starting temporary etcd to reconcile with datastore"
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.048Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://127.0.0.1:2400"]}
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.048Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:2399"]}
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.048Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.19.8 X:boringcrypto","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":true,"n>
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.072Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/rancher/rke2/server/db/etcd-tmp/member/snap/db","took":"23.834257ms"}
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.928Z","caller":"embed/etcd.go:368","msg":"closing etcd server","name":"harvey-02-01b859d5","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","advertise-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"]}
Mar 25 16:50:20 harvey-02 rke2[26950]: {"level":"info","ts":"2024-03-25T16:50:20.928Z","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"harvey-02-01b859d5","data-dir":"/var/lib/rancher/rke2/server/db/etcd-tmp","advertise-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"]}
Mar 25 16:50:20 harvey-02 rke2[26950]: time="2024-03-25T16:50:20Z" level=fatal msg="Failed to reconcile with temporary etcd: walpb: crc mismatch"
Mar 25 16:50:20 harvey-02 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Mar 25 16:50:20 harvey-02 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Mar 25 16:50:20 harvey-02 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).

bored-painting-68221

03/25/2024, 6:13 PM

Was this a restart or a power failure? Also,

df -h /var/lib/rancher

bulky-lion-74983

03/25/2024, 6:15 PM

The power didn't fail. It was sitting at the BIOS screen and was then restarted.

bulky-lion-74983

03/25/2024, 6:16 PM

harvey-02:~ # df -h /var/lib/rancher
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb5       537G   34G  476G   7% /var/lib/rancher

bulky-lion-74983

03/25/2024, 6:17 PM

There was a disk controller battery failure, and we are getting a new battery, but the main power never failed.

bored-painting-68221

03/25/2024, 6:17 PM

Have you run any SMART checks on /dev/sdb?

bulky-lion-74983

03/25/2024, 6:18 PM

No.

bulky-lion-74983

03/25/2024, 6:19 PM

Can it be run from the shell?

bulky-lion-74983

03/25/2024, 6:19 PM

Or do we need to go to the BIOS?

bored-painting-68221

03/25/2024, 6:19 PM

yeah,

smartctl /dev/sdb

bored-painting-68221

03/25/2024, 6:19 PM

er,

smartctl -a /dev/sdb

rather

bulky-lion-74983

03/25/2024, 6:26 PM

harvey-02:~ # smartctl -a /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.92-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdb failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

bulky-lion-74983

03/25/2024, 6:27 PM

harvey-02:~ # smartctl -ad megaraid,N /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.92-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdb: Unknown device type 'megaraid,N'
=======> VALID ARGUMENTS ARE: ata, scsi[+TYPE], nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, sntjmicron[,NSID], sntrealtek, intelliprop,N[+TYPE], jmb39x[-q],N[,sLBA][,force][+TYPE], jms56x,N[,sLBA][,force][+TYPE], marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, cciss,N, auto, test <=======

bored-painting-68221

03/25/2024, 6:28 PM

I think replace the "N" with a "1"

smartctl -ad megaraid,1 /dev/sdb
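
The N in `-d megaraid,N` selects a physical disk behind the RAID controller by its device ID, and it is not always obvious which IDs are in use. A rough sketch that simply tries a few IDs and prints the health summary for each; the IDs passed in are guesses, and `SMARTCTL` is a variable introduced here only so the loop can be dry-run with `echo`:

```shell
SMARTCTL=${SMARTCTL:-smartctl}

# Hypothetical helper: probe_megaraid DEVICE ID [ID...]
probe_megaraid() {
    dev=$1
    shift
    for n in "$@"; do
        echo "== megaraid,$n on $dev =="
        # -H prints the overall health status; failures for unused IDs are expected
        $SMARTCTL -H -d "megaraid,$n" "$dev" || true
    done
}

# On the node, as root:
# probe_megaraid /dev/sdb 0 1 2 3
```

Swapping `-H` for `-a` gives the full report once the populated IDs are known.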

bulky-lion-74983

03/25/2024, 6:28 PM

Got it:

harvey-02:~ # smartctl -ad megaraid,0 /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.92-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              AL13SEB600
Revision:             DE11
Compliance:           SPC-4
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x50000396880a3675
Serial number:        9560A1CEFRD3
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Mar 25 18:27:31 2024 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 50562:46
Manufactured in week 36 of year 2015
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  35
Specified load-unload count over device lifetime:  200000
Accumulated load-unload cycles:  25140
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       75        75        75         92    1665384.714           0
write:         0      168       168       168        168      64334.156           0
verify:        0        5         5         5          5     179520.188           0

Non-medium error count:       49

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64   50547                 - [-   -    -]
# 2  Background short  Completed                  64   50523                 - [-   -    -]
# 3  Background short  Completed                  64   50499                 - [-   -    -]
# 4  Background short  Completed                  64   50475                 - [-   -    -]
# 5  Background short  Completed                  64   50451                 - [-   -    -]
# 6  Background short  Completed                  64   50427                 - [-   -    -]
# 7  Background short  Completed                  64   50403                 - [-   -    -]
# 8  Background short  Completed                  64   50379                 - [-   -    -]
# 9  Background short  Completed                  64   50355                 - [-   -    -]
#10  Background short  Completed                  64   50331                 - [-   -    -]
#11  Background short  Completed                  64   50307                 - [-   -    -]
#12  Background short  Completed                  64   50283                 - [-   -    -]
#13  Background short  Completed                  64   50259                 - [-   -    -]
#14  Background short  Completed                  64   50235                 - [-   -    -]
#15  Background short  Completed                  64   50211                 - [-   -    -]
#16  Background short  Completed                  64   50187                 - [-   -    -]
#17  Background short  Completed                  64   50163                 - [-   -    -]
#18  Background short  Completed                  64   50139                 - [-   -    -]
#19  Background short  Completed                  64   50115                 - [-   -    -]
#20  Background short  Completed                  64   50091                 - [-   -    -]

Long (extended) Self-test duration: 3964 seconds [66.1 minutes]

bulky-lion-74983

03/25/2024, 6:28 PM

oh that was 0

bulky-lion-74983

03/25/2024, 6:29 PM

They both look the same

harvey-02:~ # smartctl -ad megaraid,1 /dev/sdb
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.92-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              AL13SEB600
Revision:             DE0C
Compliance:           SPC-4
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x50000395e8038f2d
Serial number:        Y450A1BWFRD3
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Mar 25 18:28:47 2024 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     26 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 59449:08
Manufactured in week 45 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  55
Specified load-unload count over device lifetime:  200000
Accumulated load-unload cycles:  18882
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        3         3         3          6    2089672.317           0
write:         0   251302    251302    251302     599314      84983.114           0
verify:        0        0         0         0          0     211209.315           0

Non-medium error count:       48

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64   59433                 - [-   -    -]
# 2  Background short  Completed                  64   59409                 - [-   -    -]
# 3  Background short  Completed                  64   59385                 - [-   -    -]
# 4  Background short  Completed                  64   59361                 - [-   -    -]
# 5  Background short  Completed                  64   59337                 - [-   -    -]
# 6  Background short  Completed                  64   59314                 - [-   -    -]
# 7  Background short  Completed                  64   59290                 - [-   -    -]
# 8  Background short  Completed                  64   59266                 - [-   -    -]
# 9  Background short  Completed                  64   59242                 - [-   -    -]
#10  Background short  Completed                  64   59218                 - [-   -    -]
#11  Background short  Completed                  64   59194                 - [-   -    -]
#12  Background short  Completed                  64   59170                 - [-   -    -]
#13  Background short  Completed                  64   59146                 - [-   -    -]
#14  Background short  Completed                  64   59122                 - [-   -    -]
#15  Background short  Completed                  64   59098                 - [-   -    -]
#16  Background short  Completed                  64   59074                 - [-   -    -]
#17  Background short  Completed                  64   59050                 - [-   -    -]
#18  Background short  Completed                  64   59026                 - [-   -    -]
#19  Background short  Completed                  64   59002                 - [-   -    -]
#20  Background short  Completed                  64   58978                 - [-   -    -]

Long (extended) Self-test duration: 3964 seconds [66.1 minutes]
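
Both drives report a health status of OK and 0 uncorrected errors, though the second one (megaraid,1) shows far more corrected write errors (251302 versus 168). One way to pull just those columns out of `smartctl -a` output for a side-by-side look, as a sketch that assumes the SAS error counter layout shown above (the function name is made up):

```shell
# Print total corrected and total uncorrected errors per operation from the
# "Error counter log" table of smartctl -a output (read on stdin).
# Column 5 is "Total errors corrected"; the last column is "Total uncorrected errors".
smart_error_summary() {
    awk '$1 ~ /^(read|write|verify):$/ {
        printf "%s corrected=%s uncorrected=%s\n", $1, $5, $NF
    }'
}

# On the node, as root:
# smartctl -a -d megaraid,1 /dev/sdb | smart_error_summary
```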

brainy-whale-97450

03/25/2024, 7:34 PM

Did you make any changes to any files, like in /oem or the Linux command line in the GRUB config? We had a very similar problem. It turned out to be a typo in /oem/90... that was causing rke2 not to start. We had copied this to multiple machines, and all of them failed in a similar way.

bulky-lion-74983

03/25/2024, 7:36 PM

Nope, no changes. I have been looking at the "Failed to reconcile with temporary etcd: walpb: crc mismatch" error.

bulky-lion-74983

03/25/2024, 7:38 PM

But not sure how to fix that. I tried renaming the wal files and got a different error.
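
Before moving or renaming anything under the etcd data directory, it is probably worth taking a copy with rke2 stopped, so a failed experiment can be undone. A sketch, assuming the default RKE2 data path `/var/lib/rancher/rke2/server/db/etcd` (the log above shows the sibling `etcd-tmp` directory); `backup_etcd_dir` is a hypothetical helper:

```shell
# Copy an etcd data directory to a timestamped sibling, preserving attributes,
# and print the path of the copy.
backup_etcd_dir() {
    src=${1:-/var/lib/rancher/rke2/server/db/etcd}
    dest="${src}.bak-$(date +%Y%m%d-%H%M%S)"
    cp -a "$src" "$dest" && echo "$dest"
}

# On the node, as root:
# systemctl stop rke2-server
# backup_etcd_dir
```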

bulky-lion-74983

03/25/2024, 9:01 PM

I looked at the other server that says Not Ready, pointed it to the one that is running, and started rke2. These are the error messages I get from that one:

Mar 25 20:46:34 harvey-10 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 208.
Mar 25 20:46:34 harvey-10 systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Mar 25 20:46:34 harvey-10 systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Mar 25 20:46:34 harvey-10 sh[9248]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 25 20:46:34 harvey-10 sh[9249]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Mar 25 20:46:34 harvey-10 harv-update-rke2-server-url[9252]: + HARVESTER_CONFIG_FILE=/oem/harvester.config
Mar 25 20:46:34 harvey-10 harv-update-rke2-server-url[9252]: + RKE2_VIP_CONFIG_FILE=/etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
Mar 25 20:46:34 harvey-10 harv-update-rke2-server-url[9252]: + case $1 in
Mar 25 20:46:34 harvey-10 harv-update-rke2-server-url[9252]: + rm -f /etc/rancher/rke2/config.yaml.d/90-harvester-vip.yaml
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Unknown flag --apiVersion found in config.yaml, skipping\n"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Unknown flag --kind found in config.yaml, skipping\n"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Unknown flag --omitStages found in config.yaml, skipping\n"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Unknown flag --omitStages found in config.yaml, skipping\n"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Unknown flag --rules found in config.yaml, skipping\n"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="not running in CIS mode"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=info msg="Applying Pod Security Admission Configuration"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=info msg="Starting rke2 v1.25.9+rke2r1 (842d05e64bcbf78552f1db0b32700b8faea403a0)"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=info msg="Managed etcd cluster not yet initialized"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ClusterDNSs.slice[0].slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ClusterIPRanges.slice[0].IP.slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ClusterDNS.slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ClusterIPRange.IP.slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ServiceIPRange.IP.slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=warning msg="critical configuration mismatched: ServiceIPRanges.slice[0].IP.slice[13]"
Mar 25 20:46:34 harvey-10 rke2[9254]: time="2024-03-25T20:46:34Z" level=fatal msg="starting kubernetes: preparing server: failed to validate server configuration: critical configuration value mismatch between servers"
Mar 25 20:46:34 harvey-10 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Mar 25 20:46:34 harvey-10 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Mar 25 20:46:34 harvey-10 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).

bulky-lion-74983

03/25/2024, 9:02 PM

Mismatched configurations. Do you know how to synchronize them?
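
The mismatch warnings name the cluster DNS, cluster IP range, and service IP range, which in RKE2 correspond to the `cluster-dns`, `cluster-cidr`, and `service-cidr` options. A sketch for dumping whatever each node has set so the values can be compared by eye; it assumes the settings appear as top-level keys in `/etc/rancher/rke2/config.yaml` or under `config.yaml.d/`, and `show_cluster_net_config` is a name invented here:

```shell
# Print the cluster networking keys found in a node's RKE2 config files.
show_cluster_net_config() {
    dir=${1:-/etc/rancher/rke2}
    # -H prefixes each match with its file name; a missing file or no match
    # is not treated as an error here.
    grep -HE '^(cluster-cidr|service-cidr|cluster-dns):' \
        "$dir/config.yaml" "$dir"/config.yaml.d/*.yaml 2>/dev/null || true
}

# Run on the working node and on the failing one, then compare the output:
# show_cluster_net_config
```

If a key is missing on one node, that node is using the built-in default for it, which may itself be the source of the mismatch.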

bulky-lion-74983

03/25/2024, 9:03 PM

If we can get this one working then we can just reinstall that first one and add it to the cluster.

bulky-lion-74983

03/26/2024, 3:09 PM

Thanks for all your help, BTW, I wouldn't have gotten this far without it! Having issues like this is quite the Harvester education. lol

bulky-lion-74983

03/28/2024, 4:02 PM

We got it working again. This is for anyone who has the same issue. We restored etcd on the server that first had issues, using the previous etcd snapshot: https://gist.github.com/flrichar/667da206c39cf5d973a4abfff1507de8 We then restarted rke2-server on the second one and it connected to the first. The 3rd one, which still had all the VMs running on it, had to be rebooted, and when it came back up it rejoined. We had a few issues with VM networking, so we restarted the VMs and that solved it. Longhorn did its thing and resynced the volumes, and everything shows green now. Thanks @bored-painting-68221 and @brainy-whale-97450 for your help!
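
For readers who cannot open the gist, RKE2's documented snapshot restore is a cluster reset on one server, after which the other servers rejoin. The sketch below only prints the commands instead of running them. It is not necessarily what the linked gist does, Harvester may need extra care beyond plain RKE2, and the snapshot path is a placeholder to be replaced with a real file from the snapshot directory (by default `/var/lib/rancher/rke2/server/db/snapshots`). `print_rke2_restore_plan` is a hypothetical helper.

```shell
# Hypothetical helper: print (not execute) the documented RKE2 restore steps.
print_rke2_restore_plan() {
    snapshot=$1
    cat <<EOF
# On the server being restored:
systemctl stop rke2-server
rke2 server --cluster-reset --cluster-reset-restore-path=$snapshot
systemctl start rke2-server

# On each remaining server node, once the first one is healthy
# (this wipes that node's local etcd data, so back it up first):
systemctl stop rke2-server
rm -rf /var/lib/rancher/rke2/server/db
systemctl start rke2-server
EOF
}

print_rke2_restore_plan /var/lib/rancher/rke2/server/db/snapshots/SNAPSHOT_NAME
```

In this thread the other two nodes only needed a service restart and a reboot, so the second half of the plan may not be required in every case.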
