big-judge-33880
04/20/2023, 10:32 AM
[Thu Apr 20 10:32:04 2023] blk_update_request: critical medium error, dev sdt, sector 14776192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Thu Apr 20 10:32:04 2023] sd 16:0:0:1: [sdt] tag#104 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Apr 20 10:32:04 2023] sd 16:0:0:1: [sdt] tag#104 Sense Key : Medium Error [current]
[Thu Apr 20 10:32:04 2023] sd 16:0:0:1: [sdt] tag#104 Add. Sense: Unrecovered read error
[Thu Apr 20 10:32:04 2023] sd 16:0:0:1: [sdt] tag#104 CDB: Read(10) 28 00 00 e1 77 80 00 00 08 00
[Thu Apr 20 10:32:04 2023] blk_update_request: critical medium error, dev sdt, sector 14776192 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt1, logical block 1846768, async page read
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt2, logical block 0, async page read
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt2, logical block 0, async page read
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt2, logical block 0, async page read
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt2, logical block 0, async page read
[Thu Apr 20 10:32:04 2023] Buffer I/O error on dev sdt2, logical block 0, async page read
[Thu Apr 20 10:32:34 2023] scsi_io_completion_action: 26 callbacks suppressed
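The trace above reports an unrecoverable read on /dev/sdt. Assuming 512-byte logical sectors, the failing sector from `blk_update_request` can be converted to a byte offset (and cross-checked against the LBA bytes in the Read(10) CDB) before probing the region with tools like `dd` or `smartctl`:

```shell
# Sector from the blk_update_request line above; assumes 512-byte sectors.
sector=14776192
echo $((sector * 512))     # byte offset of the failing sector: 7565410304
printf '0x%x\n' "$sector"  # 0xe17780, matching the "00 e1 77 80" LBA bytes in the Read(10) CDB
```

The hex value lining up with the CDB confirms the kernel and the SCSI layer are complaining about the same on-disk location, i.e. a genuine medium error rather than a transport problem.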
BUG: found the second active engine ):
time="2023-04-20T11:40:04Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:04Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:04Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7" node=har-03
E0420 11:40:04.391450 1 volume_controller.go:216] failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d
time="2023-04-20T11:40:04Z" level=warning msg="Dropping Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d out of the queue" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:04Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="failed to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"
10.52.6.233 - - [20/Apr/2023:11:40:04 +0000] "GET /metrics HTTP/1.1" 200 22765 "" "Prometheus/2.28.1"
10.52.3.109 - - [20/Apr/2023:11:40:12 +0000] "GET /v1/volumes/pvc-8208cb2b-fc61-49e0-b9fc-b05e0570cd0e HTTP/1.1" 200 7691 "" "Go-http-client/1.1"
10.52.0.100 - - [20/Apr/2023:11:40:18 +0000] "GET /v1/volumes/pvc-50826dc5-2ee4-4ba2-903b-c5c1f1a905be HTTP/1.1" 200 7666 "" "Go-http-client/1.1"
10.52.5.78 - - [20/Apr/2023:11:40:28 +0000] "GET /metrics HTTP/1.1" 200 22781 "" "Prometheus/2.28.1"
10.52.0.100 - - [20/Apr/2023:11:40:30 +0000] "GET /v1/volumes/pvc-83919cf6-16e3-40a7-a049-ac241f3adead HTTP/1.1" 200 7698 "" "Go-http-client/1.1"
time="2023-04-20T11:40:34Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:34Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:34Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
E0420 11:40:34.397769 1 volume_controller.go:216] failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d
time="2023-04-20T11:40:34Z" level=warning msg="Dropping Longhorn volume longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d out of the queue" controller=longhorn-volume error="failed to sync longhorn-system/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: failed to reconcile engine/replica state for pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d: BUG: found the second active engine pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 besides pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d" node=har-03
time="2023-04-20T11:40:34Z" level=error msg="invalid customized default setting taint-toleration with value kubevirt.io/drain:NoSchedule, will continue applying other customized settings" error="failed to set the setting taint-toleration with invalid value kubevirt.io/drain:NoSchedule: cannot modify toleration setting before all volumes are detached"
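The taint-toleration errors are a side effect of the stuck volume rather than a bad value: `kubevirt.io/drain:NoSchedule` follows Longhorn's `key[=value]:Effect` toleration syntax, so the rejection comes from the ordering rule (all volumes must be detached before the setting can change), not from malformed input. A quick local syntax check (the regex here is an illustrative assumption, not Longhorn's actual validator):

```shell
# Sketch: check a toleration string against the assumed key[=value]:Effect shape.
toleration='kubevirt.io/drain:NoSchedule'
if echo "$toleration" | grep -Eq '^[A-Za-z0-9._/-]+(=[A-Za-z0-9._-]+)?:(NoSchedule|PreferNoSchedule|NoExecute)$'; then
  echo "format ok: $toleration"
fi
```

Since the format passes, the setting should apply cleanly once the stuck volume is detached.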
Size: 10 Gi
Actual Size: 3.11 Gi
Data Locality: disabled
Access Mode: ReadWriteMany
Backing Image: samhandling-image-2tp6w
Backing Image Size: 8 Gi
Engine Image: longhornio/longhorn-engine:v1.3.2
Currently this volume is stuck "Attaching".
➜ k get engine -n longhorn-system | grep pvc-460
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d running har-03 instance-manager-e-0db3693c longhornio/longhorn-engine:v1.3.2 3h50m
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 running har-03 instance-manager-e-0db3693c longhornio/longhorn-engine:v1.3.2 3h19m
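With two engine CRs both marked active for the same volume, the reconcile loop bails out with the BUG message above. Community reports of this state usually resolve it by deleting the engine CR the volume is not actually using. The sketch below is illustrative only: it hardcodes the names and ages from the `k get engine` output above, picks the newer CR as the likely stale duplicate, and only prints (does not run) the delete command.

```shell
# Illustrative sketch -- names and ages (3h50m vs 3h19m, in minutes) are
# hardcoded from the kubectl output above. The newer engine is treated as
# the suspect duplicate; verify against the volume's status and each
# engine's spec.active before deleting anything.
pick_newer() {
  printf '%s\n' \
    'pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-5233c88d 230' \
    'pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7 199' |
    sort -k2 -n | head -n1 | awk '{print $1}'
}
candidate=$(pick_newer)
echo "kubectl -n longhorn-system delete engine $candidate"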
The run-up to this is maintenance -> reboot on all nodes in a Harvester cluster to see if that would clear up the Buffer I/O errors (it did, except for this one volume).
apiVersion: longhorn.io/v1beta2
kind: Engine
metadata:
creationTimestamp: "2023-04-20T08:31:54Z"
finalizers:
- longhorn.io
generation: 3
labels:
longhornnode: har-03
longhornvolume: pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d
name: pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-e-8e3ba9d7
namespace: longhorn-system
ownerReferences:
- apiVersion: longhorn.io/v1beta2
kind: Volume
name: pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d
uid: 1aa4072c-3e19-4722-b470-5830d6b51055
resourceVersion: "250130724"
uid: 373597bb-5a64-4370-8e5d-515b1343132b
spec:
active: true
backupVolume: ""
desireState: running
disableFrontend: false
engineImage: longhornio/longhorn-engine:v1.3.2
frontend: blockdev
logRequested: false
nodeID: har-03
replicaAddressMap:
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-06dcce80: 10.30.14.40:10180
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-9588b1fe: 10.30.14.39:10270
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-f87de513: 10.30.14.34:10420
requestedBackupRestore: ""
requestedDataSource: ""
revisionCounterDisabled: false
salvageRequested: false
upgradedReplicaAddressMap: {}
volumeName: pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d
volumeSize: "10737418240"
status:
backupStatus: null
cloneStatus:
tcp://10.30.14.34:10420:
error: 'failed to get snapshot clone status of tcp://10.30.14.34:10420: failed
to get snapshot clone status: rpc error: code = Unavailable desc = all SubConns
are in TransientFailure, latest connection error: connection error: desc =
"transport: Error while dialing dial tcp 10.30.14.34:10422: connect: connection
refused"'
fromReplicaAddress: ""
isCloning: false
progress: 0
snapshotName: ""
state: ""
tcp://10.30.14.39:10270:
error: 'failed to get snapshot clone status of tcp://10.30.14.39:10270: failed
to get snapshot clone status: rpc error: code = Unavailable desc = all SubConns
are in TransientFailure, latest connection error: connection error: desc =
"transport: Error while dialing dial tcp 10.30.14.39:10272: connect: no route
to host"'
fromReplicaAddress: ""
isCloning: false
progress: 0
snapshotName: ""
state: ""
tcp://10.30.14.40:10180:
error: ""
fromReplicaAddress: ""
isCloning: false
progress: 0
snapshotName: ""
state: ""
currentImage: longhornio/longhorn-engine:v1.3.2
currentReplicaAddressMap:
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-06dcce80: 10.30.14.40:10180
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-9588b1fe: 10.30.14.39:10270
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-f87de513: 10.30.14.34:10420
currentSize: "10737418240"
currentState: running
endpoint: /dev/longhorn/pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d
instanceManagerName: instance-manager-e-0db3693c
ip: 10.52.5.52
isExpanding: false
lastExpansionError: ""
lastExpansionFailedAt: ""
lastRestoredBackup: ""
logFetched: false
ownerID: har-03
port: 10006
purgeStatus: {}
rebuildStatus: {}
replicaModeMap:
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-06dcce80: ERR
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-9588b1fe: ERR
pvc-46003070-cd0d-4e96-bdf4-9b6e9430e42d-r-f87de513: ERR
restoreStatus: {}
salvageExecuted: false
snapshots: {}
snapshotsError: ""
started: true
storageIP: 10.30.14.33
Now I have a working VM that rebuilt its third replica from the two referenced in the remaining engine.