# longhorn-storage

acceptable-soccer-28720

03/03/2023, 2:04 AM
Restarting the node and detaching/reattaching the volume did not help to resolve the situation.
rancher kubectl describe pod gitlab-postgresql-0 -n gitlab
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           14m                  default-scheduler        Successfully assigned gitlab/gitlab-postgresql-0 to vik8scases-w-2
  Warning  FailedMount         10m                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-g28wp custom-init-scripts postgresql-password dshm data]: timed out waiting for the condition
  Warning  FailedMount         7m49s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[postgresql-password dshm data kube-api-access-g28wp custom-init-scripts]: timed out waiting for the condition
  Warning  FailedMount         5m34s                kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[dshm data kube-api-access-g28wp custom-init-scripts postgresql-password]: timed out waiting for the condition
  Warning  FailedMount         3m18s (x2 over 12m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data kube-api-access-g28wp custom-init-scripts postgresql-password dshm]: timed out waiting for the condition
  Warning  FailedAttachVolume  2m9s (x6 over 12m)   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4" : timed out waiting for external-attacher of driver.longhorn.io CSI driver to attach volume pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4
  Warning  FailedMount         64s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[custom-init-scripts postgresql-password dshm data kube-api-access-g28wp]: timed out waiting for the condition

famous-shampoo-18483

03/07/2023, 3:52 AM
Can you share more details about this issue?
You can check the logs in CSI-plugin pods or longhorn-manager pods and see if there are any clues first.
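A concrete way to pull those logs (a sketch; it assumes Longhorn is installed in the default `longhorn-system` namespace and that its pods carry the standard `app=longhorn-csi-plugin` / `app=longhorn-manager` labels):

```shell
# Assumption: Longhorn runs in the "longhorn-system" namespace with its
# standard pod labels; adjust the namespace/labels for your install.

# Find the CSI plugin and manager pods.
kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o wide
kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide

# Tail recent log lines from each set of pods and filter for the affected PVC.
kubectl -n longhorn-system logs -l app=longhorn-csi-plugin --tail=500 | grep pvc-b59dafa1
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=500 | grep pvc-b59dafa1
```

Running the `logs` commands with a label selector aggregates output from every matching pod, which is convenient when you don't yet know which node's manager hit the error.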

acceptable-soccer-28720

03/07/2023, 7:53 AM
The longhorn-manager logs show:
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:26:27.290456880Z time="2023-03-02T17:26:27Z" level=error msg="Failed rebuilding of replica 10.42.66.15:10015" controller=longhorn-engine engine=pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae error="proxyServer=10.42.222.214:8501 destination=10.42.222.214:10000: failed to add replica tcp://10.42.66.15:10015 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.66.15:10015: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.42.66.15:10015: connect: connection refused\"" node=mynode-gr-2 volume=pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:26:27.290588832Z time="2023-03-02T17:26:27Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae\", UID:\"5acc73aa-75b1-4fcf-93e2-d3e625c101c7\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"99362634\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.66.15:10015: proxyServer=10.42.222.214:8501 destination=10.42.222.214:10000: failed to add replica tcp://10.42.66.15:10015 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.66.15:10015: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.42.66.15:10015: connect: connection refused\""
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:26:47.283882242Z time="2023-03-02T17:26:47Z" level=warning msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae error="failed to sync engine for longhorn-system/pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae: failed to start rebuild for pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-r-860d73d6 of pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae: timed out waiting for the condition" node=mynode-gr-2
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:27:52.359866473Z time="2023-03-02T17:27:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"Snapshot\", Namespace:\"longhorn-system\", Name:\"6f82491e-340d-453b-8012-4aa96c89e7f0\", UID:\"54d8563b-7b70-40ab-9dd1-04fb83a0fce0\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"99363245\", FieldPath:\"\"}): type: 'Warning' reason: 'SnapshotError' lost track of the corresponding snapshot info inside volume engine"
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:37:12.852837948Z time="2023-03-02T17:37:12Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=mynode-gr-2
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:37:12.858124259Z time="2023-03-02T17:37:12Z" level=warning msg="Error syncing Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: cannot apply storage-network setting to Longhorn workloads when there are attached volumes" node=mynode-gr-2
The csi-attacher-* logs show:
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T16:38:16.858273606Z I0302 16:38:16.858190       1 csi_handler.go:286] Failed to save detach error to "csi-5465590defab051551a13c29693238cb17bfecf437befbc317228983e63f6abc": volumeattachments.storage.k8s.io "csi-5465590defab051551a13c29693238cb17bfecf437befbc317228983e63f6abc" not found
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T16:38:16.858283127Z I0302 16:38:16.858204       1 csi_handler.go:231] Error processing "csi-5465590defab051551a13c29693238cb17bfecf437befbc317228983e63f6abc": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-5465590defab051551a13c29693238cb17bfecf437befbc317228983e63f6abc" not found
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T16:47:11.255939348Z I0302 16:47:11.255840       1 csi_handler.go:286] Failed to save detach error to "csi-2484896420e1b817d19dc8cf9cb04857ea98b8e92813691c8fb4c055ff5b79c0": volumeattachments.storage.k8s.io "csi-2484896420e1b817d19dc8cf9cb04857ea98b8e92813691c8fb4c055ff5b79c0" not found
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T16:47:11.255953364Z I0302 16:47:11.255858       1 csi_handler.go:231] Error processing "csi-2484896420e1b817d19dc8cf9cb04857ea98b8e92813691c8fb4c055ff5b79c0": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-2484896420e1b817d19dc8cf9cb04857ea98b8e92813691c8fb4c055ff5b79c0" not found
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T17:18:24.321435763Z I0302 17:18:24.321364       1 csi_handler.go:255] Failed to save attach error to "csi-1d7598dcbb3af22359621b0a187b54efe4ad99e2a669a88037543c0ef7ca0407": VolumeAttachment.storage.k8s.io "csi-1d7598dcbb3af22359621b0a187b54efe4ad99e2a669a88037543c0ef7ca0407" is invalid: status.attachError.message: Too long: must have at most 262144 bytes
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T17:18:24.321461722Z I0302 17:18:24.321430       1 csi_handler.go:231] Error processing "csi-1d7598dcbb3af22359621b0a187b54efe4ad99e2a669a88037543c0ef7ca0407": failed to attach: rpc error: code = Internal desc = Action [updateAccessMode] not available on [&{pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4] map[activate:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=activate attach:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=attach cancelExpansion:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=cancelExpansion detach:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=detach engineUpgrade:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=engineUpgrade pvCreate:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=pvCreate pvcCreate:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=pvcCreate recurringJobAdd:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=recurringJobAdd recurringJobDelete:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=recurringJobDelete recurringJobList:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=recurringJobList replicaRemove:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=replicaRemove snapshotBackup:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotBackup snapshotCreate:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotCreate snapshotDelete:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotDelete snapshotGet:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotGet snapshotList:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotList snapshotPurge:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotPurge snapshotRevert:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=snapshotRevert updateDataLocality:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=updateDataLocality updateReplicaAutoBalance:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=updateReplicaAutoBalance updateReplicaCount:http://longhorn-backend:9500/v1/volumes/pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4?action=updateReplicaCount]}]
csi-attacher-5ddf9c48cf-bk9kb/csi-attacher.log:2023-03-02T17:18:24.327135590Z I0302 17:18:24.327075       1 csi_handler.go:255] Failed to save attach error to "csi-1d7598dcbb3af22359621b0a187b54efe4ad99e2a669a88037543c0ef7ca0407": VolumeAttachment.storage.k8s.io "csi-1d7598dcbb3af22359621b0a187b54efe4ad99e2a669a88037543c0ef7ca0407" is invalid: status.attachError.message: Too long: must have at most 262144 bytes
It works again after restarting the corresponding pods and detaching/reattaching the corresponding volumes, but this is not a long-term solution because the problem keeps recurring.
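The manual workaround described here can be sketched as commands (pod and PVC names are taken from the events above; the StatefulSet name `gitlab-postgresql` is inferred from the pod name and should be verified first):

```shell
# Delete the stuck pod; the StatefulSet controller recreates it, which
# triggers a fresh attach cycle for the PVC.
kubectl -n gitlab delete pod gitlab-postgresql-0

# Inspect the VolumeAttachment objects for the affected volume.
kubectl get volumeattachments | grep pvc-b59dafa1-3efa-44fc-92ba-e2be23e5d4a4

# Alternatively, force a clean detach/attach by scaling the workload
# down and back up (assumes the StatefulSet is named gitlab-postgresql).
kubectl -n gitlab scale statefulset gitlab-postgresql --replicas=0
kubectl -n gitlab scale statefulset gitlab-postgresql --replicas=1
```

Scaling to zero lets Longhorn fully detach the volume before the new pod requests it again, which is gentler than deleting the pod while the attachment is in a bad state.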

famous-shampoo-18483

03/09/2023, 4:24 PM
This one is the direct cause:
longhorn-manager-sqmqs/longhorn-manager.log:2023-03-02T17:26:47.283882242Z time="2023-03-02T17:26:47Z" level=warning msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae error="failed to sync engine for longhorn-system/pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae: failed to start rebuild for pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-r-860d73d6 of pvc-974d2b20-9e3c-48d0-a4f6-34f631d383f6-e-3a4268ae: timed out waiting for the condition" node=mynode-gr-2
Somehow the longhorn-manager pod times out talking to the instance-manager-e pod on the same node, and then the volume attachment/detachment cannot continue. Next time you encounter this, can you check the following?
1. The in-cluster network situation
2. Whether the instance-manager-e pod is working fine and reachable
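Those two checks might look like this in practice (a sketch; the instance-manager label and the placeholder pod name are assumptions to adapt to your cluster, and the node name is taken from the logs above):

```shell
# 1. In-cluster network situation: node health, plus the Longhorn pods
#    scheduled on the affected node.
kubectl get nodes -o wide
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=mynode-gr-2

# 2. Whether the instance-manager-e pod is running and reachable.
#    Assumption: instance-manager pods carry the label
#    longhorn.io/component=instance-manager.
kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager -o wide
kubectl -n longhorn-system logs instance-manager-e-XXXX --tail=100   # substitute the real pod name
```

If the instance-manager pod is in CrashLoopBackOff, or other pods on that node cannot reach its IP, that points to the node-local failure the rebuild errors above suggest.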