flaky-coat-75909

07/19/2022, 9:42 AM
While using the Longhorn StorageClass, my CPU and disk IO saturation are very high. How can I debug the reason why this happens? I will give more info in the thread.
The disk IO saturation is very high (142%, 180%). It appears when someone uploads a file to redis-master (1 replica); the single file is nearly 15MB. It doesn't happen every time, only occasionally when someone uploads a file to Redis.
When the disk IO rises, the CPU saturation rises too, and here I can see the problem is on a specific node:
192.168.1.163
where it goes above 300% (Redis runs on this node).
When the error appears I have a failed replica, on
192.168.1.163
After several moments, rebuilding starts on node 163 (once it comes back to life),
then the state is updated on the other nodes (I guess),
and finally everything goes back to normal.
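To see what is actually hammering the disk on 192.168.1.163 while the upload happens, I was thinking of watching per-device and per-process IO directly on that node, roughly like this (assuming the sysstat tools, iostat and pidstat, are installed on the node and metrics-server is available for kubectl top; this is just a sketch of what I plan to run):
# extended per-device utilization, refreshed every second
iostat -x 1
# per-process disk read/write rates, to see whether redis, longhorn-manager or instance-manager is the writer
pidstat -d 1
# CPU usage per pod from the Kubernetes side, sorted by CPU
kubectl top pods -A --sort-by=cpu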
dmesg | tail
(on 192.168.1.163) returns:
[1594329.136250] sd 5:0:0:1: [sdd] Attached SCSI disk
[1594332.843527] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
[1594334.300191] IPv6: ADDRCONF(NETDEV_UP): lxc561a89fb6db7: link is not ready
[1594334.311581] eth0: renamed from tmpe73d5
[1594334.331348] IPv6: ADDRCONF(NETDEV_CHANGE): lxc561a89fb6db7: link becomes ready
[1594360.579431] EXT4-fs warning (device sdb): htree_dirblock_to_tree:984: inode #2: lblock 0: comm longhorn-manage: error -5 reading directory block
[1594364.917450] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: (null)
[1594366.276998] IPv6: ADDRCONF(NETDEV_UP): lxc377ac94e1cc2: link is not ready
[1594366.285463] eth0: renamed from tmp35dfa
[1594366.300134] IPv6: ADDRCONF(NETDEV_CHANGE): lxc377ac94e1cc2: link becomes ready
The full log (without | tail) is in the attachment.
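That EXT4 warning on sdb reports error -5 (EIO), so I also want to rule out a problem with the underlying disk itself on that node. A rough check I had in mind (assuming smartmontools is installed and that sdb really is the disk backing the Longhorn data path; I still need to confirm that mapping):
# confirm which device backs the Longhorn data directory
lsblk -o NAME,SIZE,MOUNTPOINT
# SMART health summary and full attributes for the suspect disk
smartctl -H /dev/sdb
smartctl -a /dev/sdb
# look for further IO errors on that device
dmesg | grep -iE 'sdb|i/o error'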
I also see alarming metric values:
sum(rate(container_memory_failures_total{pod!=""}[5m])) by (pod)
goes up to 7k+ for
longhorn-manager
while most apps stay below 100 for this time series.
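If I understand the cAdvisor metric correctly, container_memory_failures_total also carries a failure_type label, so I can try to separate major page faults (which actually hit the disk) from minor ones with a narrower query, something like this (assuming the standard cAdvisor labels are exposed in my setup):
sum(rate(container_memory_failures_total{pod!="", failure_type="pgmajfault"}[5m])) by (pod)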
And I see in the logs of
instance-manager
[pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c] time="2022-07-19T09:21:35Z" level=warning msg="Received signal interrupt to shutdown"
This PVC points to the Redis PVC. I also have some more logs:
time="2022-07-19T09:19:43Z" level=error msg="Error reading from wire: read tcp 10.0.2.60:36200->10.0.1.253:10016: use of closed network connection"
and for
longhorn-manager
time="2022-07-19T08:37:38Z" level=debug msg="CheckEngineImageReadiness: nodes [testser.local] don't have the engine image longhornio/longhorn-engine:v1.2.3"
full context for
instance-manager
[longhorn-instance-manager] time="2022-07-19T09:21:33Z" level=debug msg="Process Manager: start getting logs for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:33Z" level=debug msg="Process Manager: got logs for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: prepare to delete process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: deleted process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: trying to stop process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="wait for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c to shutdown"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: wait for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c to shutdown before unregistering process"
[pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c] time="2022-07-19T09:21:35Z" level=warning msg="Received signal interrupt to shutdown"
time="2022-07-19T09:21:35Z" level=warning msg="Starting to execute registered shutdown func <http://github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4|github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4>"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="Process Manager: process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c stopped"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: prepare to delete process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: deleted process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="Process Manager: successfully unregistered process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:36Z" level=info msg="Process Manager: successfully unregistered process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
And this appears for many of my PVCs, for example for Prometheus
and for the whole Redis stack.
Any tips on why this is happening?
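In the meantime I also plan to watch the Longhorn volume and replica state while an upload is in progress, to see exactly when the replica on 192.168.1.163 is marked failed and when rebuilding starts. Roughly (assuming the Longhorn CRDs live in longhorn-system; the output columns may differ between versions):
# overall robustness of the volumes, watched live
kubectl -n longhorn-system get volumes.longhorn.io -w
# per-replica state (running/failed) and node placement
kubectl -n longhorn-system get replicas.longhorn.io -o wide
# recent Longhorn events, e.g. replica failures and rebuild start/finish
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 20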