flaky-coat-75909

07/19/2022, 9:42 AM
While using the Longhorn StorageClass, my CPU and disk IO saturation are very high. How can I debug the reason why this happens? I will give more info in the thread.
The disk IO saturation is very high (142%, 180%). It appears when someone uploads a file to redis-master (1 replica); the single file is nearly 15MB. It doesn't happen every time, only occasionally when someone uploads a file to Redis.
When the disk IO rises, the CPU saturation rises too, and here I can see the problem is on a specific node:
192.168.1.163
where it goes above 300% (Redis runs on this node).
When the error appears I have a failed replica, on
192.168.1.163
After several moments, rebuilding starts on node 163 (once it comes back to life),
then the state is updated on the other nodes (I guess),
and finally everything goes back to normal.
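To see what is actually hammering the disk on 192.168.1.163 while the upload happens, I was thinking of watching per-device and per-process IO directly on that node, roughly like this (assuming the sysstat tools, iostat and pidstat, are installed on the node and metrics-server is available for kubectl top; this is just a sketch of what I plan to run):
# extended per-device utilization, refreshed every second
iostat -x 1
# per-process disk read/write rates, to see whether redis, longhorn-manager or instance-manager is the writer
pidstat -d 1
# CPU usage per pod from the Kubernetes side, sorted by CPU
kubectl top pods -A --sort-by=cpu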
dmesg | tail
(on 192.168.1.163) returns:
[1594329.136250] sd 5:0:0:1: [sdd] Attached SCSI disk
[1594332.843527] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: (null)
[1594334.300191] IPv6: ADDRCONF(NETDEV_UP): lxc561a89fb6db7: link is not ready
[1594334.311581] eth0: renamed from tmpe73d5
[1594334.331348] IPv6: ADDRCONF(NETDEV_CHANGE): lxc561a89fb6db7: link becomes ready
[1594360.579431] EXT4-fs warning (device sdb): htree_dirblock_to_tree:984: inode #2: lblock 0: comm longhorn-manage: error -5 reading directory block
[1594364.917450] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: (null)
[1594366.276998] IPv6: ADDRCONF(NETDEV_UP): lxc377ac94e1cc2: link is not ready
[1594366.285463] eth0: renamed from tmp35dfa
[1594366.300134] IPv6: ADDRCONF(NETDEV_CHANGE): lxc377ac94e1cc2: link becomes ready
The full log (without | tail) is in the attachment.
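That EXT4 warning on sdb reports error -5 (EIO), so I also want to rule out a problem with the underlying disk itself on that node. A rough check I had in mind (assuming smartmontools is installed and that sdb really is the disk backing the Longhorn data path; I still need to confirm that mapping):
# confirm which device backs the Longhorn data directory
lsblk -o NAME,SIZE,MOUNTPOINT
# SMART health summary and full attributes for the suspect disk
smartctl -H /dev/sdb
smartctl -a /dev/sdb
# look for further IO errors on that device
dmesg | grep -iE 'sdb|i/o error'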
I also see alarming metric values:
sum(rate(container_memory_failures_total{pod!=""}[5m])) by (pod)
goes up to 7k+ for
longhorn-manager
while most apps stay below 100 for this time series.
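If I understand the cAdvisor metric correctly, container_memory_failures_total also carries a failure_type label, so I can try to separate major page faults (which actually hit the disk) from minor ones with a narrower query, something like this (assuming the standard cAdvisor labels are exposed in my setup):
sum(rate(container_memory_failures_total{pod!="", failure_type="pgmajfault"}[5m])) by (pod)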
And I see in the logs of
instance-manager
[pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c] time="2022-07-19T09:21:35Z" level=warning msg="Received signal interrupt to shutdown"
This PVC points to the Redis PVC. I also have some more logs:
time="2022-07-19T09:19:43Z" level=error msg="Error reading from wire: read tcp 10.0.2.60:36200->10.0.1.253:10016: use of closed network connection"
and for
longhorn-manager
time="2022-07-19T08:37:38Z" level=debug msg="CheckEngineImageReadiness: nodes [testser.local] don't have the engine image longhornio/longhorn-engine:v1.2.3"
full context for
instance-manager
[longhorn-instance-manager] time="2022-07-19T09:21:33Z" level=debug msg="Process Manager: start getting logs for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:33Z" level=debug msg="Process Manager: got logs for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: prepare to delete process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: deleted process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: trying to stop process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="wait for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c to shutdown"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: wait for process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c to shutdown before unregistering process"
[pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c] time="2022-07-19T09:21:35Z" level=warning msg="Received signal interrupt to shutdown"
time="2022-07-19T09:21:35Z" level=warning msg="Starting to execute registered shutdown func <http://github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4|github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4>"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="Process Manager: process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c stopped"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: prepare to delete process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=debug msg="Process Manager: deleted process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:35Z" level=info msg="Process Manager: successfully unregistered process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
[longhorn-instance-manager] time="2022-07-19T09:21:36Z" level=info msg="Process Manager: successfully unregistered process pvc-280f7155-bee9-4ede-9110-ea0ce977aba3-r-9078586c"
And this appears for many of my PVCs, for example for Prometheus
and for the whole Redis stack.
Any tips on why this is happening?
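In the meantime I also plan to watch the Longhorn volume and replica state while an upload is in progress, to see exactly when the replica on 192.168.1.163 is marked failed and when rebuilding starts. Roughly (assuming the Longhorn CRDs live in longhorn-system; the output columns may differ between versions):
# overall robustness of the volumes, watched live
kubectl -n longhorn-system get volumes.longhorn.io -w
# per-replica state (running/failed) and node placement
kubectl -n longhorn-system get replicas.longhorn.io -o wide
# recent Longhorn events, e.g. replica failures and rebuild start/finish
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 20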