# longhorn-storage

adamant-kite-43734

03/09/2023, 9:18 AM
This message was deleted.

big-judge-33880

03/09/2023, 1:27 PM
Sorted out the prometheus PVC situation and the dmesg spam persists, so it definitely wasn't related. This is LH 1.3.2.

famous-shampoo-18483

03/09/2023, 4:28 PM
Yes, some of the controllers in the longhorn-manager pods will reconcile Longhorn resources every 30s. Besides, there is a disk monitor that checks and summarizes the data path info at the same interval. Currently I am not sure which one is the culprit. Can you check the availability of each disk the Longhorn nodes are using and see if there are any clues? BTW, you can provide the support bundle as well.

big-judge-33880

03/09/2023, 6:45 PM
All of the disks appear healthy, but there’s one that seems a bit… overloaded according to longhorn (I assume used + reserved = used?)
Looking at longhorn-manager logs was a good idea and makes it easy enough to confirm the source of the messages though:
Copy code
time="2023-03-09T18:44:37Z" level=info msg="Failed backing image data source zzz-image-4gr75 is still in the backoff window, Longhorn cannot recreate pod for it" backingImageDataSource=zzz-image-4gr75 controller=longhorn-backing-image-data-source diskUUID=7606d6b6-7ee6-4f4b-a26c-5e1043c0b409 node=har-03 nodeID=har-03 parameters="map[url:http://zzz/zzz-eval-datadisk.qcow2]" sourceType=download
time="2023-03-09T18:45:07Z" level=info msg="Failed backing image data source zzz-image-4gr75 is still in the backoff window, Longhorn cannot recreate pod for it" backingImageDataSource=zzz-image-4gr75 controller=longhorn-backing-image-data-source diskUUID=7606d6b6-7ee6-4f4b-a26c-5e1043c0b409 node=har-03 nodeID=har-03 parameters="map[url:http://zzz/zzz-eval-datadisk.qcow2]" sourceType=download
Messages in the kernel log on the same node coincide perfectly, arriving 10 seconds after the longhorn-manager messages:
Copy code
[Thu Mar  9 18:44:47 2023] buffer_io_error: 17 callbacks suppressed
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-0, logical block 2441200, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-0, logical block 2441200, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-0, logical block 2441200, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-1, logical block 1220592, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-1, logical block 1220592, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-1, logical block 1220592, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-2, logical block 24413168, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-2, logical block 24413168, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-2, logical block 24413168, async page read
[Thu Mar  9 18:44:47 2023] Buffer I/O error on dev dm-3, logical block 2441200, async page read
[Thu Mar  9 18:45:17 2023] buffer_io_error: 17 callbacks suppressed
[Thu Mar  9 18:45:17 2023] Buffer I/O error on dev dm-0, logical block 2441200, async page read
[Thu Mar  9 18:45:17 2023] Buffer I/O error on dev dm-0, logical block 2441200, async page read
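The 10-second offset between the two logs can be confirmed mechanically. A minimal sketch using GNU `date`, with one sample timestamp from each excerpt above (assuming the kernel log timestamps are in UTC, which matches the longhorn-manager log here):

```shell
# One longhorn-manager timestamp and one dmesg -T timestamp, copied from
# the logs above; both parsed as UTC epoch seconds, then subtracted.
lh="2023-03-09T18:44:37Z"            # longhorn-manager log line
kern="Thu Mar  9 18:44:47 2023"      # kernel log line, assumed UTC
lh_s=$(date -u -d "$lh" +%s)
kern_s=$(date -u -d "$kern" +%s)
echo "offset: $((kern_s - lh_s))s"   # prints "offset: 10s"
```

The same subtraction over a longer window would show whether the kernel errors track the 30-second reconcile interval.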

famous-shampoo-18483

03/10/2023, 10:44 AM
In the longhorn-manager and the instance-manager-r of node har-04, I saw lots of thread creation errors:
Copy code
2023-03-05T08:28:33.122418486Z [pvc-4d2d11a4-1966-4d98-981b-9180f3d9f98b-r-8814ee8e] runtime/cgo: pthread_create failed: Resource temporarily unavailable
2023-03-05T08:28:33.129791854Z [pvc-4d2d11a4-1966-4d98-981b-9180f3d9f98b-r-8814ee8e] SIGABRT: abort
2023-03-05T08:28:33.148562182Z PC=0xc4ba6b m=25[pvc-4d2d11a4-1966-4d98-981b-9180f3d9f98b-r-8814ee8e]  sigcode=18446744073709551610
2023-03-05T08:28:33.148587426Z 
2023-03-05T08:28:33.148595775Z goroutine 0 [idle]:
2023-03-05T08:28:33.148602116Z runtime: unknown pc 0xc4ba6b
2023-03-05T08:28:33.148607809Z stack: frame={sp:0x7fe882ffcc10, fp:0x0} stack=[0x7fe8827fd688,0x7fe882ffd288)
2023-03-05T08:28:33.148613164Z 00007fe882ffcb10:  0000000000000000  0000000000000000 
......

2023-03-05T08:28:33.148852993Z goroutine 1 [chan receive, 40 minutes]:
2023-03-05T08:28:33.148856114Z github.com/longhorn/longhorn-engine/app/cmd.startReplica(0xc0001cd4a0, 0xeccbc3, 0x0)
2023-03-05T08:28:33.148859526Z  /go/src/github.com/longhorn/longhorn-engine/app/cmd/replica.go:168 +0x4f8
......

2023-03-05T08:28:33.148902147Z goroutine 19 [IO wait]:
...
2023-03-05T08:28:33.148960009Z  /go/src/github.com/longhorn/longhorn-engine/app/cmd/replica.go:123 +0x397
2023-03-05T08:28:33.148963065Z  created by github.com/longhorn/longhorn-engine/app/cmd.startReplica
2023-03-05T08:28:33.148966036Z  /go/src/github.com/longhorn/longhorn-engine/app/cmd/replica.go:108 +0x300
......

2023-03-05T08:28:33.149152608Z goroutine 22 [syscall, 40 minutes]:
2023-03-05T08:28:33.149155801Z os/signal.signal_recv(0x0)
2023-03-05T08:28:33.149158776Z  /usr/local/go/src/runtime/sigqueue.go:147 +0x9c
2023-03-05T08:28:33.149161936Z os/signal.loop()
2023-03-05T08:28:33.149165031Z  /usr/local/go/src/os/signal/signal_unix.go:23 +0x22
2023-03-05T08:28:33.149167964Z created by os/signal.Notify.func1
2023-03-05T08:28:33.149171057Z  /usr/local/go/src/os/signal/signal.go:127 +0x44
......

2023-03-05T08:28:33.149248386Z goroutine 194484 [runnable]:
...
2023-03-05T08:28:33.149308242Z github.com/longhorn/longhorn-engine/pkg/dataconn.(*Wire).Read(0xc00080e600, 0x111ef60, 0xc000120b40, 0xc00230acc0)
2023-03-05T08:28:33.149311502Z  /go/src/github.com/longhorn/longhorn-engine/pkg/dataconn/wire.go:69 +0x76
2023-03-05T08:28:33.149314623Z github.com/longhorn/longhorn-engine/pkg/dataconn.(*Server).readFromWire(0xc000120b40, 0xc000714180)
2023-03-05T08:28:33.149331016Z  /go/src/github.com/longhorn/longhorn-engine/pkg/dataconn/server.go:37 +0x3d
2023-03-05T08:28:33.149346145Z  created by github.com/longhorn/longhorn-engine/pkg/dataconn.(*Server).read
2023-03-05T08:28:33.149350409Z  /go/src/github.com/longhorn/longhorn-engine/pkg/dataconn/server.go:60 +0x83
......
This is probably one reason. But the problematic node I found is har-04, not the har-03 you mentioned... For the thread issue, some users have actually reported it before. They fixed the problem by changing the PID limit of the container runtime. Please check this for details: https://github.com/longhorn/longhorn/issues/3178
As for the error you mentioned, you can try to fix it by deleting then recreating the backing image samhandling-image-4gr75
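For reference, the fix in that issue amounts to raising the per-pod PID limit at the kubelet level. A hedged sketch (`podPidsLimit` is the upstream KubeletConfiguration field; the value 16384 is only illustrative, and distro kubelets such as k3s/RKE2 pass it through their own flag syntax):

```yaml
# KubeletConfiguration fragment, e.g. /var/lib/kubelet/config.yaml.
# podPidsLimit caps the number of PIDs each pod may use; the systemd
# TasksMax on kubepods.slice still applies on top of this.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 16384
```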

big-judge-33880

03/10/2023, 11:57 AM
Thank you very much - this is very helpful for further understanding the bits and bobs of longhorn! I see the kubepods systemd slice that runs on har-04 is indeed somewhat resource-starved when it comes to processes/threads:
Copy code
● kubepods.slice - libcontainer container kubepods.slice
     Loaded: loaded (/run/systemd/transient/kubepods.slice; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/kubepods.slice.d
             └─50-CPUShares.conf, 50-MemoryLimit.conf, 50-TasksMax.conf
     Active: active since Thu 2023-02-16 15:47:13 UTC; 3 weeks 0 days ago
      Tasks: 63609 (limit: 65536)
So I guess the consequence of this is that a volume with 3 replicas specified could end up not being able to produce all three replicas?
Seems my task quota was taken up by an army of zombie mount processes spawned by:
Copy code
cat /proc/26609/cmdline
longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:v1.3.2 --instance-manager-image longhornio/longhorn-instance-manager:v1_20221003 --share-manager-image longhornio/longhorn-share-manager:v1_20221003 --backing-image-manager-image longhornio/backing-image-manager:v3_20221003 --manager-image longhornio/longhorn-manager:v1.3.2 --service-account longhorn-service-account
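A quick way to see which commands the zombies belong to (a generic ps/awk sketch, not Longhorn-specific) is to group processes in state `Z` by command name:

```shell
# List zombie ('Z' state) processes grouped by command name, highest
# count first. On an affected node this should surface the mount zombies
# eating into the kubepods.slice Tasks limit shown above.
ps -eo stat=,comm= \
  | awk '$1 ~ /^Z/ { n[$2]++ } END { for (c in n) print n[c], c }' \
  | sort -rn
```

Comparing that count against `systemctl show kubepods.slice -p TasksMax` shows how close the slice is to its limit.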
Of course, the Buffer I/O messages keep ticking after removing the malfunctioning backing image, and the nodes that have that issue don't have the pthread exhaustion issue. I wonder what it could be.

famous-shampoo-18483

03/17/2023, 2:08 AM
Sorry for the late reply. The IO errors here are not triggered by the backing image. The resource starvation caused by the zombie processes is the culprit. There was a ticket related to a flood of zombie unmount processes: https://github.com/longhorn/longhorn/issues/5296. I am not sure if this is what you encountered.