mysterious-rose-43856

02/10/2023, 5:13 PM
I have Longhorn 1.4.0 on 4 clusters or so and have experienced unexpected downtime. The volumes show as "Attached" but not ready, and their workloads will not spin down. I'm investigating, but this is most unfortunate as I'm discovering this mid-migration for some important workloads.
image.png
image.png
image.png
Other clusters with replicas are also rebuilding too often
image.png
image.png
image.png
There is excessive CPU usage involved

full-crayon-745

02/10/2023, 5:44 PM
Hi. We had some issues with Longhorn not being able to complete the process of rebuilding replicas in the past. We reduced the number of concurrent replicas and increased the CPU allowance for the rebuild process in the Longhorn settings and that helped with the issue. Maybe it would help you too.

mysterious-rose-43856

02/10/2023, 5:44 PM
Alright
In at least one case though they are single replicas on single nodes with single workloads. No rebuild needed.
In fact at this point I've rebooted nodes in different clusters having this issue and after boot they're "back to normal", but since that isn't something Kubernetes itself did, it isn't really "self healing"
Something seems to be related... all the affected volumes have a Redis deployment saving backups on them. So for the time being I'm stopping that (I CAN store those in other volume types that might be more reliable).

narrow-egg-98197

02/12/2023, 3:58 PM
Hi @mysterious-rose-43856, I'd like to try and understand your situation a bit more. Are you using Longhorn to store Redis data? Is this happening while doing a snapshot/backup but the Redis service is being restarted?

mysterious-rose-43856

02/12/2023, 6:00 PM
Well, I no longer think it is related to Redis; I'm thinking more and more that this is a Longhorn issue. I'm testing right now with volumes re-created as RWO instead of RWX (not a great long-term workaround, but it can work if it must).

narrow-egg-98197

02/14/2023, 4:02 PM
It seems like the replica is not being rebuilt normally, and I saw this error message: `cannot add new replica larger than size 5368709120`. I think the replica's actual size might exceed the Longhorn volume size.
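For reference, assuming the size in that error message is in bytes, it works out to exactly 5 GiB. Here is a minimal Python sketch of that arithmetic and of the comparison I'm suggesting. The helper names `to_gib` and `replica_fits` are hypothetical and only illustrate the idea, they are not Longhorn's actual check:

```python
GIB = 1024 ** 3  # bytes per GiB


def to_gib(size_bytes: int) -> float:
    """Convert a size in bytes to GiB."""
    return size_bytes / GIB


def replica_fits(replica_bytes: int, volume_bytes: int) -> bool:
    """Hypothetical check: a replica can only be added if it is not larger than the volume."""
    return replica_bytes <= volume_bytes


# Size taken from the error message, assumed to be in bytes.
volume_size = 5368709120
print(to_gib(volume_size))  # 5.0
```

Comparing the replica's actual size on disk against this value would show whether the replica really is larger than the volume.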