On April 23, 2025 6:40:44 PM GMT+03:00, void <v...@f-m.fm> wrote:
>On Mon, Apr 21, 2025 at 04:25:16AM +0300, Sulev-Madis Silber wrote:
>> i have long running issue in my 13.4 box (amd64)
>>
>> others don't get it at all and only suggest adding more than 4g ram
>>
>> it manifests as some mmap or other problems i don't really get
>>
>> basically unrestricted git consumes all the memory.
>
>I see symptoms like this on very slow media. That media might
>be inherently slow (like microSD) or it might mean the hd/ssd is worn out [1].
>Programs like git and subversion read and write lots of small files, and the
>os/filesystem might not be able to write
>to slow media as fast as git would like.
>[1] observed failure mode of some hardware, where writes just get slower
>and slower.
>
>[2] the workaround where the machine *has* to use micro sd, in my
> example, to update ports, was to download latest ports.tar.xz and
> unzip it, rather than use git.
>
>[3] test hd performance with fio (/usr/ports/benchmarks/fio)
That might be it!? There is an HDD in that machine which has been tested
before, but nowadays it never really likes to complete the long SMART tests,
and even the short ones take ages. There are no "usual" disk errors, though.
That HDD is part of the two-disk mirror that the git tree lives on. I'll
probably start by benchmarking it as suggested (see the fio sketch below).
But there could be a fix for this so that it never affects people. I don't
know how the internals really work, but slow I/O could fill the buffers up.
Those could be checked and the filesystem could be limited, e.g. by simply
not reporting the write as OK yet. That would make things slower when the
queue is full, so git would wait. I bet there are checks for this, maybe they
just don't work well? It can't just be blindly accepting writes and hoping
they can be committed to storage at some future time.
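As far as I understand, OpenZFS does have exactly such a throttle: dirty
(not yet written) data is capped, and writers get delayed as the cap is
approached. A quick sketch of how I'd inspect and tighten it on a 13.x box;
the sysctl names are the stock OpenZFS ones as far as I know, and the 256 MB
value is just an illustrative guess, not a recommendation:

  # how much dirty data ZFS will accumulate before throttling writers
  sysctl vfs.zfs.dirty_data_max           # absolute cap, in bytes
  sysctl vfs.zfs.dirty_data_max_percent   # default cap as a percentage of RAM
  sysctl vfs.zfs.delay_min_dirty_percent  # point where write delays kick in

  # possible mitigation on a small-RAM box with slow disks:
  # shrink the dirty data cap so the write backlog stays bounded
  sysctl vfs.zfs.dirty_data_max=268435456   # 256 MB, example value only

If that throttle worked as intended here, git should simply have stalled
instead of the whole box running out of RAM.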
Or I could be wrong and it's some other issue.
I'm wondering why nobody else spots this much, though? I/O could be slow
because the media is abnormally slow by design, or because it is failing,
but it could also be that the influx is simply beyond what the storage can
do. And that could happen on fast machines too, by accident or even as an
attack, if I understand this correctly. OOM protection won't save any
userland process here either, since it truly was the kernel wanting to
allocate all of the RAM, which it did. IIRC I didn't see a single userland
process still running, but I couldn't check either. The kernel itself kept
running perfectly fine after that. The fix for that particular failure is to
enable the watchdog, of course. I think I've seen it on another machine as
well but never realized; or maybe that was hardware there and the kernel was
also frozen. When I turned to check, I found bulging caps.
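For reference, this is roughly all the watchdog setup takes with the
base-system watchdogd; the protect(1) bit is just for completeness, since,
as noted above, it can't help when the kernel itself is eating the RAM:

  # enable the watchdog daemon so a wedged box reboots on its own
  sysrc watchdogd_enable="YES"
  service watchdogd start

  # mark a process as exempt from the OOM killer (placeholder pid);
  # only useful when userland is the thing being killed
  protect -p <pid>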
If all of the above is correct, it could be tested easily, with gnop maybe.
I don't see a throughput-limiting option, but I do see delay options. Maybe
I can even test it now, as it doesn't need huge RAM just to prove the point
that memory fills up completely (see the sketch below).
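Something along these lines, perhaps. This assumes gnop's -d delay
(milliseconds) together with the -q/-x read/write delay probabilities behave
the way gnop(8) describes; the md unit, pool name and repo path are made up:

  # build a file-backed md device so no real disk is at risk
  truncate -s 4g /tmp/slowdisk.img
  mdconfig -a -t vnode -f /tmp/slowdisk.img -u 9

  # delay every read and write by 200 ms -> /dev/md9.nop
  gnop create -d 200 -q 100 -x 100 /dev/md9

  # put a throwaway pool on the slow provider and hammer it
  zpool create slowtest /dev/md9.nop
  git clone /path/to/some/repo /slowtest/repo

  # meanwhile watch wired memory / ARC / dirty data climb in top(1)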
And this is not fixed on CURRENT either? And the fix would be in ZFS? And
UFS, as tested by others, would not be affected... why? I know ZFS does CoW,
but still. Anyway, I can't figure it out; that's why I don't develop
filesystems. Maybe tell Kirk even? :p
What's funny is, how did the kernel know to stop there? Was it just because
it finally was reaching the actual RAM limit, or just because the writes
stopped once git was killed? I'm not sure what "kernel memory full" could
even mean. Always a panic, since you can't kill things in the kernel? Or
could it return errors, or delay, like delaying the completion of syscalls?
I'm unsure about all this and I'll leave it for others. But it's often said
that full memory in the kernel leads to panics. Couldn't it just error out?
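Next time I'll try to have something logging before userland dies, so there
is at least a record of what was eating the memory; a crude loop like this,
run over serial or with its output on another disk (sysctl names are the
stock FreeBSD/OpenZFS ones as far as I know):

  # crude memory watch, appending a snapshot every 30 seconds
  while :; do
      date
      sysctl vm.stats.vm.v_wire_count vm.stats.vm.v_free_count
      sysctl kstat.zfs.misc.arcstats.size
      vmstat -m | head -30    # kernel malloc consumers
      vmstat -z | head -30    # UMA zones, where ZFS buffers live
      sleep 30
  done >> /var/log/memwatch.log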
tl;dr - suspected issue of ZFS on a slow device filling up the *entire* RAM
with write buffers, leaving userland killed and the system in an unusable
state.