On April 23, 2025 6:40:44 PM GMT+03:00, void <v...@f-m.fm> wrote:
>On Mon, Apr 21, 2025 at 04:25:16AM +0300, Sulev-Madis Silber wrote:
>> i have long running issue in my 13.4 box (amd64)
>>
>> others don't get it at all and only suggest adding more than 4g ram
>>
>> it manifests as some mmap or other problems i don't really get
>>
>> basically unrestricted git consumes all the memory.
>
>I see symptoms like this on very slow media. That media might
>be inherently slow (like microSD) or it might mean the hd/ssd is worn out [1].
>Programs like git and subversion read and write lots of small files, and the
>os/filesystem might not be able to write
>to slow media as fast as git would like.
>[1] observed failure mode of some hardware, where writes just get slower
>and slower.
>
>[2] the workaround where the machine *has* to use micro sd, in my
> example, to update ports, was to download latest ports.tar.xz and
> unzip it, rather than use git.
>
>[3] test hd performance with fio (/usr/ports/benchmarks/fio)
That might be it!? There is an HDD in that machine which has been tested
before, but nowadays it never really likes to complete the long SMART tests,
and even the short ones take ages. There are no "usual" disk errors, though.
That HDD is part of the two-disk mirror that the git tree lives on. I'll
probably start by benchmarking it as suggested (see the fio sketch below).
But there could be a fix for this so that it never affects people. I don't
know how the internals really work, but slow I/O could fill the buffers up.
Those could be checked and the filesystem could be limited, e.g. by simply
not reporting the write as OK yet. That would make things slower when the
queue is full, so git would wait. I bet there are checks for this, maybe they
just don't work well? It can't just be blindly accepting writes and hoping
they can be committed to storage at some future time.
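As far as I understand, OpenZFS does have exactly such a throttle: dirty
(not yet written) data is capped, and writers get delayed as the cap is
approached. A quick sketch of how I'd inspect and tighten it on a 13.x box;
the sysctl names are the stock OpenZFS ones as far as I know, and the 256 MB
value is just an illustrative guess, not a recommendation:

  # how much dirty data ZFS will accumulate before throttling writers
  sysctl vfs.zfs.dirty_data_max           # absolute cap, in bytes
  sysctl vfs.zfs.dirty_data_max_percent   # default cap as a percentage of RAM
  sysctl vfs.zfs.delay_min_dirty_percent  # point where write delays kick in

  # possible mitigation on a small-RAM box with slow disks:
  # shrink the dirty data cap so the write backlog stays bounded
  sysctl vfs.zfs.dirty_data_max=268435456   # 256 MB, example value only

If that throttle worked as intended here, git should simply have stalled
instead of the whole box running out of RAM.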
Or I could be wrong and it's some other issue.
I'm wondering why nobody else spots this much, though? I/O could be slow
because the media is abnormally slow by design, or because it is failing,
but it could also be that the influx is simply beyond what the storage can
do. And that could happen on fast machines too, by accident or even as an
attack, if I understand this correctly. OOM protection won't save any
userland process here either, since it truly was the kernel wanting to
allocate all of the RAM, which it did. IIRC I didn't see a single userland
process still running, but I couldn't check either. The kernel itself kept
running perfectly fine after that. The fix for that particular failure is to
enable the watchdog, of course. I think I've seen it on another machine as
well but never realized; or maybe that was hardware there and the kernel was
also frozen. When I turned to check, I found bulging caps.
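For reference, this is roughly all the watchdog setup takes with the
base-system watchdogd; the protect(1) bit is just for completeness, since,
as noted above, it can't help when the kernel itself is eating the RAM:

  # enable the watchdog daemon so a wedged box reboots on its own
  sysrc watchdogd_enable="YES"
  service watchdogd start

  # mark a process as exempt from the OOM killer (placeholder pid);
  # only useful when userland is the thing being killed
  protect -p <pid>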
If all of the above is correct, it could be tested easily, with gnop maybe.
I don't see a throughput-limiting option, but I do see delay options. Maybe
I can even test it now, as it doesn't need huge RAM just to prove the point
that memory fills up completely (see the sketch below).
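Something along these lines, perhaps. This assumes gnop's -d delay
(milliseconds) together with the -q/-x read/write delay probabilities behave
the way gnop(8) describes; the md unit, pool name and repo path are made up:

  # build a file-backed md device so no real disk is at risk
  truncate -s 4g /tmp/slowdisk.img
  mdconfig -a -t vnode -f /tmp/slowdisk.img -u 9

  # delay every read and write by 200 ms -> /dev/md9.nop
  gnop create -d 200 -q 100 -x 100 /dev/md9

  # put a throwaway pool on the slow provider and hammer it
  zpool create slowtest /dev/md9.nop
  git clone /path/to/some/repo /slowtest/repo

  # meanwhile watch wired memory / ARC / dirty data climb in top(1)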
And this is not fixed on CURRENT either? And the fix would be in ZFS? And
UFS, as tested by others, would not be affected... why? I know ZFS does CoW,
but still. Anyway, I can't figure it out; that's why I don't develop
filesystems. Maybe tell Kirk even? :p
What's funny is, how did the kernel know to stop there? Was it just because
it finally was reaching the actual RAM limit, or just because the writes
stopped once git was killed? I'm not sure what "kernel memory full" could
even mean. Always a panic, since you can't kill things in the kernel? Or
could it return errors, or delay, like delaying the completion of syscalls?
I'm unsure about all this and I'll leave it for others. But it's often said
that full memory in the kernel leads to panics. Couldn't it just error out?
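Next time I'll try to have something logging before userland dies, so there
is at least a record of what was eating the memory; a crude loop like this,
run over serial or with its output on another disk (sysctl names are the
stock FreeBSD/OpenZFS ones as far as I know):

  # crude memory watch, appending a snapshot every 30 seconds
  while :; do
      date
      sysctl vm.stats.vm.v_wire_count vm.stats.vm.v_free_count
      sysctl kstat.zfs.misc.arcstats.size
      vmstat -m | head -30    # kernel malloc consumers
      vmstat -z | head -30    # UMA zones, where ZFS buffers live
      sleep 30
  done >> /var/log/memwatch.log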
tl;dr - suspected issue of ZFS on a slow device filling up the *entire* RAM
with write buffers, leaving userland killed and the system in an unusable
state.