On April 22, 2025 8:34:35 PM GMT+03:00, Toomas Soome <tso...@me.com> wrote:
>
>
>> On 22. Apr 2025, at 18:23, Sulev-Madis Silber 
>> <freebsd-current-freebsd-org...@ketas.si.pri.ee> wrote:
>> 
>> well i don't have those errors anymore so there's nothing to give
>> 
>> i've tried to tune arc but it didn't do anything so i took those things off 
>> again
>> 
>> right now i'm looking at
>> 
>> ARC: 1487M Total, 1102M MFU, 128M MRU, 1544K Anon, 56M Header, 199M Other
>>     942M Compressed, 18G Uncompressed, 19.36:1 Ratio
>> 
>> and wonder wtf
>> 
>> i bet there's an issue somewhere and i somehow can't properly recreate it. on 
>> memory pressure it does resize the arc down properly, so it seems like i don't 
>> need any limits
>> 
>> and there's no tmpfs. it would be useless at such low memory sizes
>> 
>> the problem is that i can't figure out what all those problems are, how to 
>> recreate those conditions, and how to work around them or maybe find the bugs. 
>> i also don't have enough hw to test it on by itself, unless i can maybe try it 
>> on a tiny 512m vm. and then i would need to know what to try
>> 
>> i also don't know why those git settings help me:
>> 
>> [core]
>>        packedGitWindowSize = 32m
>>        packedGitLimit = 128m
>>        preloadIndex = false
>> [diff]
>>        renameLimit = 16384
>> 
>> how to tune it from some global place, and so on. and why would it even need 
>> so much fiddling? zfs indeed has improved a lot, previously it was quite a 
>> hell to use
>> 
>> i don't even know if this is related to mmap. even then, i don't really get 
>> what that function even does. hence the "zfs (?) issue". it might not even 
>> be zfs at all
>> 
>> there are probably multiple combined issues here
>> 
>> i also don't really buy the idea that a ton of ram would automatically fix this
>> 
>> so yeah unsure what to think of this
>> 
>> some of the issues i found that others also have. some of them seem new
>> 
>> some fixes looked like pure trial and error and nobody seemed to even know 
>> what's wrong. granted, that was a forum, so maybe it's better here?
>> 
>> i mean i have used below average equipment my entire life and the usual way to 
>> cope with that is to just give it more time. put in more swap and just wait
>> 
>> i think someone tested my git issues in a 4g vm and found no issues at all? 
>> other things seem like only i have them
>> 
>> i also find it kind of confusing that if this is hw, why don't i see any other 
>> issues
>> 
>> this is not the first time that i have found something confusing in fbsd 
>> that later turned out to be a bug and was further tested and fixed by others
>> 
>> hence the current mailing list, so maybe someone else has ideas, or maybe it 
>> already has a fix. and i hope there are people with much larger labs who could 
>> easily tell / test things
>> 
>> so in the end,
>> 
>> 1) why should git on a large repo cause the machine to run out of memory, 
>> instead of just being as slow as it needs to be
>
>
>um, because it is buggy? Or pick some other fun reason, because this 
>wording does not really make much sense.

how to "un"bug it? why is it allowed to trash system? i'm not expecting git run 
to cause git, sshd, getty, syslog, etc to die. i don't think anyone wants it. 
no swapping, system down in few seconds. sadly it's hard to repeat as it needs 
large repo and ton of things to update. could it give some clues? some say git 
is bad too. maybe. first vcs with sane ui for me. i'm up for better tools
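
one thing i might try, purely as a workaround sketch and not as the proper fix:
cap the memory of the git process itself, so a runaway git fails alone instead of
taking sshd, getty and syslog down with it. something like this (flags, sizes and
the repo path are from memory / examples, so check limits(1) and sh(1) first):

    # cap git's virtual memory to ~2g so it dies alone on a runaway allocation
    limits -v 2g git -C /usr/ports pull

    # roughly the same thing with the shell builtin, size in kbytes
    ( ulimit -v 2097152 ; git -C /usr/ports pull )

that obviously doesn't explain why the kernel lets it trash everything, but it
would at least keep the box alive while testing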

>
>
>> 
>> 2) why / what are fs operations that could cause low power machine to 
>> mysteriously fail on zfs, when expected results would be slow fs behaviour
>> 
>
>define low power? in general, failures on systems with limited resources 
>hint at a lack of testing and bug hunting on such systems. Over time, there 
>have been improvements, but this is an almost never-ending task.

low power here means something like a core 2 duo with 4g ram. i'm sure there are 
embedded equivalents of that here too. most of my zfs tasks don't surprise me 
with the outcome, only a few do. how do i make all of them unsurprising?

>
>
>> i don't know what really happens and it's way too complex for me to get all the 
>> memory management that happens in the kernel. i only have this wild guess that 
>> any type of caching should happen in "leftover" ram and make things faster 
>> if possible, and that any fs operation the kernel has already reported as 
>> completed can't suddenly be found incomplete later. whatever that fs-related 
>> stray buildworld error was that resolved itself somehow. and what i can 
>> recreate
>> 
>
>default fs operations are asynchronous; if you want them to be “complete”, 
>that is, data on stable storage and a consistent state for the file system, you 
>need synchronous IO. But as always, there is a price to pay.

yeah, i know. but here i'm afraid i managed to trash some of it with my actions. 
the buildworld errors that i failed to capture were about being unable to create, 
find or open some files. i was like wtf. they never reappeared: ran it again, no 
errors, same tree. on the fs side there are no checksum errors, scrubs are fine, etc
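
next time i'll at least capture the whole build output so it can't scroll out of
the tmux history. trivial sketch, the log path and -j value are just what i'd pick:

    # keep a full typescript of the build, including any one-off fs errors
    script /var/tmp/buildworld.log make -j2 buildworld

    # or just duplicate stdout+stderr into a file
    make -j2 buildworld 2>&1 | tee /var/tmp/buildworld.log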

>
>
>> and i'm not expert in this so how do i even know?
>> 
>> what's fun is how running rsync over several tb's of data doesn't seem to 
>> cause any issues at all. this is still the same machine that many would not 
>> recommend using for this. different workload?
>> 
>
>If you are comparing git with rsync, you want to make sure you have up to date 
>git. There are git versions with rather nasty bugs.

latest git

>
>
>> hell knows what all this is. maybe later i could figure it out or actually 
>> save some logs. those i didn't save as i assumed it would repeat itself; it 
>> didn't and it scrolled out of tmux window history
>> 
>> oh well. yes, this is a questionable report but those are "heisenbugs" as 
>> well. at least some of them?
>> 
>
>
>Heisenbug is a bug for which we do not yet know the trigger mechanism; it does 
>not mean it does not have such a mechanism.

i have no idea how to test. maybe it's disk io speed, maybe cpu speed, maybe ram 
size. i didn't write git, the fbsd kernel or (open)zfs, and i didn't build the 
machine's hardware either, but i managed to break one or more of them. i have 
broken things before, so call me a good tester. some of those ended up being 
actual bugs, found by "why would anybody ever do this". note that git trashing 
fbsd when zfs is used is not only seen by me, but the others seemed about as 
stuck as me, which was not helping. what it is, i'm unsure. apparently i can't 
even explain what i do. so i guess it's up to me to test it in the end.

this could mean running actual hw with different setups, or maybe it could be 
emulated. but emulation could also emulate the bugs away. what if that someone's 
4g test vm was on top notch hw with nvme storage, so io was super fast and fs 
operations completed immediately even in their async form, never hogging the 
system? that would mean the bug is still left in and only appears in some cases. 
it's a wild guess too. it might be a legit bug, only surfacing in some cases, 
only to maybe surface later. imagine your vm host experiencing resource 
exhaustion, causing guests to behave exactly like old actual hw, letting the bug 
appear

anyway i was hoping i could find more knowledge and hw here

i also wondered if this is just my hw, but at least for 1 of the 2 issues, someone 
else also reported having that problem

so yeah, hard to figure this all out

so far i tried arc limits; it went past them. is there a known way for zfs to 
take memory outside of the arc? like all memory, fast?
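
to at least see where the memory goes when it blows up, i'm thinking of logging
the arc size against wired and free memory while a git pull runs. rough sketch;
i believe these sysctl names exist on 13.4, but treat that as an assumption:

    # sample arc size vs wired/free memory every 2 seconds during the pull
    # (v_wire_count and v_free_count are in pages, usually 4k each)
    while :; do
        sysctl -n kstat.zfs.misc.arcstats.size \
            vm.stats.vm.v_wire_count vm.stats.vm.v_free_count | tr '\n' ' '
        date
        sleep 2
    done >> /var/tmp/memlog.txt

if wired memory climbs while arcstats.size stays put, the memory is going
somewhere other than the arc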

tests done by others confirmed that ufs = ok, zfs = fail

as for the git pull + rm obj tree + make buildworld issue, i can't find anything 
on it. i even tried to look for already fixed openzfs bugs, related or not to it 
being on a fbsd host. it's a massive beast, and without fs dev experience i 
couldn't really find any

since it all seems to revolve around git: any ideas? even if git is buggy, if it 
brings other bugs out, that's good, no? it wasn't git that failed. it seems like 
git causes zfs to consume a huge ton of memory. i even ran fs test tools, hoping 
to cause a failure again, and memory test tools too. i read that my git config 
settings affect mmap, which as far as i know is not a way to consume all actual 
ram, but rather maps files into a process's address space so pages get read in 
on demand. i tried to run those tests as well, couldn't get it to fail. don't 
know what it is
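
as far as i understand it, packedGitWindowSize / packedGitLimit only control how
big a slice of each packfile git keeps mmap'ed at a time, so shrinking them
shrinks git's address space use, not necessarily actual ram. one way i figure i
could watch it, just an idea with hand-wavy pid handling and an example repo path:

    # start the pull, give it time to touch the packfiles, then dump its
    # memory map and look for the .pack entries
    git -C /usr/ports pull &
    sleep 30
    procstat -v $! > /var/tmp/git-maps.txt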

right now i don't have test-to-destruction equipment here, so i'm looking for some 
nondestructive methods, or at least a way to confirm it's an old, already fixed bug

so no idea what to think of all this. it's also all pretty much stock, i don't 
have my own changes to any of this. the only thing to blame would be running old 
hw, if that's even the issue. i have looked into many other tuning options as 
well; they didn't help or i didn't find any. so i don't have a method to test it, 
or to find a tunable or the bug
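
for the record, the knobs i'd still poke at, purely as guesses and not as known
fixes:

    # cap the arc (older trees spell it vfs.zfs.arc_max)
    sysctl vfs.zfs.arc.max=1073741824

    # make the kernel hold off longer before it starts killing processes
    # under memory pressure (default is 12 if i remember right)
    sysctl vm.pageout_oom_seq=120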

i guess i'll drop it again for a year or more and come back later, maybe when i 
have 14 either on this machine or some other machine. or current, or

unsure, maybe it's only me who found this. but then again, not really?

tl;dr - i seem to have certain unusual zfs issues on a low ram box that i can't 
put my finger on, despite putting effort into it

>
>rgds,
>toomas
>
>> 
>> 
>> On April 22, 2025 3:52:11 PM GMT+03:00, Ronald Klop <ronald-li...@klop.ws> 
>> wrote:
>>> Hi,
>>> 
>>> First, instead of writing "it gives vague errors", it really helps others 
>>> on this list if you can copy-paste the errors into your email.
>>> 
>>> Second, as far as I can see FreeBSD 13.4 uses OpenZFS 2.1.14. FreeBSD 14 
>>> uses OpenZFS 2.2.X which has bugfixes and improved tuning, although I 
>>> cannot claim that will fix your issues.
>>> What you can try is to limit the growth of the ARC.
>>> 
>>> Set "sysctl vfs.zfs.arc_max=1073741824" or add this to /etc/sysctl.conf to 
>>> set the value at boot.
>>> 
>>> This will limit the ARC to 1GB. I used similar settings on small machines 
>>> without really noticing a speed difference while usability increased. You 
>>> can play a bit with the value. Maybe 512MB will be even enough for your use 
>>> case.
>>> 
>>> NB: sysctl vfs.zfs.arc_max was renamed to vfs.zfs.arc.max with arc_max as a 
>>> legacy alias, but I don't know if that already happened in 13.4.
>>> 
>>> Another thing to check is the usage of tmpfs. If you don't restrict the max 
>>> size of a tmpfs filesystem it will compete for memory. Although this will 
>>> also show an increase in swap usage.
>>> 
>>> Regards,
>>> Ronald.
>>> 
>>> 
>>> Van: Sulev-Madis Silber <freebsd-current-freebsd-org...@ketas.si.pri.ee>
>>> Datum: maandag, 21 april 2025 03:25
>>> Aan: freebsd-current <freebsd-current@freebsd.org>
>>> Onderwerp: zfs (?) issues?
>>>> 
>>>> i have long running issue in my 13.4 box (amd64)
>>>> 
>>>> others don't get it at all and only suggest adding more than 4g ram
>>>> 
>>>> it manifests as some mmap or other problems i don't really get
>>>> 
>>>> basically unrestricted git consumes all the memory. i had to turn the watchdog 
>>>> on because something like a git pull on the ports tree causes the kernel to 
>>>> take 100% of ram. it keeps killing userland off until it's just the kernel 
>>>> running there happily. it never panics, and killing off userland obviously 
>>>> makes the problem disappear as nothing will do any fs operations anymore
>>>> 
>>>> dovecot without tuning or with some tuning tended to do this too
>>>> 
>>>> what is it?
>>>> 
>>>> now i noticed another issue. if i happen to do too many src git pulls in a 
>>>> row (they never actually "pull" anything) and / or clean my obj tree out, 
>>>> i can't run buildworld anymore. it gives vague errors
>>>> 
>>>> if i wait a little before starting buildworld, it always works
>>>> 
>>>> what could possibly be happening here? the way the buildworld fails means 
>>>> there's a serious issue with the fs. and how could it be fixed by waiting? it 
>>>> means that some fs operations are still going on in the background
>>>> 
>>>> i have no idea what's happening here. zfs doesn't report any issues, nor 
>>>> does the storage. nothing was killed due to out of memory, but arc usage 
>>>> somehow increased a lot. and its compression ratio went weirdly high, like 
>>>> ~22:1 or so
>>>> 
>>>> i don't know if it's acceptable zfs behaviour if it runs low on memory or 
>>>> not. how to test it. etc. and if this is fixed on 14, on stable, or on 
>>>> current. i don't have enough hw to test it on all
>>>> 
>>>> i have done other stuff on that box that might also be improper for the 
>>>> amount of ram i have there, but then it's just slow, nothing fails like this
>>>> 
>>>> unsure how this could be fixed or tuned or something else. or why does it 
>>>> behave like this. as opposed to usual low resource issues that just mean 
>>>> you need more time
>>>> 
>>>> i mean it would be easy to add huge amounts of ram but people could also 
>>>> want to use zfs in slightly less powerful embedded systems where lack of 
>>>> power is expected but weird fails maybe not
>>>> 
>>>> so is this a bug? a feature? something fixed? something that can't be 
>>>> fixed? what could be acceptable ram size? 8g? 16g? and why can't it just 
>>>> tune everything down and become slower as expected
>>>> 
>>>> i tried to look up any openzfs related bugs, but zfs is huge and i'm not an 
>>>> fs expert either
>>>> 
>>>> i also don't know what happens while i wait. it doesn't show any serious 
>>>> io load. no cpu is taken. load is down. system is responsive
>>>> 
>>>> it all feels like bug still
>>>> 
>>>> i have wondered if this is second hand hw acting up, but i checked and 
>>>> tested it as best as i could; and why would it only bug out when i try more 
>>>> complex things on zfs?
>>>> 
>>>> i'm curious about using zfs on super low memory systems too, because it 
>>>> offers certain features. maybe we could fix this if whole issue is ram. or 
>>>> if it's elsewhere, maybe that too
>>>> 
>>>> i don't know what to think of this all. esp the last issue. i'm not really 
>>>> alone here with earlier issues but unsure
>>>> 
>>>> 
>>>> 
>>> 
>> 
>
>
