Applied, thanks!

jbra...@dismail.de, on Sat, 06 Jan 2024 14:59:40 -0500, wrote:
> Well, we might as well document our conversation with Kent about bcachefs.
>
> ---
>  open_issues/bcachefs.mdwn | 326 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 326 insertions(+)
>  create mode 100644 open_issues/bcachefs.mdwn
>
> diff --git a/open_issues/bcachefs.mdwn b/open_issues/bcachefs.mdwn
> new file mode 100644
> index 00000000..aa39bce0
> --- /dev/null
> +++ b/open_issues/bcachefs.mdwn
> @@ -0,0 +1,326 @@
> +[[!meta copyright="Copyright © 2007, 2008, 2010, 2011 Free Software Foundation,
> +Inc."]]
> +
> +[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
> +id="license" text="Permission is granted to copy, distribute and/or modify this
> +document under the terms of the GNU Free Documentation License, Version 1.2 or
> +any later version published by the Free Software Foundation; with no Invariant
> +Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license
> +is included in the section entitled [[GNU Free Documentation
> +License|/fdl]]."]]"""]]
> +
> +[[!tag open_issue_hurd]]
> +
> +The Hurd's primary filesystem is ext2, which works but lacks modern
> +features. Ext2 does not have a journal, so Hurd users regularly have
> +to deal with filesystem corruption. `fsck` can fix most of the issues
> +(with loss of random data), but without a proper journal the Hurd is
> +currently not a good OS for long-term data storage.
> +
> +Bcachefs is a modern COW (copy-on-write) open source filesystem for
> +Linux, which intends to replace Btrfs and ZFS while having the
> +performance of ext4 or XFS. It is almost 100,000 lines of code;
> +Btrfs is 150,000 lines of code. Bcachefs is structured as a
> +filesystem built on top of a database. There is a clean, small
> +database transaction layer; that core database library is maybe
> +25,000 lines of code.
> +
> +Some Hurd developers recently [[talked with
> +Bcachefs|https://youtube.com/watch?v=bcWsrYvc5Fg]] author Kent
> +Overstreet about porting bcachefs to the Hurd. There are currently no
> +concrete plans to do so due to lack of developer manpower.
> +
> +90% of the Bcachefs filesystem code builds and runs in userspace. It
> +uses a shim layer that maps kernel locking primitives to pthreads,
> +maps the kernel I/O API to AIO, and so on. Bcachefs does intend to
> +eventually rewrite most or all of its current codebase in Rust.
> +
> +Kent is ok with us merging a shim layer for libstore that maps to the
> +Unix filesystem API. That would be a header file that goes into the
> +bcachefs code.
> +
> +There is a somewhat working FUSE port of bcachefs, but Kent is not
> +certain that is a good way to run bcachefs in userspace. Kent wants
> +to use the FUSE port to help with debugging: if bcachefs starts
> +acting up, you could switch to running it in userspace and attach
> +GDB to the running process. This is currently not possible.
> +
> +We could port bcachefs to the Hurd's native filesystem API: libdiskfs.
> +
> +One interesting aspect of the conversation was Kent's goal of re-using
> +kernel code in userspace. The Linux kernel hashtable code is high
> +performance, resizeable, lockless, and builds and runs in userspace.
> +As long as you have liburcu, you can use the kernel hashtable in
> +userspace, which might be useful on the Hurd.
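> +
> +As an aside (not from the conversation), here is a minimal sketch of what
> +using a lock-free, resizable hash table from userspace can look like, using
> +liburcu's own `cds_lfht` API (`urcu/rculfhash.h`) rather than the kernel
> +rhashtable shim Kent described; `struct entry` and `entry_match` are
> +made-up names for the example:
> +
> +    /* Build roughly with: gcc example.c -lurcu -lurcu-cds */
> +    #include <stdlib.h>
> +    #include <urcu.h>             /* RCU read side: rcu_read_lock() etc. */
> +    #include <urcu/rculfhash.h>   /* cds_lfht, the RCU lock-free hash table */
> +
> +    struct entry {
> +        unsigned long key;            /* made-up example payload */
> +        struct cds_lfht_node node;    /* hash table linkage */
> +    };
> +
> +    static int entry_match(struct cds_lfht_node *node, const void *key)
> +    {
> +        struct entry *e = caa_container_of(node, struct entry, node);
> +        return e->key == *(const unsigned long *)key;
> +    }
> +
> +    int main(void)
> +    {
> +        struct cds_lfht *ht;
> +        struct cds_lfht_iter iter;
> +        struct cds_lfht_node *found;
> +        struct entry *e = calloc(1, sizeof(*e));
> +        unsigned long key = 42;
> +
> +        rcu_register_thread();
> +        /* auto-resizing table: initial and minimum size 64 buckets, no maximum */
> +        ht = cds_lfht_new(64, 64, 0, CDS_LFHT_AUTO_RESIZE, NULL);
> +
> +        e->key = key;
> +        cds_lfht_node_init(&e->node);
> +
> +        /* additions and lookups run under the RCU read-side lock */
> +        rcu_read_lock();
> +        cds_lfht_add(ht, key /* use a real hash function here */, &e->node);
> +        cds_lfht_lookup(ht, key, entry_match, &key, &iter);
> +        found = cds_lfht_iter_get_node(&iter);
> +        rcu_read_unlock();
> +
> +        rcu_unregister_thread();
> +        return found ? 0 : 1;
> +    }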
> +
> +Bcachefs is licensed as GPLv2, and many of Kent's previous employers,
> +including Google, own the patents. Kent is ok with potentially making
> +the license GPLv2+, as long as there is no promise to keep bcachefs
> +GPLv2 only.
> +
> +# IRC logs
> +
> +https://logs.guix.gnu.org/hurd/2023-09-26.log
> +
> + <solid_black> maybe I'm wrong though, do you know much about fuse? or > file systems? > + <damo22> no i dont know much about filesystems > + <damo22> what is bcachefs? > + <solid_black> see? :D > + <azert> I agree that someone intimate in the Mach pager api, libdiskfs > and fuse would be great at that meeting > + <solid_black> I do kind of understand Mach VM / paging, I must say > + <solid_black> from the looks of it, I even understand it best among > those who have looked at it recently > + <solid_black> and I mostly understand libdiskfs > + <damo22> so go to the meeting > + <damo22> what is fuse? do we even need it for hurd? > + <damo22> file systems in userspace > + <solid_black> FUSE is "filesystem in user space", it's both the name > for the concept, and the name of Linux's specific mechanism, of offloading fs > to userland > + <damo22> yeah, i think it may be unneeded for filesystem on hurd > + <solid_black> it's basically a giant hack that pretends to be a > fileystem implementation to the rest of the kernel, and then sends requests > and receives responses from a userland program that _actually_ implements the > fs > + <solid_black> on the Hurd, *of course* filesystems are implemented in > userland, that's the only and tnhe natural way everything works > + <solid_black> but that's where the similarities end > + <solid_black> you cannot just take a linux fuse fs, using libfuse, > and run it on the Hurd > + <solid_black> there has been a project make a library that would have > the same API as libfuse, but act as a Hurd translator, specifically to > facilitate porting linux filesystems > + <damo22> i imagine fuse has an api > + <solid_black> last I heard, it was never completed, but who knows > + <solid_black> it has a kerne <->userland protocol and a userspace > library (libfuse) for implementing that protocol, yes > + <damo22> solid_black: you seem to know more about fuse than you admitted > + <solid_black> https://www.gnu.org/software/hurd/hurd/libfuse.html > + <solid_black> I know the basics, around as much as I have just told > you > + <azert> I think that gnucode idea was that this would be the easiest to > port bcachefs to the Hurd, but I doubt it would be the best > + <solid_black> I have also hacked on a C++ fuse fs (darling-dmg), > though I don't think I interacted with the fuse parts of it much > + <azert> Or even the easier > + <solid_black> yeah, I don't think it'd be the best or the easiest one > either > + <damo22> if someone implemented libfuse api and made it as a hurd > translator, surely it would work natively? > + <damo22> <braunr> zacts: the main problem seems to be the > interactions between the fuse file system and virtual memory (including > caching) > + <braunr> something the hurd doesn't excel at > + <braunr> it *may* be possible to find existing userspace implementations > that don't use the system cache (e.g. implement their own) > + <azert> Yes, that’s a possibility that needs to be kept open for > discussion > + <nikolar> Sounds interesting > + <solid_black> youpi: ping > + <youpi> pong > + <solid_black> hello! > + <solid_black> any thoughts on the above discussion? are you going to > participate in the call that's being set up?
> + <youpi> I don't have time for it > + <youpi> (AFAIK the fuse hurd implementation does work to some extent) > + <solid_black> I should at least try out Hurd's fuse before the call, > good idea > + <solid_black> maybe read up on the Linux's fuse > + <solid_black> thoughts on using fuse vs libdiskfs for bcachefs? > + <youpi> using fuse would probably be less work > + <youpi> and it'd probably mean fixing things in libfuse, which can > benefit many other FS anyway > + <solid_black> is it true that the "low level" API of libfuse is > unimplemented and unimplementable? > + <youpi> I don't know what that "low level" API is > + <solid_black> this IIUC > https://github.com/libfuse/libfuse/blob/master/include/fuse_lowlevel.h > + <solid_black> > libfuse offers two APIs: a "high-level", synchronous > API, and a "low-level" asynchronous API. In both cases, incoming requests > from the kernel are passed to the main program using callbacks. When using > the high-level API, the callbacks may work with file names and paths instead > of inodes, and processing of a request finishes when the callback function > returns. When using the low-level API, the callbacks must work with inodes > and responses must be se > + <solid_black> nt explicitly using a separate set of API functions. > + <youpi> where did you read that it'd be unimplementable ? > + <solid_black> > https://git.savannah.gnu.org/cgit/hurd/incubator.git/tree/README?h=libfuse/master > > + <solid_black> > This is simply because it is to specific to the Linux > kernel and (besides that) it is not farly used now. > + <youpi> In case the latter should change in the future, we might want > to re-think about that issue though. > + <solid_black> so, sounds like it's perhaps implementable in theory, > but that'd require additional work and design > + <youpi> see the sentence below... > + <solid_black> the low-level API is what bcachefs uses > + <youpi> well, additional work and design, of course > + <solid_black> seems to, at least, from a quick glance > + <youpi> any async API needs some > + <youpi> but I don't see why it would not be possible > + <youpi> mig precisely supports asynchronous stubs > + <solid_black> bcachefs-tools/cmd_fusermount.c is just 1274 lines, > which inspires some hope > + <solid_black> asynchrony is not the problem, I imagine (but I haven't > looked), but being too tied to Linux might be > + <youpi> it's not really tied, as in it doesn't seem to use > linux-specific functions > + <youpi> but it uses linux-like notions, which indeed need to be > translated to the hurdish notions > + <youpi> but that's not something really tough > + <youpi> just needs to be worked on > + > +https://logs.guix.gnu.org/hurd/2023-09-27.log#103329 > + > + <solid_black> libfuse as shipped as Debian doesn't seem very > + functional, I can't even build a simple program against it: > + 'i386-gnu/libfuse.so: undefined reference to `assert'' > + > + <solid_black> (assert is of course a macro in glibc) > + <solid_black> and it segfaults in fuse_main_real > + <solid_black> lowleve fuse ops do seem to map to netfs concept > nicely, as far as I can see so far > + <solid_black> and (again, so far) I don't see any asynchrony in how > bcachefs uses fuse, i.e. 
they always fuse_reply() inside the method > implementation > + > + <solid_black> but if we had to implement low-level fuse API, this > would be an issue > + <solid_black> because netfs is syncronous > + <solid_black> this is again a place where I don't think netfs is > actually that useful > + <solid_black> libfuse should be its own standalone tranlator library, > a peer to lib{disk,net,triv}fs > + <solid_black> yell at me if you disagree > + <youpi> or perhaps make it use libdiskfs ? > + <youpi> there's significant code in libdiskfs that you'd probably not > want to reimplement in libfuse > + <solid_black> like what? > + <youpi> starting a translator > + <youpi> all the posix semantic bits > + <solid_black> (this is another thing, I don't believe there is a > significant difference that explains libdiskfs and libnetfs being two > separate libraries. but it's too late to merge them, and I'm not an fs dev) > + > + <solid_black> starting a translator is abstracted into libfshelp > specifically so it can be easily reused? > + <solid_black> is libdiskfs synchronous? > + <youpi> I'm just saying things out of my memory > + <solid_black> scratch that, diskfs does not work like that at all > + <youpi> piece of it is in fshelp yes > + <solid_black> it works on pagers, always > + <youpi> but significant pieces are in libdiskfs too > + <youpi> and you are saying you are not an FS person :) > + <youpi> you do know libdiskfs etc. well beyond the average > + <youpi> perhaps not the ext2 FS structure, but that's not really > important here > + <youpi> see e.g. the short-circuits in file-get-trans.c > + <solid_black> I may understand how the Hurd's translator libraries > work, somewhat better than the avergae person :) > + <youpi> and the code around fshelp_fetch_root > + <solid_black> but I don't know about how filesystems are actually > organized, on-disk (beyond the basics that there any inodes and superblocks > and journaled writes and btrees etc) > + <youpi> you don't really need to know more about that > + <solid_black> nor do I know the million little things about how > filesystem code should be written to be robust and performant > + <solid_black> yeah so as I was saying, libdiskfs expects files to be > mappable (diskfs_get_filemap_pager_struct), and then all I/O is implemented > on top of that > + <solid_black> e.g. to read, libdiskfs queries that pager from the > impl, maps it into memory, and copies data from there to the reply message > + <solid_black> I must have mentioned that already, I'd like to rewrite > that code path some day to do less copying > + <solid_black> I imagine this might speed up I/O heavy workloads > + <youpi> ? it doesn't copy into the reply > + <youpi> it transfers map > + <solid_black> it does, let me find the code > + <youpi> in some corner cases yes > + <youpi> but not normal case > + <youpi> https://darnassus.sceen.net/~hurd-web/hurd/io_path/ > + <solid_black> libdiskfs/rdwr-internal.c, it does pager_memcpy, which > is a glorified memcpy + fault handling > + <solid_black> don't trust that wiki page > + <youpi> why not ? > + <youpi> not, pager_memcpy is not just a memcpy > + <youpi> it's using vm_copy whenever it can > + <youpi> i.e. map transfer > + <solid_black> well yes, but doesn't the regular memcpy also attempt > to do that? 
> + <youpi> it happens to do so indeed > + <youpi> but that' doesn't matter: I do mean it's trying *not* copying > + <youpi> by going through the mm > + <youpi> note: if a wiki page is bogus, propose a fix > + <solid_black> I think there was another copy on the path somewhere > (in the server, there's yet another in the client of course), but I can't > quite remember where > + <solid_black> and I wouldn't rely on that vm_copy optimization > + <solid_black> it's may be useful when it working, but we have to > design for there to not be a need to make a copy in the first place > + <solid_black> ah well, pager_read_page does the other copy > + <youpi> when things are not aligned etC. you'll have to do a copy anyway > + <solid_black> but then again, this is all my idle observations, I'm > not an fs person, I haven't done any profiling, and perhaps indeed all these > copies are optimized away with vm_copy > + <youpi> where in pager_read_page do you see a copy? > + <youpi> it should be doing a store_read > + <youpi> passing the pointer to the driver > + <solid_black> ext2fs/pager.c:file_pager_read_page (at line 220 here, > but I haven't pulled in a while) > + <solid_black> it does do a store_read, and that returns a buffer, and > then it may have to copy that into the buffer it's trying to return > + <solid_black> though in the common case hopefully it'll read > everything in a single read op > + <youpi> it's in the new_buf != *buf + offs case > + <youpi> which is not supposed to be the usual case > + <solid_black> but now imagine how much overhead this all is > + <youpi> what? the ifs? > + <solid_black> we're inside io_read, we already have a buffer where we > should put the data into > + <youpi> I have to go give a course, gotta go > + <solid_black> we could just device_read() into there > + <youpi> you also want to use a cache > + <youpi> otherwise it'll be the disk that'll kill yiour performance > + <youpi> so at some point you do have to copy from the cache to the > application > + <youpi> that's unavoidable > + <youpi> or if it's large, you can vm_copy + copy-on-write > + <youpi> but basically, the presence of the cache means you can have to > do copies > + <youpi> and that's far less costly than re-reading from the disk > + <solid_black> why can't you return the cache page directly from > io_read RPC? > + <youpi> that's vm_copy, yes > + <youpi> but then if the app modifies the piece, you have to > copy-on-write > + <youpi> anywauy, really gottago > + <solid_black> that part is handled by Mach > + <solid_black> right, so once you're back: my conclusion from looking > at libfuse is that it should be rewritten, and should not be using netfs (nor > diskfs), but be its own independent translator framework > + <solid_black> and it just sounds like I'm going to be the one who is > going to do it > + <solid_black> and we could indeed use bcachefs as a testbed for the > low level api, and darling-dmg for the high level api > + <solid_black> I installed avfs from Debian (one of the few packages > that depend on libfuse), and sure enough: avfs: symbol lookup error: > /lib/i386-gnu/libfuse.so.1: undefined symbol: assert_perror > + <solid_black> upstream fuse is built with Meson 🤩️ > + <solid_black> I'm wondering whether this would be better done as a > port in the upstream libfuse, or as a Hurd-specific libfuse lookalike that > borrows some code from the upstream one (as now) > + <damo22> solid_black: what is your argument to rewrite a translator > framework for fuse? 
> + <damo22> i dont understand > + <solid_black> hi > + <damo22> hi > + <solid_black> basically, 1. while the concepts of libfuse *lowlevel* > api seem to match that of hurd / netfs, they seem sufficiently different to > not be easily implementable on top of netfs > + <solid_black> particularly, the async-ness of it, while netfs expects > you to do everything synchronously > + <damo22> is that a bug in netfs? > + <solid_black> this could be maybe made to work, by putting the netfs > thread doing the request to sleep on a condition variable that would get > signalled once the answer is provided via the fuse api... but I don't think > that's going to be any nicer than designing for the asynchrony from the start > + <solid_black> it's not a bug, it's just a design decision, most Hurd > tranalators are structured that way > + <damo22> maybe you can rewrite netfs to be asynchronous and replace it > + <solid_black> i.e.: it's rare that translators use MIG_NO_REPLY + > explicit reply, it's much more common to just block the thread > + <solid_black> 2. the current state is not "somewhat working", it's > "clearly broken" > + <damo22> why not start by trying to implement rumpdisk async > + <damo22> and see what parts are missing > + <solid_black> wdym rumpdisk async? > + <damo22> rumpdisk has a todo to make it asynchronous > + <damo22> let me find the stub > + <damo22> * FIXME: > + <damo22> * Long term strategy: > + <damo22> * > + <damo22> * Call rump_sys_aio_read/write and return MIG_NO_REPLY from > + <damo22> * device_read/write, and send the mig reply once the aio > request has > + <damo22> * completed. That way, only the aio request will be kept in > rumpdisk > + <damo22> * memory instead of a whole thread structure. > + <solid_black> ah right, that reminds me: we still don't have proper > mig support for returning errors asynchronously > + <damo22> if the disk driver is not asynchronous, what is the point of > making the filesystem asynchronous? > + <solid_black> the way this works, being asynchronous or not is an > implementatin detail of a server > + <solid_black> it doesn't matter to others, the RPC format is the same > + <solid_black> there's probably not much point in asynchrony for a > real disk fs like bcachefs, which must be why they don't use it and reply > immediately > + <solid_black> but imagine you're implementing an over-the-network fs > with fuse, then you'd want asynchrony > + <damo22> what is your goal here? do you want to fix libfuse? > + <solid_black> I don't know > + <solid_black> I'm preparing for the call with Kent > + <solid_black> but it looks like I'm going to have to rewrite libfuse, > yes > + <damo22> possibly the caching is important > + <damo22> ie, where does it happen > + <solid_black> maybe, yes > + <solid_black> does fuse support mmap? > + <damo22> idk > + <damo22> good q for kent > + <solid_black> one essential fs property is coherence between mmap and > r/w > + <solid_black> so it you change a byte in an mmaped file area, a > read() of that byte after that should already return the new value > + <solid_black> same for write() + read from memory > + <solid_black> this is why libdiskfs insists on reading/writing files > via the pager and not via callbacks > + <solid_black> I wonder how fuse deals with this > + <damo22> good point, no idea > + <solid_black> does fuse really make the kernel handle O_CREAT / > O_EXCL? 
I can't imagine how that would work without racing > + <solid_black> guess it could be done by trying opening/creating in a > loop, if creation itself is atomic, but this is not nice > + <damo22> something is still slowing down smp > + <damo22> it cant possibly be executing as fast as possible on all cores > + <damo22> if more cores are available to run threads, it should boot > faster not slower > + <azert> Hi damo22, your reasoning would hold if the kernel wouldn’t be > “wasting” most of its time running in kernel mode tasks > + <azert> If replacing CPU_NUMBER by a better implementation gave you a > two digits improvement, that kind of implies that the kernel is indeed taking > most of the cpu > + <damo22> yes i mean, something in the kernel is slowing down smp > + <azert> What about vm_map and all thread tasks synchronization > + <azert> ? > + <damo22> i dont understand how the scheduler can halt the APs in > machine_idle() and not end up wasting time > + <damo22> how does anything ever run after HLT > + <damo22> in that code path > + <damo22> if the idle thread halts the processor the only way it can wake > up is with an interrupt > + <damo22> but then, does MARK_CPU_ACTIVE() ever run? > + <damo22> hmm it does > + <azert> I think that normally the cpu would be running scheduler code > and get a thread by itself. > + <damo22> thats not how it works > + <damo22> most of the cpus are in idle_continue > + <damo22> then on a clock interrupt or ast interrupt, they are woken to > choose a thread i think > + <damo22> s/choose/run > + <azert> If they are in cpu_idle then that’s what happens, yea > + <azert> But normally they wouldn’t be in cpu idle but running the > schedule and just a thread on their own > + <azert> Cpu_idle basically turns off the cpu > + <azert> To save power > + <damo22> every time i interrupt the kernel debugger, its in cpu-idle > + <damo22> i dont know if it waits until it is in that state so maybe > thats why > + <azert> That means that there is nothing to schedule > + <azert> Or yea that’s another explanation > + <damo22> yes, exactly i think it is seemingly running out of threads to > schedule > + <azert> A bug in the debugger > + <damo22> i need to print the number of threads in the queue > + <youpi> adding a show subcommand for the scheduler state would probably > be useful > + <youpi> solid_black: btw, about copies, there's a todo in rumpdisk's > rumpdisk_device_read : /* directly write at *data when it is aligned */ > + <solid_black> youpi: indeed, that looks relevant, and wouldn't be > hard to do > + <solid_black> ideally, it should all be zero-copy (or: minimal number > of copies), from the device buffer (DMA? idk how this works, can dma pages be > then used as regular vm pages?) 
all the way to the data a unix process > receives from read() or something like that > + <solid_black> without "slow" memcpies, and ideally with little > vm_copies too, though transferring ages in Mach messages is ok > + <solid_black> s/ages/pages/ > + <solid_black> read() requires ones copy purely because it writes into > the provided buffer (and not returns a new one), and we don't have > mach_msg_overwrite > + <solid_black> though again one would hope vm_copy would help there > + <solid_black> ...I do think that it'd be easier to port bcachefs over > to netfs than to rewrite libfuse though > + <solid_black> but then nothing is going to motivate me to work on > libfuse > + <azert> solid_black: I never work on things that don’t motivate me > somehow > + <azert> Btw, if you want zerocopy for IO, I think you need to do > asynchronous io > + <azert> At least that’s the only way for me to make sense of zerocopy > + <solid_black> I don't think sync vs async has much to do with > zero-copy-ness? w > + > + > -- > 2.42.0 > >
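As an aside, here is a minimal sketch of the condition-variable bridge solid_black mentions above, i.e. letting a synchronous netfs-style callback sleep until an asynchronous fuse_reply()-style completion arrives. It is purely illustrative: the struct and function names are made up, not actual libfuse or libnetfs symbols.

    #include <pthread.h>

    struct sync_bridge {
        pthread_mutex_t lock;
        pthread_cond_t  done;
        int             completed;
        int             error;      /* result delivered by the async side */
    };

    void bridge_init(struct sync_bridge *b)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->done, NULL);
        b->completed = 0;
        b->error = 0;
    }

    /* Async side: called where the filesystem would issue its fuse_reply(). */
    void bridge_complete(struct sync_bridge *b, int error)
    {
        pthread_mutex_lock(&b->lock);
        b->error = error;
        b->completed = 1;
        pthread_cond_signal(&b->done);
        pthread_mutex_unlock(&b->lock);
    }

    /* Sync side: a netfs-style callback submits the request, then sleeps here
       until the asynchronous completion fires. */
    int bridge_wait(struct sync_bridge *b)
    {
        pthread_mutex_lock(&b->lock);
        while (!b->completed)
            pthread_cond_wait(&b->done, &b->lock);
        pthread_mutex_unlock(&b->lock);
        return b->error;
    }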
--
Samuel
---
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Commission.