Hi Samuel,

One quick detail I forgot to mention regarding the performance analysis:

The entire ~0.4s performance impact I measured is isolated exclusively to
journal_dirty_block.

To verify this, I ran an experiment where I stubbed out journal_dirty_block
so it just returned immediately (which obviously makes for a very fast, but
not very useful, journal!). With that single function bypassed, the
filesystem performs identically to vanilla Hurd.

This confirms that the background kjournald flusher, the transaction
reference counting, and the checkpointing logic add no noticeable
latency to the VFS. The overhead is strictly tied to the physics of the
memory copying and hashmap lookups in that one function, which we can
improve in subsequent patches.

Thanks, Milos

On Fri, Mar 6, 2026 at 10:55 AM Milos Nikic <[email protected]> wrote:

> Hi Samuel,
>
> Thanks for reviewing my mental model on V1; I appreciate the detailed
> feedback.
>
> Attached is the v2 patch. Here is a breakdown of the architectural changes
> and refactors based on your review:
>
> 1. diskfs_node_update and the Pager
> Regarding the question, "Do we really want to update the node?": Yes, we
> must update it with every change. JBD2 works strictly at the physical block
> level, not the abstract node cache level. To capture a node change in the
> journal, the block content must be physically serialized to the transaction
> buffer. Currently, this path is diskfs_node_update -> diskfs_write_disknode
> -> journal_dirty_block.
> When wait is 0, this just copies the node details from the node-cache to
> the pager. It is strictly an in-memory serialization and is extremely fast.
> I have updated the documentation for diskfs_node_update to explicitly
> describe this behavior so future maintainers understand it isn't triggering
> synchronous disk I/O and doesn't measurably increase the latency of the
> file system.
> journal_dirty_block is not one of the most hammered functions in
> libdiskfs/ext2; more on that below.
>
> 2. Synchronous Wait & Factorization
> I completely agree with your factorization advice:
> write_disknode_journaled has been folded directly into
> diskfs_write_disknode, making it much cleaner.
> Regarding the wait flag: we are no longer ignoring it! Instead of blocking
> the VFS deeply in the stack, we now set an "IOU" flag on the transaction.
> This bubbles the sync requirement up to the outer RPC layer, which is the
> only place safe enough to actually sleep on the commit and thus maintain
> the POSIX sync requirement without deadlocking etc.
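> In sketch form (using the accessors described under point 4 below), the
> IOU pattern is:
>
>   /* Deep in the stack, where sleeping on the commit would deadlock: */
>   if (wait)
>     diskfs_journal_set_sync (txn);
>
>   /* At the outer RPC layer, once all locks are dropped: */
>   if (diskfs_journal_needs_sync (txn))
>     diskfs_journal_commit_transaction (txn);
>   else
>     diskfs_journal_stop_transaction (txn);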
>
> 3. Multiple Writes to the Same Metadata Block
> "Can it happen that we write several times to the same metadata block?"
> Yes, multiple nodes can live in the same block. However, because the Mach
> pager always flushes the "latest snapshot" of the block, we don't have an
> issue with mixed or stale data hitting the disk.
> If RPCs arrive while the pager is actively writing, those changes are
> all captured in the "running transaction". If that running transaction
> happens to contain the same blocks the pager is currently committing,
> the running transaction is forcibly committed.
>
> 4. The New libdiskfs API
> I added two new opaque accessors to diskfs.h:
>
>     diskfs_journal_set_sync
>     diskfs_journal_needs_sync
>
>     This allows inner nested functions to declare a strict need for a
> POSIX sync without causing lock inversions. We only commit at the top RPC
> layer once the operation is fully complete and locks are dropped.
>
> 5. Cleanups & Ordering
>     Removed the redundant record_global_poke calls.
>     Reordered the pager write notification in journal.c to sit after the
> committing function, as the pager write happens after the journal commit.
>     Merged the ext2_journal checks inside diskfs_journal_start_transaction
> to return early.
>     Reverted the bold unlock moves.
>     Fixed the information leaks.
>     Elevated the deadlock/WAL bypass logs to ext2_warning.
>
> Performance:
> I investigated the ~0.4s (increase from 4.9s to 5.3s) regression on my SSD
> during a heavy Hurd ../configure test. By stubbing out journal_dirty_block,
> performance returned to vanilla Hurd speeds, isolating the overhead to that
> specific function.
>
> A nanosecond profile reveals the cost is evenly split across the mandatory
> physics of a block journal:
>
>     25%: Lock Contention (Global transaction serialization)
>
>     22%: Memcpy (Shadowing the 4KB blocks)
>
>     21%: Hash Find (hurd_ihash lookups for block deduplication)
>
> I was surprised to see hurd_ihash taking up nearly a quarter of the
> overhead. I added some collision mitigation, but left further
> improvements out of this patch to keep the scope tight. In the future,
> we could drop the malloc entirely using a slab allocator and optimize
> the hashmap to get this overhead closer to zero (along with introducing
> a "frozen data" concept like Linux does, but that would be a bigger,
> non-localized change).
>
> Final Note on Lock Hierarchy
> The intended, deadlock-free use of the journal in libdiskfs is best
> illustrated by the CHANGE_NODE_FIELD macro in libdiskfs/priv.h:
>   txn = diskfs_journal_start_transaction ();
>   pthread_mutex_lock (&np->lock);
>   (OPERATION);
>   diskfs_node_update (np, diskfs_synchronous);
>   pthread_mutex_unlock (&np->lock);
>   if (diskfs_synchronous || diskfs_journal_needs_sync (txn))
>     diskfs_journal_commit_transaction (txn);
>   else
>     diskfs_journal_stop_transaction (txn);
>
> By keeping journal operations strictly outside the node
> locking/unlocking phases, we treat the journal as the outermost "lock"
> on the file system, so the fixed lock ordering itself rules out
> deadlocks.
>
> Kind regards,
> Milos
>
>
>
> On Thu, Mar 5, 2026 at 12:41 PM Samuel Thibault <[email protected]>
> wrote:
>
>> Hello,
>>
>> Milos Nikic, le jeu. 05 mars 2026 09:31:26 -0800, a ecrit:
>> > Hurd VFS works in 3 layers:
>> >
>> >  1. Node cache layer: The abstract node lives here and it is the ground
>> truth
>> >     of a running file system. When one does a stat myfile.txt, we get
>> the
>> >     information straight from the cache. When we create a new file, it
>> gets
>> >     placed in the cache, etc.
>> >
>> >  2. Pager layer: This is where nodes are serialized into the actual
>> physical
>> >     representation (4KB blocks) that will later be written to disk.
>> >
>> >  3. Hard drive: The physical storage that receives the bytes from the
>> pager.
>> >
>> > During normal operations (not a sync mount, fsync, etc.), the VFS
>> operates
>> > almost entirely on Layer 1: The Node cache layer. This is why it's
>> super fast.
>> > User changed atime? No problem. It just fetches a node from the node
>> cache
>> > (hash table lookup, amortized to O(1)) and updates the struct in
>> memory. And
>> > that is it.
>>
>> Yes, so that we get as efficient as possible.
>>
>> > Only when the sync interval hits (every 30 seconds by default) does the
>> Node
>> > cache get iterated and serialized to the pager layer
>> (diskfs_sync_everything ->
>> >  write_all_disknodes -> write_node -> pager_sync). So basically, at that
>> > moment, we create a snapshot of the state of the node cache and place
>> it onto
>> > the pager(s).
>>
>> It's not exactly a snapshot because the coherency between inodes and
>> data is not completely enforced (we write all disknodes before asking
>> the kernel to write back dirty pages, and then poke the writes).
>>
>> > Even then, pager_sync is called with wait = 0. It is handed to the
>> pager, which
>> > sends it to Mach. At some later time (seconds or so later), Mach sends
>> it back
>> > to the ext2 pager, which finally issues store_write to write it to
>> Layer 3 (The
>> > Hard drive). And even that depends on how the driver reorders or delays
>> it.
>> >
>> > The effect of this architecture is that when store_write is finally
>> called, the
>> > absolute latest version of the node cache snapshot is what gets written
>> to the
>> > storage. Is this basically correct?
>>
>> It seems to be so indeed.
>>
>> > Are there any edge cases or mechanics that are wrong in this model
>> > that would make us receive a "stale" node cache snapshot?
>>
>> Well, it can be "stale" if another RPC hasn't called
>> diskfs_node_update() yet, but that's what "safe" FS are all about: not
>> actually providing more than coherency of the content on the disk, so
>> fsck is not supposed to be needed. Then, if a program really wants
>> coherency between some files etc., it has to issue sync calls; dpkg
>> does it for instance.
>>
>> Samuel
>>
>
