Hello,

Milos Nikic, on Thu, 26 Feb 2026 16:55:41 -0800, wrote:
> For instance when one is compiling Hurd.
> I have measured compilation about 8.5% slower than when the journal is not
> present (and ~12% slower when running ./configure, for instance),

I'm a bit surprised by the overhead, notably for a compilation, which
does not write many files, just a fair amount of data (and data is not
journaled).

Is that because we write e.g. directories' atime more often?

> but 0% slower when mass deleting files

And that one surprises me more, since that's where metadata would be
put under pressure.

> Its a large patch now...let me know if you want it split up.

It's fine as it is: all the plugging into libdiskfs makes sense
alongside the jb implementation.

> diff --git a/ext2fs/dir.c b/ext2fs/dir.c
> index 256f3d36..fd78cd7c 100644
> --- a/ext2fs/dir.c
> +++ b/ext2fs/dir.c
> @@ -241,8 +241,7 @@ diskfs_lookup_hard (struct node *dp, const char *name, enum lookup_type type,
>      }
> 
>    diskfs_set_node_atime (dp);
> -  if (diskfs_synchronous)
> -    diskfs_node_update (dp, 1);
> +  diskfs_node_update (dp, diskfs_synchronous);
> 
>    /* If err is set here, it's ENOENT, and we don't want to
>       think about that as an error yet. */

Do we really want to update the node here? Perhaps that's where the
overhead comes from. With journaling, we don't want to write more
often; we just want to write to the journal before writing to the real
place, to keep coherency.

And similarly in other diskfs_node_update places.

> -/* Sync all allocation information and node NP if diskfs_synchronous. */
> +/* Sync all allocation information and node NP if diskfs_synchronous.
> +   If journaling is active, we just update memory (wait=0) and let the
> +   transaction commit handle durability. */
>  EXT2FS_EI void
>  alloc_sync (struct node *np)
>  {
> -  if (diskfs_synchronous)
> +  if (np)
>      {
> -      if (np)
> -     {
> -       diskfs_node_update (np, 1);
> -       pokel_sync (&diskfs_node_disknode (np)->indir_pokel, 1);
> -     }
> -      diskfs_set_hypermetadata (1, 0);
> +      diskfs_node_update (np, diskfs_synchronous);
> +      pokel_sync (&diskfs_node_disknode (np)->indir_pokel, diskfs_synchronous);

Again, do we really want to trigger the poke?

>      }
> +  diskfs_set_hypermetadata (diskfs_synchronous, 0);

(and the related hypermetadata update)

>  }
>  #endif /* Use extern inlines.  */
>  


> diff --git a/ext2fs/inode.c b/ext2fs/inode.c
> index 8d10af01..b22c1c4f 100644
> --- a/ext2fs/inode.c
> +++ b/ext2fs/inode.c
>  /* Sync the info in NP->dn_stat and any associated format-specific
>     information to disk.  If WAIT is true, then return only after the
>     physicial media has been completely updated.  */
>  void
>  diskfs_write_disknode (struct node *np, int wait)
>  {
> -  struct ext2_inode *di = write_node (np);
> +  struct ext2_inode *di;
> +
> +  if (ext2_journal)
> +    {
> +      di = write_disknode_journaled (np, wait);

You may as well just inline the content of write_disknode_journaled
here; it seems to me you will be able to factor out the write_node
call etc., and it'll be simpler in the end.

Also, are we ignoring the wait flag and just always waiting? Again, I
don't see why we should be doing this. Journaling is not supposed to
make programs wait more, but just make sure that what is on the disk is
coherent.

> +      if (di)
> +        record_global_poke (di);
> +      return;
> +    }
> +
> +  di = write_node (np);
>    if (di)
>      {
>        if (wait)
>          {
> -       sync_global_ptr (di, 1);
> +          sync_global_ptr (di, 1);
>            error_t err = store_sync (store);
> +          /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
>            if (err && err != EOPNOTSUPP)
> -            ext2_warning ("inode flush failed: %s", strerror (err));
> +            ext2_warning ("device flush failed: %s", strerror (err));
>          }
>        else
> -     record_global_poke (di);
> +        {
> +          record_global_poke (di);
> +        }
>      }
>  }


> +/**
> + * Called by the Pager (store_write hook) after writing blocks to the main disk.
> + * Bulk version to handle clustered pageouts efficiently.
> + */
> +void
> +journal_notify_blocks_written (block_t start_block, size_t n_blocks)

Can it happen that we write several times to the same metadata block,
and thus have several transactions in the log for the same block? Then
we don't know whether the pager write notification covers only the
former, or both, and all kinds of mixtures.

This should be ordered in journal.c after the committing function,
since the pager write is supposed to happen after we have committed to
the journal.


> +/* Writes the Descriptor Block + All Data Blocks (Escaped) */
> +static error_t
> +journal_write_payload (journal_t *journal, const diskfs_transaction_t *txn)
> +/* Writes the Commit Block */
> +static error_t
> +journal_write_commit_record (journal_t *journal,
> +                          diskfs_transaction_t *txn, uint32_t commit_loc)
> +/* Cleans up the transaction. */
> +static error_t
> +journal_cleanup_transaction (diskfs_transaction_t *txn, error_t err)

These should be ordered in journal.c along
journal_commit_running_transaction_locked, it's their only caller.

> +/**
> + * Ensures there is a VALID running transaction to attach to.
> + * Returns 0 on success, or error code.
> + */
> +static error_t
> +journal_start_transaction (diskfs_transaction_t **out_txn)
> +{
> +  diskfs_transaction_t *txn;
> +
> +  if (!ext2_journal)
> +    return EINVAL;
> +
> +  JOURNAL_LOCK (ext2_journal);
> +
> +  if (ext2_journal->j_free < ext2_journal->j_min_free)
> +    {
> +      JRNL_LOG_DEBUG
> +     ("[TRX] Journal full (Free: %u). Forcing checkpoint.",
> +      ext2_journal->j_free);
> +
> +      journal_force_checkpoint_locked (ext2_journal);
> +    }
> +
> +  txn = ext2_journal->j_running_transaction;
> +
> +  if (txn)
> +    {
> +      assert_backtrace (txn->t_state == T_RUNNING);
> +      txn->t_updates++;
> +    }
> +  else
> +    {
> +      txn = calloc (1, sizeof (diskfs_transaction_t));
> +      if (!txn)
> +     {
> +       JOURNAL_UNLOCK (ext2_journal);
> +       return ENOMEM;
> +     }
> +
> +      hurd_ihash_init (&txn->t_buffer_map, HURD_IHASH_NO_LOCP);
> +      txn->t_tid = ext2_journal->j_transaction_sequence++;
> +      txn->t_state = T_RUNNING;
> +      txn->t_updates = 1;
> +
> +      ext2_journal->j_running_transaction = txn;
> +      JRNL_LOG_DEBUG ("[TRX] Created NEW TID %u", txn->t_tid);
> +    }
> +
> +  JOURNAL_UNLOCK (ext2_journal);
> +  *out_txn = txn;
> +  return 0;
> +}
> +
> +diskfs_transaction_t *
> +diskfs_journal_start_transaction (void)
> +{
> +  if (ext2_journal)
> +    {
> +      diskfs_transaction_t *real_txn;
> +      error_t err = journal_start_transaction (&real_txn);

Is it really useful to separate the two? You can just return early from
diskfs_journal_start_transaction when !ext2_journal.

> +      if (err)
> +     return NULL;
> +      return real_txn;
> +    }
> +  return NULL;
> +}

> +/**
> + * Called by the pager BEFORE writing blocks to their permanent home.
> + * Tries its best to ensure WAL ordering for a range of blocks.
> + * There are cases in which that is not feasible though! (see below)
> + */
> +void
> +journal_ensure_blocks_journaled (block_t start_block, size_t n_blocks)
> +{
> +  if (!ext2_journal || n_blocks == 0)
> +    return;
> +
> +  JOURNAL_LOCK (ext2_journal);
> +
> +  diskfs_transaction_t *commit = ext2_journal->j_committing_transaction;
> +  diskfs_transaction_t *run = ext2_journal->j_running_transaction;
> +  uint32_t wait_tid = 0;
> +  int force_commit = 0;
> +  int in_committing = 0;
> +
> +  for (size_t i = 0; i < n_blocks; i++)
> +    {
> +      block_t b = start_block + i;
> +
> +      /* If ANY block is in the running transaction, we must force a commit. */
> +      if (run && hurd_ihash_find (&run->t_buffer_map, (hurd_ihash_key_t) b))
> +     {
> +       force_commit = 1;
> +       wait_tid = run->t_tid;
> +       break;
> +     }
> +      /* Keep checking in case a later block is in the RUNNING state. */
> +      else if (commit
> +            && hurd_ihash_find (&commit->t_buffer_map,
> +                                (hurd_ihash_key_t) b))
> +     {
> +       if (wait_tid == 0)
> +         wait_tid = commit->t_tid;
> +       in_committing = 1;
> +     }
> +    }
> +
> +  /* The libpager MUST NOT sleep waiting for a VFS thread or a journal thread.
> +     If we make it sleep, we risk a system-wide deadlock.
> +     On the other hand, if we return EAGAIN or similar, the pager
> +     will just drop the request and the block writes will be lost.
> +     Therefore, if the journal is not instantly ready to commit, we MUST bypass
> +     the WAL barrier and write directly to the main disk. */
> +  if (force_commit)
> +    {
> +      /* If VFS is mutating the transaction, or the disk pipeline is already full,
> +         we cannot commit right now. Bail out and bypass! */
> +      if (run->t_updates > 0 || commit != NULL)
> +     {
> +       JRNL_LOG_DEBUG
> +         ("[WARN] VM Deadlock Hazard! Bypassing WAL for RUNNING TID %u",
> +          wait_tid);

We want more than a debug log, as this is a case where we are failing
to provide the guarantee.

> +       JOURNAL_UNLOCK (ext2_journal);
> +       return;
> +     }
> +
> +      /* It is perfectly safe. The Pager will drive the commit synchronously right now. */
> +      JRNL_LOG_DEBUG ("Pager forcing synchronous commit for TID %u",
> +                   wait_tid);
> +      journal_commit_running_transaction_locked (ext2_journal);
> +      return;
> +    }
> +  else if (in_committing)
> +    {
> +      /* The block is actively being written to the log by another thread.
> +         We cannot wait for it. Bail out and bypass! */
> +      JRNL_LOG_DEBUG
> +     ("[WARN] VM Deadlock Hazard! Bypassing WAL for COMMITTING TID %u",

Ditto.

I'm wondering: can it not happen "often" that we get a pager write
while the journald thread is writing, precisely since the sync and
pager threads are not synchronized? We could rather synchronize them,
making sure to write the journal before syncing the pager.

> +      wait_tid);
> +    }
> +
> +  JOURNAL_UNLOCK (ext2_journal);
> +}

> diff --git a/ext2fs/pager.c b/ext2fs/pager.c
> index 1c795784..8b58ea21 100644
> --- a/ext2fs/pager.c
> +++ b/ext2fs/pager.c
> @@ -25,6 +25,7 @@
>  #include <inttypes.h>
>  #include <hurd/store.h>
>  #include "ext2fs.h"
> +#include "journal.h"
>  
>  /* XXX */
>  #include "../libpager/priv.h"
> @@ -326,6 +327,9 @@ pending_blocks_write (struct pending_blocks *pb)
>  
>        ext2_debug ("writing block %u[%ld]", pb->block, pb->num);
>  
> +      /* Lets make sure these are all already committed. */
> +      journal_ensure_blocks_journaled (pb->block, pb->num);
> +
>        if (pb->offs > 0)
>       /* Put what we're going to write into a page-aligned buffer.  */
>       {
> @@ -336,6 +340,12 @@ pending_blocks_write (struct pending_blocks *pb)
>       }
>        else
>       err = store_write (store, dev_block, pb->buf, length, &amount);
> +
> +      /* Now tell the journal about the partial/full success. */
> +      size_t written_blocks = amount >> log2_block_size;
> +      journal_notify_blocks_written (pb->block, written_blocks);
> +
> +
>        if (err)
>       return err;
>        else if (amount != length)
> @@ -477,14 +487,13 @@ file_pager_write_pages (struct node *node,
>         blk_peek = 0;
>       }
>  
> +      pthread_rwlock_unlock (lock);
>        /* Flush exactly one coalesced run; even if the loop broke early,
>           we may have a valid prefix to push.  */
>        error_t werr = pending_blocks_write (&pb);
>        if (!err)
>       err = werr;
>  
> -      pthread_rwlock_unlock (lock);
> -

This looks like a bold unlock move?

>        /* Advance only by what we actually enumerated and flushed.  */
>        done += built;
>  
> @@ -566,11 +575,11 @@ file_pager_write_page (struct node *node, vm_offset_t offset, void *buf)
>        left -= block_size;
>      }
>  
> +  if (lock)
> +    pthread_rwlock_unlock (lock);
>    if (!err)
>      pending_blocks_write (&pb);
>  
> -  pthread_rwlock_unlock (&diskfs_node_disknode (node)->alloc_lock);
> -

ditto.

>    return err;
>  }
>  
> @@ -674,8 +683,17 @@ disk_pager_write_page (vm_offset_t page, void *buf)
>      }
>    else
>      {
> +      block_t start_block = offset >> log2_block_size;
> +      size_t n_blocks = length >> log2_block_size;
> +      /* Ensure that we don't have these blocks in the currently
> +       * running or committing transaction. This functiono will block
> +       * if it needs to ensure this holds true. */
> +      journal_ensure_blocks_journaled (start_block, n_blocks);
> +
>        err = store_write (store, offset >> store->log2_block_size,
>                        buf, length, &amount);
> +      size_t written_blocks = amount >> log2_block_size;
> +      journal_notify_blocks_written (start_block, written_blocks);
>        if (!err && length != amount)
>       err = EIO;
>      }
> @@ -916,9 +934,12 @@ diskfs_file_update (struct node *node, int wait)
>        ports_port_deref (pager);
>      }
>  
> -  pokel_sync (&diskfs_node_disknode (node)->indir_pokel, wait);
> +  /* If there is a journal present we will not sync metadata immediately
> +     We will let the journal do it when its ready. */
> +  int meta_wait = ext2_journal ? 0 : wait;
> +  pokel_sync (&diskfs_node_disknode (node)->indir_pokel, meta_wait);
>  
> -  diskfs_node_update (node, wait);
> +  diskfs_node_update (node, meta_wait);
>  }
>  
>  /* Invalidate any pager data associated with NODE.  */
> @@ -1569,17 +1590,43 @@ diskfs_shutdown_pager (void)
>        return 0;
>      }
>  
> +  journal_commit_running_transaction ();
>    write_all_disknodes ();
>  
>    ports_bucket_iterate (file_pager_bucket, shutdown_one);
>  
>    /* Sync everything on the the disk pager.  */
>    sync_global (1);
> +  journal_quiesce_checkpoints ();
>    store_sync (store);
>    /* Despite the name of this function, we never actually shutdown the disk
>       pager, just make sure it's synced. */
>  }
>  
> +static error_t
> +journal_sync_one (void *v_p)
> +{
> +  struct pager *p = v_p;
> +  pager_sync (p, 1);
> +  return 0;
> +}
> +
> +/**
> + * Sync all the pagers synchronously, but don't call
> + * journal_commit here. It would deadlock.
> + **/
> +void
> +journal_sync_everything (void)
> +{
> +  write_all_disknodes ();
> +  ports_bucket_iterate (file_pager_bucket, journal_sync_one);
> +  sync_global (1);
> +  error_t err = store_sync (store);
> +  /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
> +  if (err && err != EOPNOTSUPP)
> +    ext2_warning ("device flush failed: %s", strerror (err));
> +}
> +
>  /* Sync all the pagers. */
>  void
>  diskfs_sync_everything (int wait)
> @@ -1591,6 +1638,9 @@ diskfs_sync_everything (int wait)
>        return 0;
>      }
>  
> +  /* We only commit if there is a journal and we have a running transaction */
> +  journal_commit_running_transaction ();
> +
>    write_all_disknodes ();
>    ports_bucket_iterate (file_pager_bucket, sync_one);
>  
> @@ -1599,7 +1649,6 @@ diskfs_sync_everything (int wait)
>    if (wait)
>      {
>        error_t err = store_sync (store);
> -      /* Ignore EOPNOTSUPP (drivers), but warn on real I/O errors */
>        if (err && err != EOPNOTSUPP)
>          ext2_warning ("device flush failed: %s", strerror (err));
>      }
> diff --git a/ext2fs/truncate.c b/ext2fs/truncate.c
> index aa3a5a60..33cd9f7a 100644
> --- a/ext2fs/truncate.c
> +++ b/ext2fs/truncate.c
> @@ -19,6 +19,7 @@
>     Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
>  
>  #include "ext2fs.h"
> +#include "journal.h"
>  
>  #ifdef DONT_CACHE_MEMORY_OBJECTS
>  #define MAY_CACHE 0
> @@ -330,7 +331,7 @@ diskfs_truncate (struct node *node, off_t length)
>        diskfs_node_rdwr (node, (void *)zeroblock, length, block_size - offset,
>                       1, 0, 0);
>        /* Make sure that really happens to avoid leaks.  */
> -      diskfs_file_update (node, 1);
> +      diskfs_file_update (node, diskfs_synchronous);

No, we never want to leave information leaks.

>      }
>  
>    ext2_discard_prealloc (node);
> @@ -385,5 +386,6 @@ diskfs_truncate (struct node *node, off_t length)
>  
>    pthread_rwlock_unlock (&diskfs_node_disknode (node)->alloc_lock);
>  
> +  diskfs_node_update (node, diskfs_synchronous);

We have already done one above. It's an interesting case: we really
want a transaction here: either we have done the truncation, or we
have not, and then we don't have to rely on e2fsck to finish the work.

> @@ -43,24 +45,34 @@ diskfs_init_dir (struct node *dp, struct node *pdp, struct protid *cred)
>    dp->dn_stat.st_nlink++;    /* for `.' */
>    dp->dn_set_ctime = 1;
>    err = diskfs_lookup (dp, ".", CREATE, &foo, ds, &lookupcred);
> +  diskfs_node_update (dp, diskfs_synchronous);
>    assert_backtrace (err == ENOENT);
>    err = diskfs_direnter (dp, ".", dp, ds, cred);
>    if (err)
>      {
>        dp->dn_stat.st_nlink--;
>        dp->dn_set_ctime = 1;
> +      diskfs_node_update (dp, diskfs_synchronous);

Again, no, we don't want to introduce new synchronization points. We
just want to sequence writing to the journal before actually writing
the data. And the same in the other places.

Samuel
