Hi,
On 12/26/24 18:33, Julien Plissonneau Duquène wrote:
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and
metadata update still have to be written.
I would expect that some reordering makes it possible for fewer actual
physical write operations to happen, i.e. writes to the same or neighbouring
blocks get merged or grouped (possibly by the hardware if not by the kernel),
which would make a difference both for spinning-device performance (fewer
seeks) and for solid-state device longevity (as these have larger physical
blocks), but I don't know if that's actually how it works in this case.
On SSDs, it does not matter, both because modern media lasts longer than
the rest of the computer now, and because the drive's wear-leveling will
largely ignore the logical block addresses when deciding where to put data
on the physical medium anyway.
On hard disks, it absolutely makes a noticeable difference, but so does
journaling.
It would be surprising, though, for the dpkg man pages (among other
places) to talk about performance degradation if it were not real.
ext4's delayed allocation mainly means that the window where the inode
is zero-sized is larger (it can last a few seconds after dpkg exits with
--force-unsafe-io), so the problem is more observable, while on other
file systems you more often get lucky and your files are filled with
the desired data instead of garbage.
Delayed allocation, on the other hand, allows the file system to
merge the entire allocation for the file instead of gradually extending
it (but that can easily be fixed by using fallocate(2)).
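To illustrate, a minimal sketch of that preallocation approach (a hypothetical
helper, not dpkg's actual code; Linux-specific since it calls fallocate(2)
directly):

/* Sketch: reserve the full file size up front so the allocator can grab
 * one contiguous extent instead of growing the file piecewise. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int write_preallocated(const char *path, const char *buf, off_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (fallocate(fd, 0, 0, len) < 0)
        perror("fallocate");    /* not fatal, fall back to plain writes */

    for (off_t done = 0; done < len; ) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += n;
    }
    return close(fd);
}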
[filesystem level transactions]
That sounds interesting. But do we have filesystems on Linux that can
do that already, or is this still a wishlist item? Also worth noting: at
least one well-known implementation in another OS was deprecated [1],
citing complexity and lack of popularity as the reasons for that
decision, and the feature is missing from their next-gen FS. So maybe
it's not that great after all?
It is complex to the extent that it requires the entire file system to
be designed around it, including the file system API -- suddenly you get
things like isolation levels and transaction conflicts that programs
need to be at least vaguely aware of.
It would be easier to do in Linux than in Windows, certainly, because on
Windows, file contents bypass the file system drivers entirely, and
there are additional APIs like transfer offload that would interact
badly with a transactional interface, and that would be sorely missed by
people using a SAN as storage backend.
Anyway, besides --force-unsafe-io, the current toolbox also has:
- volume or FS snapshots, for similar or better safety but not the
automatic performance gains; probably not (yet?) available on most systems
Snapshots only work if there is a way to merge them back afterwards.
What the systemd people are doing with immutable images basically goes
in the direction of snapshots -- you'd unpack the files using "unsafe"
I/O, then finally create an image, fsync() that, and then update the OS
metadata which image to load at boot.
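A sketch of that ordering (unsafe writes first, fsync() the finished image,
then atomically flip the pointer to it). The symlink-based boot selection
here is only an assumption for illustration; real image-based systems
(systemd-sysupdate, OSTree, ...) differ in the details:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Assumes boot_link lives in the current directory, purely for brevity. */
int publish_image(const char *image_path, const char *boot_link)
{
    int fd = open(image_path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0) {                /* 1. image contents become durable */
        close(fd);
        return -1;
    }
    close(fd);

    char tmp_link[4096];
    snprintf(tmp_link, sizeof(tmp_link), "%s.new", boot_link);
    unlink(tmp_link);
    if (symlink(image_path, tmp_link) < 0 ||
        rename(tmp_link, boot_link) < 0)    /* 2. switch the OS metadata */
        return -1;

    int dirfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) < 0) {             /* 3. make the rename itself durable */
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;
}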
- the auto_da_alloc ext4 mount option, which AIUI should do The Right
Thing in dpkg's use case even without the fsync; actual reliability and
performance impact unknown; appears to be set by default on trixie
Yes, that inserts the missing fsync(). :>
I'd expect it to perform a little better than the explicit fsync(),
though, because it does not impose an ordering between files. The
downside is that it also does not force an ordering between the file
system updates and the rewrite of the dpkg status file.
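For reference, the replace-via-rename pattern being discussed looks roughly
like this (illustrative names, trimmed error handling):

/* Write a temporary file, make it durable, then atomically put it in
 * place.  With auto_da_alloc, ext4 forces the delayed-allocation data
 * out around the rename even if the explicit fsync() below is skipped
 * (as with --force-unsafe-io); without either, a crash can leave a
 * zero-length file behind. */
#include <fcntl.h>
#include <unistd.h>

int replace_via_rename(const char *final_path, const char *tmp_path,
                       const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return rename(tmp_path, final_path);
}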
What I could see working in dpkg would be delaying the fsync() call
until right before the rename(), which is in a separate "cleanup" round
of operations anyway for the cases that matter. The difficulty there is
that we'd have to keep the file descriptors open until then, which would
need careful management or a horrible hack so we don't run into the
per-user or system-wide limit on open file descriptors, and recover if
we do.
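A rough sketch of that deferred-fsync idea (hypothetical names; dpkg's real
extraction code is far more involved, and this ignores the fd-limit problem
just mentioned):

#include <fcntl.h>
#include <unistd.h>

struct pending {
    int fd;                   /* kept open since extraction */
    const char *tmp_path;     /* e.g. "usr/bin/foo.dpkg-new" */
    const char *final_path;   /* e.g. "usr/bin/foo" */
};

/* Extraction round: write the data, but defer the fsync(). */
int extract_one(struct pending *p, const void *data, size_t len)
{
    p->fd = open(p->tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (p->fd < 0)
        return -1;
    return write(p->fd, data, len) == (ssize_t)len ? 0 : -1;
}

/* Cleanup round: flush everything right before the renames. */
int cleanup_round(struct pending *files, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (fsync(files[i].fd) < 0)
            return -1;
        close(files[i].fd);
        if (rename(files[i].tmp_path, files[i].final_path) < 0)
            return -1;
    }
    return 0;
}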
- eatmydata
That just neuters fsync().
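For anyone unfamiliar, the trick is an LD_PRELOAD shim along these lines (a
toy reimplementation, not eatmydata's actual source, which also covers
sync(), msync(), O_SYNC handling and more):

/* nosync.c: shadow the sync calls with no-ops. */
int fsync(int fd)     { (void)fd; return 0; }
int fdatasync(int fd) { (void)fd; return 0; }

/* Illustrative build/use:
 *   gcc -shared -fPIC -o libnosync.so nosync.c
 *   LD_PRELOAD=$PWD/libnosync.so dpkg -i something.deb
 */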
- io_uring, which allows asynchronous file operations; implementation
would require significant changes to dpkg; potential performance gains in
dpkg's use case have not been evaluated yet AFAIK, but it looks like the
right solution for that use case.
That would be Linux specific, though.
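For the record, queueing the flushes through liburing would look roughly
like this (a sketch only, with error handling trimmed; not a proposal for
dpkg's actual design):

#include <liburing.h>

int async_fsync_example(int fd)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int res;

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);    /* queue the flush */
    io_uring_submit(&ring);             /* kernel works on it; we could keep
                                           extracting other files here
                                           instead of blocking */

    io_uring_wait_cqe(&ring, &cqe);     /* reap the completion */
    res = cqe->res;                     /* 0 on success, -errno otherwise */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return res;
}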
Nowadays, most machines are unlikely to be subject to power failures at
the worst time:
Yes, but we have more people running nVidia's kernel drivers now, so it
all evens out.
The decision when it is safe to skip fsync() is mostly dependent on
factors that are not visible to the dpkg process, like "will the result
of this operation be packed together into an image afterwards?", so I
doubt there is a good heuristic.
My feeling is that this is becoming less and less relevant though,
because it does not matter with SSDs.
Simon