Hi,

On 12/26/24 18:33, Julien Plissonneau Duquène wrote:

>> This should not make any difference in the number of write operations necessary, and only affect ordering. The data, metadata journal and metadata update still have to be written.

> I would expect that some reordering makes it possible for fewer actual physical write operations to happen, i.e. writes to the same or neighbouring blocks get merged or grouped (possibly by the hardware if not the kernel), which would make a difference both for spinning devices' performance (fewer seeks) and for solid state devices' longevity (as these have larger physical blocks), but I don't know if that's actually how it works in this case.

On SSDs, it does not matter, both because modern media lasts longer than the rest of the computer now, and because the wear-levelling logic largely ignores logical block addresses when deciding where to place data on the physical medium anyway.

On hard disks, it absolutely makes a noticeable difference, but so does journaling.

> It would be surprising, though, for the dpkg man pages (among other places) to talk about performance degradation if it were not real.

ext4's delayed allocations mainly mean that the window during which the inode is zero-sized is larger (it can extend to a few seconds after dpkg exits with --force-unsafe-io), so the problem is more observable, while on other file systems you more often get lucky and your files end up filled with the desired data instead of garbage.

The delayed allocations, on the other hand, allow the file system to merge the entire allocation for the file instead of gradually extending it (but that can easily be fixed by using fallocate(2)).
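
To illustrate the fallocate(2) remark, a minimal sketch (the helper name is made up, this is not dpkg code): reserving the final size up front lets the allocator grab one contiguous extent instead of growing the file piecemeal.

    /* Sketch only: preallocate the final size before streaming the data
     * in, so the allocator can reserve one contiguous extent.
     * Linux-specific. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    /* hypothetical helper, not dpkg code */
    static int preallocate(int fd, off_t final_size)
    {
        /* mode 0: a normal allocation that extends the file to final_size */
        return fallocate(fd, 0, 0, final_size);
    }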

>> [filesystem level transactions]

> That sounds interesting. But do we have filesystems on Linux that can do that already, or is this still a wishlist item? Also worth noting, at least one well-known implementation in another OS was deprecated [1], citing complexity and lack of popularity as the reasons for that decision, and the feature is missing from their next-gen FS. So maybe it's not that great after all?

It is complex to the extent that it requires the entire file system to be designed around it, including the file system API -- suddenly you get things like isolation levels and transaction conflicts that programs need to be at least vaguely aware of.

It would be easier to do in Linux than in Windows, certainly, because on Windows, file contents bypass the file system drivers entirely, and there are additional APIs like transfer offload that would interact badly with a transactional interface, and that would be sorely missed by people using a SAN as storage backend.

> Anyway, in the current toolbox, besides --force-unsafe-io we also have:
> - volume or FS snapshots, for similar or better safety but without the automatic performance gains; probably not (yet?) available on most systems

Snapshots only work if there is a way to merge them back afterwards.

What the systemd people are doing with immutable images basically goes in the direction of snapshots -- you'd unpack the files using "unsafe" I/O, then finally create an image, fsync() that, and then update the OS metadata that records which image to load at boot.

> - the auto_da_alloc ext4 mount option that, AIUI, should do The Right Thing in dpkg's use case even without the fsync; actual reliability and performance impact unknown; appears to be set by default on trixie

Yes, that inserts the missing fsync(). :>

I'd expect it to perform a little better than the explicit fsync() though, because it does not impose an ordering between files. The downside is that it also does not enforce an ordering between the file system updates and the rewrite of the dpkg status file.

What I could see working in dpkg would be delaying the fsync() call until right before the rename(), which is in a separate "cleanup" round of operations anyway for the cases that matter. The difficulty is that we'd have to keep the file descriptor open until then, which would need careful management (or a horrible hack) so that we don't run into the per-user or system-wide limit on open file descriptors, and can recover if we do.
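
Roughly, the shape of that idea (function names made up, not actual dpkg code):

    /* extract phase: write the temporary file but keep the fd open,
     * deliberately skipping the fsync() here */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int extract_file(const char *tmpname, const void *buf, size_t len)
    {
        int fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
        return fd;  /* caller keeps this open until the cleanup round */
    }

    /* cleanup phase: flush and rename in one go, right before commit */
    int commit_file(int fd, const char *tmpname, const char *dest)
    {
        if (fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        return rename(tmpname, dest);
    }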

> - eatmydata

That just neuters fsync().
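
For context, the whole trick is essentially an LD_PRELOAD shim that turns the sync calls into no-ops; a minimal sketch in that spirit (the real libeatmydata covers more entry points):

    /* nosync.c -- sketch of an fsync-neutering preload shim, not the
     * real libeatmydata; build with: cc -shared -fPIC -o nosync.so nosync.c */
    int fsync(int fd)     { (void)fd; return 0; }
    int fdatasync(int fd) { (void)fd; return 0; }

    /* usage: LD_PRELOAD=./nosync.so dpkg -i something.deb */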

> - io_uring, which allows asynchronous file operations; implementing it would require significant changes in dpkg; the potential performance gains in dpkg's use case have not been evaluated yet AFAIK, but it looks like the right solution for that use case.

That would be Linux specific, though.
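
For the curious, an asynchronous fsync() with liburing looks roughly like this (illustrative only, not a proposal for dpkg's I/O layer; link with -luring):

    /* async-fsync.c -- illustrative sketch of queuing an fsync via io_uring */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        int fd = open("/tmp/example", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        /* queue the fsync; other work could be submitted alongside it */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);
        io_uring_submit(&ring);

        /* ...and collect the completion whenever convenient */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("fsync completed with result %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }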

> Nowadays, most machines are unlikely to be subject to power failures at the worst time:

Yes, but we have more people running nVidia's kernel drivers now, so it all evens out.

The decision when it is safe to skip fsync() is mostly dependent on factors that are not visible to the dpkg process, like "will the result of this operation be packed together into an image afterwards?", so I doubt there is a good heuristic.

My feeling is that this is becoming less and less relevant though, because it does not matter with SSDs.

   Simon
