On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:

       O_DIRECT (Since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it
              is useful in special situations, such as when
              applications do their own caching.  File I/O is done
              directly to/from user space buffers.  The O_DIRECT flag
              on its own makes an effort to transfer data
              synchronously, but does not give the guarantees of the
              O_SYNC that data and necessary metadata are transferred.
              To guarantee synchronous I/O the O_SYNC must be used in
              addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for
              block devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity, it just tells
the filesystem it *should* not use the pagecache.  Even so, various
filesystems fall back to buffered I/O for corner cases.  It does *not*
mean the actual disk cache gets flushed, and it does *not* guarantee
anything about metadata, which is very important.  In practice,
metadata updates happen when filling a sparse file, when extending the
file size, when using a COW filesystem, and when converting
preallocated extents to fully allocated ones, and they could happen in
many more cases depending on the filesystem implementation.

> >Barriers are a Linux-specific implementation detail that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.

If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.
But for the statement above you can replace O_DSYNC with fdatasync and
it will still be correct.  O_DSYNC in current Linux kernels is nothing
but an implicit range fdatasync after each write.

> > We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
> >
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request?  That doesn't seem logical at all to me.

We only send a cache flush request *iff* we present the guest a device
without a volatile write cache, so that it can assume all writes are
stable, and we sit on a device that does have a volatile write cache.

> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.

No.  If we don't claim to have a volatile cache no guest will ever
flush the cache.  Which is only logical, given that we just told it
that we don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of its developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
> >
>
> But why are we issuing more flushes than the guest is issuing if we
> don't have to worry about filesystem metadata (i.e. preallocated storage
> or physical devices)?

Who is "we" and what is the workload/filesystem/kernel combination?
Specific details and numbers please.
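
To make the flush rule above concrete, here is a minimal sketch in
Python (chosen only for brevity).  It is not QEMU code:
needs_host_flush() and guest_write() are hypothetical names, and the
two boolean parameters are assumptions standing in for the cache mode
presented to the guest and the write cache state of the host device.
It assumes a Unix host, and falls back to os.fsync where os.fdatasync
is not available.

```python
import os

# Use fdatasync where the platform provides it; otherwise fall back to fsync.
_flush = getattr(os, "fdatasync", os.fsync)


def needs_host_flush(guest_sees_volatile_cache: bool,
                     host_has_volatile_cache: bool) -> bool:
    """Return True iff the host must flush after every guest write.

    A guest that was told the device has no volatile write cache will
    never send a flush, so the host has to make each write stable
    itself, but only when the device underneath really does have a
    volatile cache.
    """
    return (not guest_sees_volatile_cache) and host_has_volatile_cache


def guest_write(fd: int, data: bytes, offset: int,
                guest_sees_volatile_cache: bool,
                host_has_volatile_cache: bool) -> bool:
    """Write data at offset and flush if the policy asks for it.

    Returns True if a flush was issued after the write.
    """
    written = 0
    while written < len(data):
        # pwrite may write fewer bytes than requested, so loop.
        written += os.pwrite(fd, data[written:], offset + written)

    if needs_host_flush(guest_sees_volatile_cache, host_has_volatile_cache):
        # Whole-file flush; the kernel's O_DSYNC path only needs to
        # cover the range that was just written.
        _flush(fd)
        return True
    return False
```

Only the combination "guest told there is no volatile cache, host
device has one" triggers the write + flush pattern.  In the other three
cases the flush is either left to the guest or has nothing to do.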