[Qemu-devel] Re: Caching modes

Christoph Hellwig Tue, 21 Sep 2010 07:27:52 -0700

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should 
> result in the data being visible without any additional metadata 
> transactions.


Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:


       O_DIRECT (Since Linux 2.4.10)
              Try  to minimize cache effects of the I/O to and from this file.
              In general this will degrade performance, but it  is  useful  in
              special  situations,  such  as  when  applications  do their own
              caching.  File I/O is done directly to/from user space  buffers.
              The O_DIRECT flag on its own makes at an effort to transfer data
              synchronously, but does not give the guarantees  of  the  O_SYNC
              that  data and necessary metadata are transferred.  To guarantee
              synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
              See NOTES below for further discussion.

              A  semantically  similar  (but  deprecated)  interface for block
              devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity, it just tells the
filesystem it *should* not use the pagecache.  Even if it should not
various filesystem have fallbacks to buffered I/O for corner cases.
It does *not* mean the actual disk cache gets flushed, and it *does*
not guarantee anything about metadata which is very important.

Metadata updates happen when filling sparse file, when extening the file
size, when using a COW filesystem, and when converting preallocated to
fully allocated extents in practice and could happen in many more cases
depending on the filesystem implementation.

> >Barriers are a Linux-specific implementation details that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
> 
> If you have a volatile disk write cache, then we don't need O_DSYNC 
> semantics.

If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYC with fdatasync and it will still be correct.  O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.

> >   We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
> >   
> 
> The thing I struggle with understanding is that if the guest is sending 
> us a write request, why are we sending the underlying disk a write + 
> flush request?  That doesn't seem logical at all to me.

We only send a cache flush request *iff* we present the guest a device
without a volatile write cache so that it can assume all writes are
stable and we sit on a device that does have a volatile write cache.

> Even if we advertise WC disable, it should be up to the guest to decide 
> when to issue flushes.

No.  If we don't claim to have a volatile cache no guest will ever flush
the cache.  Which is just logially given that we just told it that we
don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of it's developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
> >   
> 
> But why are we issuing more flushes than the guest is issuing if we 
> don't have to worry about filesystem metadata (i.e. preallocated storage 
> or physical devices)?

Who is "we" and what is workload/filesystem/kernel combination?
Specific details and numbers please.

[Qemu-devel] Re: Caching modes

Reply via email to