On 01.04.25 22:24, Eric Blake wrote:
On Tue, Mar 25, 2025 at 05:06:55PM +0100, Hanna Czenczek wrote:
We probably want to support larger write sizes than just 4k; 64k seems
nice.  However, we cannot read partial requests from the FUSE FD, we
always have to read requests in full; so our read buffer must be large
enough to accommodate potential 64k writes if we want to support that.

Always allocating FuseRequest objects with 64k buffers in them seems
wasteful, though.  But we can get around the issue by splitting the
buffer into two and using readv(): One part will hold all normal (up to
4k) write requests and all other requests, and a second part (the
"spill-over buffer") will be used only for larger write requests.  Each
FuseQueue has its own spill-over buffer, and only if we find it used
when reading a request will we move its ownership into the FuseRequest
object and allocate a new spill-over buffer for the queue.

This way, we get to support "large" write sizes without having to
allocate big buffers when they aren't used.

Also, this even reduces the size of the FuseRequest objects because the
read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
the requests we support are not quite so large (except for >4k writes),
so until now, we basically had to have useless padding in there.

With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
is easily met and we can decrease the size of the buffer portion that is
right inside of FuseRequest.
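(Put differently, the two read parts together must still satisfy the kernel's
minimum; a hypothetical compile-time check, with SPILLOVER_BUF_SIZE as a
placeholder name:)

QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) + SPILLOVER_BUF_SIZE <
                  FUSE_MIN_READ_BUFFER);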

As for benchmarks, the benefit of this patch can be shown easily by
writing a 4G image (with qemu-img convert) to a FUSE export:
- Before this patch: Takes 25.6 s (14.4 s with -t none)
- After this patch: Takes 4.5 s (5.5 s with -t none)

Hmm - 64k is still small for some tasks; with qcow2, you can have
clusters up to 2M, and writing a cluster at a time seems to be a
reasonable desire.

What do you mean specifically?  Do you think it’s reasonable from a performance point of view, or for potential atomicity, or...?

Or in other storage areas, nbdcopy (from libnbd)
defaults to 256k and allows up to 32M, or we can even look at lvm that
defaults to extent size of 4M (although with lvm thin volumes,
behavior is more like qcow2 with subclusters in that you don't have to
allocate the entire extent at once).

The problem I see is that other back-ends can probably read the request header first, figure out how much memory they need, and then allocate the buffer.  That doesn’t work with FUSE: you have to read the whole request at once.
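(To illustrate the contrast, a sketch; read_exact() and payload_length() are hypothetical helpers, not real functions:)

/* A stream-based back-end (e.g. NBD) can size its buffer after the header: */
read_exact(sock_fd, &hdr, sizeof(hdr));
uint32_t len = payload_length(&hdr);
void *buf = g_malloc(len);
read_exact(sock_fd, buf, len);

/*
 * A read from the FUSE FD instead returns exactly one complete request, so
 * the buffer passed in must already be big enough for the largest request
 * we are willing to accept:
 */
ssize_t ret = read(fuse_fd, big_buf, big_buf_size);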

Maybe it makes sense to have this as a hierarchical spillover (first
level spills over to 64k, second level spills over to 2M or whatever
ACTUAL maximum you are willing to support)?

Sounds quite complicated, so the question is what we’d want this for and whether it’d be worth it.

I also wonder whether rate-limiting should come into play at some point.  FUSE doesn’t go over a network, so a user could submit the same 2M buffer many times and thus cause big memory allocations in qemu(-storage-daemon) without having to allocate much memory themselves...


Or you can feel free to ignore my comments on this patch - allowing
64k instead of 4k is ALREADY a nice change.

@@ -501,7 +547,20 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
          goto no_request;
      }
-    ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
+    /*
+     * If handling the last request consumed the spill-over buffer, allocate a
+     * new one.  Align it to the block device's alignment, which admittedly is
+     * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.

What are typical block device minimum alignments?  If 4k is typical,
then it should be easy to make FUSE_IN_PLACE_WRITE_BYTES be 4k.

Yes, but no.  We’d have to allocate request_buf such that for WRITE requests, the data in it (i.e. the part after the header) would be aligned.  We cannot split the buffer in two (one part for the FUSE headers, one part for the data), because that might break up non-WRITE requests, which would make them hard to handle.

Locally, I have a version of this series that does introduce a function to allow allocating buffers such that an offset within them is aligned; but in the end I didn’t see much of a performance difference, so I decided to send this simpler version instead.
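(Roughly what such a helper could look like, assuming glib’s g_malloc(); this is just a sketch with made-up names, not the code from that local version:)

/*
 * Allocate @size usable bytes such that (result + @offset) is aligned to
 * @align; the caller frees *free_ptr, not the returned pointer.
 */
static void *alloc_offset_aligned(size_t size, size_t align, size_t offset,
                                  void **free_ptr)
{
    uint8_t *base = g_malloc(size + align);
    uintptr_t misalign = ((uintptr_t)base + offset) % align;
    size_t shift = misalign ? align - misalign : 0;

    *free_ptr = base;
    return base + shift; /* (returned pointer + offset) % align == 0 */
}

For WRITE requests, @offset would be the combined size of the FUSE headers that precede the data.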

Hanna

If
64k is typical, we're already doomed at being aligned.  Is being
unaligned going to cause the block layer to add bounce buffers on us,
if the block layer has a larger alignment than what we use here?

@@ -915,17 +983,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
  }
/**
- * Handle client writes to the exported image.  @buf has the data to be written
- * and will be copied to a bounce buffer before yielding for the first time.
+ * Handle client writes to the exported image.  @in_place_buf has the first
+ * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
+ * contains the rest (if any; NULL otherwise).
+ * Data in @in_place_buf is assumed to be overwritten after yielding, so will
+ * be copied to a bounce buffer beforehand.  @spillover_buf in contrast is
+ * assumed to be exclusively owned and will be used as-is.

Makes sense, although it leads to some interesting bookkeeping.  (I'm
wondering if nbd would benefit from this split-level buffering, since
it supports reads and writes up to 32M)
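(For reference, a rough sketch of how the write path might stitch the two buffers together; simplified and hypothetical, not the actual patch code:)

size_t in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
void *copied = blk_blockalign(exp->common.blk, in_place_size);
QEMUIOVector qiov;
int ret;

/* @in_place_buf will be reused after yielding, so bounce it first */
memcpy(copied, in_place_buf, in_place_size);

qemu_iovec_init(&qiov, 2);
qemu_iovec_add(&qiov, copied, in_place_size);
if (size > in_place_size) {
    /* @spillover_buf is exclusively owned, so it can be used as-is */
    qemu_iovec_add(&qiov, spillover_buf, size - in_place_size);
}

ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);

qemu_iovec_destroy(&qiov);
qemu_vfree(copied);
g_free(spillover_buf); /* assuming the handler owns and frees it here */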


