On Tue, Jul 22, 2025 at 02:06:04PM +0000, Bernd Schubert wrote:
> On 7/21/25 02:53, Stefan Hajnoczi wrote:
> > On Wed, Jul 16, 2025 at 02:38:24PM -0400, Brian Song wrote:
> >> This work provides an initial implementation of fuse-over-io_uring
> >> support for the QEMU FUSE export. According to the fuse-over-io_uring
> >> protocol specification, the userspace side must create the same number
> >> of queues as there are CPUs (nr_cpu), just as the kernel does.
> >> Currently, each queue contains only a single SQE entry, which is used
> >> to validate the correctness of the fuse-over-io_uring functionality.
> >>
> >> All FUSE read and write operations interact with the kernel via I/O
> >> vectors embedded in the SQE entry during submission and CQE fetching.
> >> The req_header and op_payload members of each entry are included as
> >> parts of the I/O vector: req_header carries the FUSE operation header,
> >> and op_payload carries the data payload, such as file attributes in a
> >> getattr reply, file content in a read reply, or the file content
> >> received from the FUSE client in a write request.
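To make the buffer layout concrete, here is a minimal sketch (untested,
using the FuseRingEnt field names from the patch below) of how the
two-element vector is presumably populated before submission:

    /* iov[0] carries the FUSE request header, iov[1] the payload */
    ent->iov[0] = (struct iovec) {
        .iov_base = ent->req_header,
        .iov_len  = sizeof(struct fuse_uring_req_header),
    };
    ent->iov[1] = (struct iovec) {
        .iov_base = ent->op_payload,
        .iov_len  = ent->req_payload_sz,
    };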
> >>
> >> At present, multi-threading support is still incomplete. In addition,
> >> handling connection termination and managing the "drained" state of a
> >> FUSE block export in QEMU remain pending work.
> >>
> >> Suggested-by: Kevin Wolf <kw...@redhat.com>
> >> Suggested-by: Stefan Hajnoczi <stefa...@redhat.com>
> >> Signed-off-by: Brian Song <hibrians...@gmail.com>
> >>
> >> ---
> >>  block/export/fuse.c                  | 423 +++++++++++++++++++++++++--
> >>  docs/tools/qemu-storage-daemon.rst   |  10 +-
> >>  qapi/block-export.json               |   6 +-
> >>  storage-daemon/qemu-storage-daemon.c |   1 +
> >>  util/fdmon-io_uring.c                |   5 +-
> >>  5 files changed, 420 insertions(+), 25 deletions(-)
> > 
> > Here is feedback from a first pass over this patch.
> > 
> >>
> >> diff --git a/block/export/fuse.c b/block/export/fuse.c
> >> index c0ad4696ce..637d36186a 100644
> >> --- a/block/export/fuse.c
> >> +++ b/block/export/fuse.c
> >> @@ -48,6 +48,11 @@
> >>  #include <linux/fs.h>
> >>  #endif
> >>  
> >> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> >> +
> >> +/* room needed in buffer to accommodate header */
> >> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> >> +
> >>  /* Prevent overly long bounce buffer allocations */
> >>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> >>  /*
> >> @@ -64,6 +69,26 @@
> >>  
> >>  typedef struct FuseExport FuseExport;
> >>  
> >> +struct FuseQueue;
> > 
> > Use "typedef struct FuseQueue FuseQueue;" here...
> > 
> >> +
> >> +typedef struct FuseRingEnt {
> >> +    /* back pointer */
> >> +    struct FuseQueue *q;
> > 
> > ...and then this can be "FuseQueue *q;" so that QEMU coding style is
> > followed.
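To spell that out, the declarations would then read:

    typedef struct FuseQueue FuseQueue;

    typedef struct FuseRingEnt {
        /* back pointer */
        FuseQueue *q;
        /* ... remaining fields unchanged ... */
    } FuseRingEnt;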
> > 
> >> +
> >> +    /* commit id of a fuse request */
> >> +    uint64_t req_commit_id;
> >> +
> >> +    /* fuse request header and payload */
> >> +    struct fuse_uring_req_header *req_header;
> >> +    void *op_payload;
> >> +    size_t req_payload_sz;
> >> +
> >> +    /* The vector passed to the kernel */
> >> +    struct iovec iov[2];
> >> +
> >> +    CqeHandler fuse_cqe_handler;
> >> +} FuseRingEnt;
> >> +
> >>  /*
> >>   * One FUSE "queue", representing one FUSE FD from which requests are fetched
> >>   * and processed.  Each queue is tied to an AioContext.
> >> @@ -73,6 +98,7 @@ typedef struct FuseQueue {
> >>  
> >>      AioContext *ctx;
> >>      int fuse_fd;
> >> +    int qid;
> > 
> > Could this go inside #ifdef CONFIG_LINUX_IO_URING? It seems to be
> > specific to FUSE-over-io_uring.
> > 
> >>  
> >>      /*
> >>      * The request buffer must be able to hold a full write, and/or at least
> >> @@ -109,6 +135,17 @@ typedef struct FuseQueue {
> >>       * Free this buffer with qemu_vfree().
> >>       */
> >>      void *spillover_buf;
> >> +
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +    FuseRingEnt ent;
> >> +
> >> +    /*
> >> +     * TODO
> >> +     * Support multi-threaded FUSE over io_uring by using eventfd and allocating
> >> +     * an extra SQE for each thread to be notified when the connection
> >> +     * shuts down.
> >> +     */
> > 
> > eventfd and the extra SQE won't be necessary because
> > aio_bh_schedule_oneshot() can be used to cause threads to execute a
> > function.
> > 
> > (I think this comment effectively says that connection shutdown still
> > needs to be implemented. The implementation details don't matter at this
> > point.)
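As a rough illustration of what I mean (untested sketch;
fuse_uring_shutdown() is a hypothetical name), each queue can be told to
clean up in its own AioContext with a one-shot bottom half:

    static void fuse_uring_shutdown_bh(void *opaque)
    {
        FuseQueue *q = opaque;

        /* Runs in q->ctx: cancel the pending SQE, release buffers, etc. */
    }

    static void fuse_uring_shutdown(FuseExport *exp)
    {
        for (int i = 0; i < exp->num_queues; i++) {
            FuseQueue *q = &exp->queues[i];

            aio_bh_schedule_oneshot(q->ctx, fuse_uring_shutdown_bh, q);
        }
    }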
> > 
> >> +#endif
> >>  } FuseQueue;
> >>  
> >>  /*
> >> @@ -148,6 +185,7 @@ struct FuseExport {
> >>      bool growable;
> >>      /* Whether allow_other was used as a mount option or not */
> >>      bool allow_other;
> >> +    bool is_uring;
> >>  
> >>      mode_t st_mode;
> >>      uid_t st_uid;
> >> @@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> >>      .drained_poll  = fuse_export_drained_poll,
> >>  };
> >>  
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> >> +
> >> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> >> +{
> >> +    CqeHandler *cqe_handler = opaque;
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> > 
> > Passing ent in opaque instead of cqe_handler would simplify this.
> > 
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    fuse_uring_co_process_request(ent);
> >> +
> >> +    fuse_dec_in_flight(exp);
> >> +}
> >> +
> >> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> >> +{
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> >> +    FuseQueue *q = ent->q;
> >> +    Coroutine *co;
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    int err = cqe_handler->cqe.res;
> >> +    if (err != 0) {
> >> +        /* TODO end_conn support */
> >> +
> >> +        /* -ENOTCONN is ok on umount  */
> >> +        if (err != -EINTR && err != -EOPNOTSUPP &&
> >> +            err != -EAGAIN && err != -ENOTCONN) {
> >> +            fuse_export_halt(exp);
> >> +        }
> >> +    } else {
> >> +        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
> >> +                            cqe_handler);
> >> +        /* Decremented by co_fuse_uring_queue_handle_cqes() */
> >> +        fuse_inc_in_flight(q->exp);
> > 
> > Can this be moved inside co_fuse_uring_queue_handle_cqes() to avoid
> > calling inc/dec from different functions? That would make the code
> > easier to understand and more robust against future bugs.
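Combined with passing ent as the opaque pointer, that could look like
(sketch):

    static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
    {
        FuseRingEnt *ent = opaque;
        FuseExport *exp = ent->q->exp;

        /*
         * inc/dec are now paired in one function. The increment still
         * happens before any yield because qemu_coroutine_enter() runs
         * the coroutine body immediately.
         */
        fuse_inc_in_flight(exp);
        fuse_uring_co_process_request(ent);
        fuse_dec_in_flight(exp);
    }

and the caller becomes:

        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
        qemu_coroutine_enter(co);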
> > 
> >> +        qemu_coroutine_enter(co);
> >> +    }
> >> +}
> >> +
> >> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> >> +                    const unsigned int qid,
> >> +                    const unsigned int commit_id)
> >> +{
> >> +    req->qid = qid;
> >> +    req->commit_id = commit_id;
> >> +    req->flags = 0;
> >> +}
> >> +
> >> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
> >> +               __u32 cmd_op)
> >> +{
> >> +    sqe->opcode = IORING_OP_URING_CMD;
> >> +
> >> +    sqe->fd = ent->q->fuse_fd;
> >> +    sqe->rw_flags = 0;
> >> +    sqe->ioprio = 0;
> >> +    sqe->off = 0;
> >> +
> >> +    sqe->cmd_op = cmd_op;
> >> +    sqe->__pad1 = 0;
> >> +}
> >> +
> >> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> >> +{
> >> +    FuseQueue *q = opaque;
> >> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> >> +
> >> +    fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
> >> +
> >> +    sqe->addr = (uint64_t)(q->ent.iov);
> >> +    sqe->len = 2;
> >> +
> >> +    fuse_uring_sqe_set_req_data(req, q->qid, 0);
> >> +}
> >> +
> >> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> >> +{
> >> +    /*
> >> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> >> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
> >> +     * the kernel by default. Also, max_write should not exceed
> >> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> >> +     */
> >> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> >> +
> >> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> >> +        /*
> >> +         * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
> >> +         *       qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
> >> +         */
> >> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> >> +                         + FUSE_BUFFER_HEADER_SIZE;
> >> +    }
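(For reference, with 4 KiB pages this fallback works out to 32 * 4096 +
4096 = 132 KiB per queue buffer.)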
> >> +
> >> +    for (int i = 0; i < exp->num_queues; i++) {
> >> +        FuseQueue *q = &exp->queues[i];
> >> +
> >> +        q->ent.q = q;
> >> +
> >> +        q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
> > 
> > It's probably easier to embed the header as a FuseRingEnt field instead
> > of heap allocating it.
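i.e. something like (sketch):

    typedef struct FuseRingEnt {
        FuseQueue *q;
        uint64_t req_commit_id;
        /* embedded, so no g_malloc0()/g_free() pair is needed */
        struct fuse_uring_req_header req_header;
        void *op_payload;
        size_t req_payload_sz;
        struct iovec iov[2];
        CqeHandler fuse_cqe_handler;
    } FuseRingEnt;

with iov[0].iov_base then pointing at &q->ent.req_header.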
> 
> Hmm well. So we have two additional patches in the DDN branch that I
> haven't had time to upstream yet. These patches allow pinning these
> buffers/pages, and with that the application doing I/O can write directly
> into the buffer - that saves context switches. The initial RFC kernel
> patches used mmapped buffers, and when I had to switch to userspace
> buffers, performance went badly down. I didn't run real benchmarks, just
> xfstests - with mmapped buffers it ran about 3 times faster than legacy
> fuse, and that advantage got lost with normal buffers that get mapped per
> request. Switching to pinned buffers brought back the fast xfstests runs.
> The issue is that the buffer needs to be page aligned - which is why
> libfuse does an extra allocation here.
> In libfuse I should probably make this optional, as pinned buffers will
> mostly only work for root (they need locked memory).
> 
> In principle I need to document these details somewhere; I should
> probably write a blog post or so.

That sounds like something we'd like to try out, although we try to
minimize privileges needed for qemu-storage-daemon.
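On the alignment point: QEMU already has qemu_memalign(), so if we try
pinned buffers on our side, allocating the payload could presumably be as
simple as (sketch):

    size_t page_size = qemu_real_host_page_size();

    /* page-aligned payload buffer, a prerequisite for pinning */
    q->ent.op_payload = qemu_memalign(page_size,
                                      ROUND_UP(bufsize, page_size));
    /* ... freed later with qemu_vfree() ... */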

A question about pinning and resource limits: qemu-storage-daemon uses
O_DIRECT for I/O as non-root without hitting mlock resource limits. Is
there a difference between pin_user_pages(FOLL_PIN) (for direct I/O) and
pin_user_pages(FOLL_PIN | FOLL_LONGTERM) (possibly for
FUSE-over-io_uring pinning) in terms of resource limit behavior?

Thanks,
Stefan
