On 7/21/25 02:53, Stefan Hajnoczi wrote:
> On Wed, Jul 16, 2025 at 02:38:24PM -0400, Brian Song wrote:
>> This work provides an initial implementation of fuse-over-io_uring
>> support for QEMU export. According to the fuse-over-io_uring protocol
>> specification, the userspace side must create the same number of queues
>> as the number of CPUs (nr_cpu), just like the kernel. Currently, each
>> queue contains only a single SQE entry, which is used to validate the
>> correctness of the fuse-over-io_uring functionality.
>>
>> All FUSE read and write operations interact with the kernel via io
>> vectors embedded in the SQE entry during submission and CQE fetching.
>> The req_header and op_payload members of each entry are included as
>> parts of the io vector: req_header carries the FUSE operation header,
>> and op_payload carries the data payload, such as file attributes in a
>> getattr reply, file content in a read reply, or file content being
>> written to the FUSE client in a write operation.
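> 
> (Not part of the patch text, but to illustrate: the two vector entries
> described here end up roughly as follows, using the FuseRingEnt fields
> introduced below:
> 
>     ent->iov[0] = (struct iovec) {
>         .iov_base = ent->req_header,
>         .iov_len  = sizeof(struct fuse_uring_req_header),
>     };
>     ent->iov[1] = (struct iovec) {
>         .iov_base = ent->op_payload,
>         .iov_len  = ent->req_payload_sz,
>     };
> )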
>>
>> At present, multi-threading support is still incomplete. In addition,
>> handling connection termination and managing the "drained" state of a
>> FUSE block export in QEMU remain as pending work.
>>
>> Suggested-by: Kevin Wolf <kw...@redhat.com>
>> Suggested-by: Stefan Hajnoczi <stefa...@redhat.com>
>> Signed-off-by: Brian Song <hibrians...@gmail.com>
>>
>> ---
>>  block/export/fuse.c                  | 423 +++++++++++++++++++++++++--
>>  docs/tools/qemu-storage-daemon.rst   |  10 +-
>>  qapi/block-export.json               |   6 +-
>>  storage-daemon/qemu-storage-daemon.c |   1 +
>>  util/fdmon-io_uring.c                |   5 +-
>>  5 files changed, 420 insertions(+), 25 deletions(-)
> 
> Here is feedback from a first pass over this patch.
> 
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index c0ad4696ce..637d36186a 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -48,6 +48,11 @@
>>  #include <linux/fs.h>
>>  #endif
>>  
>> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
>> +
>> +/* room needed in buffer to accommodate header */
>> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
>> +
>>  /* Prevent overly long bounce buffer allocations */
>>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
>>  /*
>> @@ -64,6 +69,26 @@
>>  
>>  typedef struct FuseExport FuseExport;
>>  
>> +struct FuseQueue;
> 
> Use "typedef struct FuseQueue FuseQueue;" here...
> 
>> +
>> +typedef struct FuseRingEnt {
>> +    /* back pointer */
>> +    struct FuseQueue *q;
> 
> ...and then this can be "FuseQueue *q;" so that QEMU coding style is
> followed.
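> 
> I.e., roughly:
> 
>     typedef struct FuseQueue FuseQueue;
> 
>     typedef struct FuseRingEnt {
>         /* back pointer */
>         FuseQueue *q;
>         ...
>     } FuseRingEnt;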
> 
>> +
>> +    /* commit id of a fuse request */
>> +    uint64_t req_commit_id;
>> +
>> +    /* fuse request header and payload */
>> +    struct fuse_uring_req_header *req_header;
>> +    void *op_payload;
>> +    size_t req_payload_sz;
>> +
>> +    /* The vector passed to the kernel */
>> +    struct iovec iov[2];
>> +
>> +    CqeHandler fuse_cqe_handler;
>> +} FuseRingEnt;
>> +
>>  /*
>>   * One FUSE "queue", representing one FUSE FD from which requests are 
>> fetched
>>   * and processed.  Each queue is tied to an AioContext.
>> @@ -73,6 +98,7 @@ typedef struct FuseQueue {
>>  
>>      AioContext *ctx;
>>      int fuse_fd;
>> +    int qid;
> 
> Could this go inside #ifdef CONFIG_LINUX_IO_URING? It seems to be
> specific to FUSE-over-io_uring.
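> 
> E.g.:
> 
>     #ifdef CONFIG_LINUX_IO_URING
>         int qid;
>     #endif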
> 
>>  
>>      /*
>>       * The request buffer must be able to hold a full write, and/or at least
>> @@ -109,6 +135,17 @@ typedef struct FuseQueue {
>>       * Free this buffer with qemu_vfree().
>>       */
>>      void *spillover_buf;
>> +
>> +#ifdef CONFIG_LINUX_IO_URING
>> +    FuseRingEnt ent;
>> +
>> +    /*
>> +     * TODO
>> +     * Support multi-threaded FUSE over io_uring by using eventfd and allocating
>> +     * an extra SQE for each thread to be notified when the connection
>> +     * shuts down.
>> +     */
> 
> eventfd and the extra SQE won't be necessary because
> aio_bh_schedule_oneshot() can be used to cause threads to execute a
> function.
> 
> (I think this comment effectively says that connection shutdown still
> needs to be implemented. The implementation details don't matter at this
> point.)
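> 
> A rough sketch of that approach (fuse_uring_shutdown_bh is a made-up
> name, not something in this patch):
> 
>     static void fuse_uring_shutdown_bh(void *opaque)
>     {
>         FuseQueue *q = opaque;
> 
>         /* runs in the queue's AioContext; tear down io_uring state here */
>     }
> 
>     /* for each queue, from wherever shutdown is driven: */
>     aio_bh_schedule_oneshot(q->ctx, fuse_uring_shutdown_bh, q);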
> 
>> +#endif
>>  } FuseQueue;
>>  
>>  /*
>> @@ -148,6 +185,7 @@ struct FuseExport {
>>      bool growable;
>>      /* Whether allow_other was used as a mount option or not */
>>      bool allow_other;
>> +    bool is_uring;
>>  
>>      mode_t st_mode;
>>      uid_t st_uid;
>> @@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
>>      .drained_poll  = fuse_export_drained_poll,
>>  };
>>  
>> +#ifdef CONFIG_LINUX_IO_URING
>> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
>> +
>> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
>> +{
>> +    CqeHandler *cqe_handler = opaque;
>> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
> 
> Passing ent in opaque instead of cqe_handler would simplify this.
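> 
> I.e. something like:
> 
>     static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
>     {
>         FuseRingEnt *ent = opaque;
>         ...
>     }
> 
>     co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);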
> 
>> +    FuseExport *exp = ent->q->exp;
>> +
>> +    fuse_uring_co_process_request(ent);
>> +
>> +    fuse_dec_in_flight(exp);
>> +}
>> +
>> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
>> +{
>> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt, fuse_cqe_handler);
>> +    FuseQueue *q = ent->q;
>> +    Coroutine *co;
>> +    FuseExport *exp = ent->q->exp;
>> +
>> +    int err = cqe_handler->cqe.res;
>> +    if (err != 0) {
>> +        /* TODO end_conn support */
>> +
>> +        /* -ENOTCONN is ok on umount  */
>> +        if (err != -EINTR && err != -EOPNOTSUPP &&
>> +            err != -EAGAIN && err != -ENOTCONN) {
>> +            fuse_export_halt(exp);
>> +        }
>> +    } else {
>> +        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
>> +                            cqe_handler);
>> +        /* Decremented by co_fuse_uring_queue_handle_cqes() */
>> +        fuse_inc_in_flight(q->exp);
> 
> Can this be moved inside co_fuse_uring_queue_handle_cqes() to avoid
> calling inc/dec from different functions? That would make the code
> easier to understand and more robust against future bugs.
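> 
> Something along these lines (assuming the reference can just as well be
> taken at coroutine entry):
> 
>     static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
>     {
>         FuseRingEnt *ent = opaque;
>         FuseExport *exp = ent->q->exp;
> 
>         fuse_inc_in_flight(exp);
>         fuse_uring_co_process_request(ent);
>         fuse_dec_in_flight(exp);
>     }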
> 
>> +        qemu_coroutine_enter(co);
>> +    }
>> +}
>> +
>> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
>> +                    const unsigned int qid,
>> +                    const unsigned int commit_id)
>> +{
>> +    req->qid = qid;
>> +    req->commit_id = commit_id;
>> +    req->flags = 0;
>> +}
>> +
>> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
>> +               __u32 cmd_op)
>> +{
>> +    sqe->opcode = IORING_OP_URING_CMD;
>> +
>> +    sqe->fd = ent->q->fuse_fd;
>> +    sqe->rw_flags = 0;
>> +    sqe->ioprio = 0;
>> +    sqe->off = 0;
>> +
>> +    sqe->cmd_op = cmd_op;
>> +    sqe->__pad1 = 0;
>> +}
>> +
>> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
>> +{
>> +    FuseQueue *q = opaque;
>> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
>> +
>> +    fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
>> +
>> +    sqe->addr = (uint64_t)(q->ent.iov);
>> +    sqe->len = 2;
>> +
>> +    fuse_uring_sqe_set_req_data(req, q->qid, 0);
>> +}
>> +
>> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
>> +{
>> +    /*
>> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
>> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is set by
>> +     * the kernel by default. Also, max_write should not exceed
>> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
>> +     */
>> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
>> +
>> +    if (!(out->flags & FUSE_MAX_PAGES)) {
>> +        /*
>> +         * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
>> +         *       qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
>> +         */
>> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
>> +                         + FUSE_BUFFER_HEADER_SIZE;
>> +    }
>> +
>> +    for (int i = 0; i < exp->num_queues; i++) {
>> +        FuseQueue *q = &exp->queues[i];
>> +
>> +        q->ent.q = q;
>> +
>> +        q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
> 
> It's probably easier to embed the header as a FuseRingEnt field instead
> of heap allocating it.
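> 
> I.e.:
> 
>     typedef struct FuseRingEnt {
>         ...
>         /* fuse request header and payload */
>         struct fuse_uring_req_header req_header;
>         void *op_payload;
>         ...
>     } FuseRingEnt;
> 
> with the first iov entry then initialized from &ent->req_header.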

Hmm well. So we have two additional patches in the DDN branch that I
haven't had time to upstream yet. These patches allow pinning these
buffers/pages, so that the application doing I/O can write directly
into the buffer - that saves context switches. The initial RFC kernel
patches were using mmapped buffers, and when I had to switch to
userspace buffers, performance dropped badly. I didn't run real
benchmarks, just xfstests - with mmapped buffers it was running about
3 times faster than legacy fuse, and that advantage got lost with
normal buffers that get mapped per request. Switching to pinned buffers
brought back the fast xfstest runs.
The issue is that the buffer needs to be page aligned - which is why
libfuse does an extra allocation here.
In libfuse I should probably make this optional, as pinned buffers
will mostly only work for root (they need locked memory).
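
For illustration, a page-aligned payload allocation in QEMU terms would
look roughly like this (just a sketch, not code from the patch, with
qemu_memalign()/qemu_vfree() as the obvious helpers):

    /* pinned/registered buffers must start on a page boundary */
    ent->op_payload = qemu_memalign(qemu_real_host_page_size(),
                                    ent->req_payload_sz);
    ...
    qemu_vfree(ent->op_payload);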

In principle I need to document these details somewhere; I should
probably write a blog post or so.


Cheers,
Bernd