On Tue, Jul 22, 2025 at 02:06:04PM +0000, Bernd Schubert wrote:
> On 7/21/25 02:53, Stefan Hajnoczi wrote:
> > On Wed, Jul 16, 2025 at 02:38:24PM -0400, Brian Song wrote:
> >> This work provides an initial implementation of fuse-over-io_uring
> >> support for QEMU exports. According to the fuse-over-io_uring protocol
> >> specification, the userspace side must create the same number of
> >> queues as the number of CPUs (nr_cpu), just like the kernel. Currently,
> >> each queue contains only a single SQE entry, which is used to validate
> >> the correctness of the fuse-over-io_uring functionality.
> >>
> >> All FUSE read and write operations interact with the kernel via I/O
> >> vectors embedded in the SQE entry during submission and CQE fetching.
> >> The req_header and op_payload members of each entry are included as
> >> parts of the I/O vector: req_header carries the FUSE operation header,
> >> and op_payload carries the data payload, such as file attributes in a
> >> getattr reply, file content in a read reply, or file content being
> >> written to the FUSE client in a write operation.
> >>
> >> At present, multi-threading support is still incomplete. In addition,
> >> handling connection termination and managing the "drained" state of a
> >> FUSE block export in QEMU remain pending work.
> >>
> >> Suggested-by: Kevin Wolf <kw...@redhat.com>
> >> Suggested-by: Stefan Hajnoczi <stefa...@redhat.com>
> >> Signed-off-by: Brian Song <hibrians...@gmail.com>
> >>
> >> ---
> >>  block/export/fuse.c                  | 423 +++++++++++++++++++++++++--
> >>  docs/tools/qemu-storage-daemon.rst   |  10 +-
> >>  qapi/block-export.json               |   6 +-
> >>  storage-daemon/qemu-storage-daemon.c |   1 +
> >>  util/fdmon-io_uring.c                |   5 +-
> >>  5 files changed, 420 insertions(+), 25 deletions(-)
> >
> > Here is feedback from a first pass over this patch.
> >
> >>
> >> diff --git a/block/export/fuse.c b/block/export/fuse.c
> >> index c0ad4696ce..637d36186a 100644
> >> --- a/block/export/fuse.c
> >> +++ b/block/export/fuse.c
> >> @@ -48,6 +48,11 @@
> >>  #include <linux/fs.h>
> >>  #endif
> >>
> >> +#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
> >> +
> >> +/* room needed in buffer to accommodate header */
> >> +#define FUSE_BUFFER_HEADER_SIZE 0x1000
> >> +
> >>  /* Prevent overly long bounce buffer allocations */
> >>  #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> >>  /*
> >> @@ -64,6 +69,26 @@
> >>
> >>  typedef struct FuseExport FuseExport;
> >>
> >> +struct FuseQueue;
> >
> > Use "typedef struct FuseQueue FuseQueue;" here...
> >
> >> +
> >> +typedef struct FuseRingEnt {
> >> +    /* back pointer */
> >> +    struct FuseQueue *q;
> >
> > ...and then this can be "FuseQueue *q;" so that QEMU coding style is
> > followed.
> >
> >> +
> >> +    /* commit id of a fuse request */
> >> +    uint64_t req_commit_id;
> >> +
> >> +    /* fuse request header and payload */
> >> +    struct fuse_uring_req_header *req_header;
> >> +    void *op_payload;
> >> +    size_t req_payload_sz;
> >> +
> >> +    /* The vector passed to the kernel */
> >> +    struct iovec iov[2];
> >> +
> >> +    CqeHandler fuse_cqe_handler;
> >> +} FuseRingEnt;
> >> +
> >>  /*
> >>   * One FUSE "queue", representing one FUSE FD from which requests are
> >>   * fetched and processed. Each queue is tied to an AioContext.
> >> @@ -73,6 +98,7 @@ typedef struct FuseQueue {
> >>
> >>      AioContext *ctx;
> >>      int fuse_fd;
> >> +    int qid;
> >
> > Could this go inside #ifdef CONFIG_LINUX_IO_URING? It seems to be
> > specific to FUSE-over-io_uring.
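To make my earlier suggestion concrete, the io_uring-only state could be
grouped in a single conditional block, roughly like this (an untested
sketch; unrelated FuseQueue fields omitted):

typedef struct FuseQueue {
    FuseExport *exp;

    AioContext *ctx;
    int fuse_fd;

    /* ... request buffer fields ... */

#ifdef CONFIG_LINUX_IO_URING
    /* Only used by FUSE-over-io_uring */
    int qid;
    FuseRingEnt ent;
#endif
} FuseQueue;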
> >
> >>      /*
> >>       * The request buffer must be able to hold a full write, and/or at least
> >> @@ -109,6 +135,17 @@ typedef struct FuseQueue {
> >>       * Free this buffer with qemu_vfree().
> >>       */
> >>      void *spillover_buf;
> >> +
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +    FuseRingEnt ent;
> >> +
> >> +    /*
> >> +     * TODO
> >> +     * Support multi-threaded FUSE over io_uring by using eventfd and
> >> +     * allocating an extra SQE for each thread to be notified when the
> >> +     * connection shuts down.
> >> +     */
> >
> > eventfd and the extra SQE won't be necessary because
> > aio_bh_schedule_oneshot() can be used to cause threads to execute a
> > function.
> >
> > (I think this comment effectively says that connection shutdown still
> > needs to be implemented. The implementation details don't matter at
> > this point.)
> >
> >> +#endif
> >>  } FuseQueue;
> >>
> >>  /*
> >> @@ -148,6 +185,7 @@ struct FuseExport {
> >>      bool growable;
> >>      /* Whether allow_other was used as a mount option or not */
> >>      bool allow_other;
> >> +    bool is_uring;
> >>
> >>      mode_t st_mode;
> >>      uid_t st_uid;
> >> @@ -257,6 +295,126 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
> >>      .drained_poll = fuse_export_drained_poll,
> >>  };
> >>
> >> +#ifdef CONFIG_LINUX_IO_URING
> >> +static void coroutine_fn fuse_uring_co_process_request(FuseRingEnt *ent);
> >> +
> >> +static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
> >> +{
> >> +    CqeHandler *cqe_handler = opaque;
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt,
> >> +                                    fuse_cqe_handler);
> >
> > Passing ent in opaque instead of cqe_handler would simplify this.
> >
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    fuse_uring_co_process_request(ent);
> >> +
> >> +    fuse_dec_in_flight(exp);
> >> +}
> >> +
> >> +static void fuse_uring_cqe_handler(CqeHandler *cqe_handler)
> >> +{
> >> +    FuseRingEnt *ent = container_of(cqe_handler, FuseRingEnt,
> >> +                                    fuse_cqe_handler);
> >> +    FuseQueue *q = ent->q;
> >> +    Coroutine *co;
> >> +    FuseExport *exp = ent->q->exp;
> >> +
> >> +    int err = cqe_handler->cqe.res;
> >> +    if (err != 0) {
> >> +        /* TODO end_conn support */
> >> +
> >> +        /* -ENOTCONN is ok on umount */
> >> +        if (err != -EINTR && err != -EOPNOTSUPP &&
> >> +            err != -EAGAIN && err != -ENOTCONN) {
> >> +            fuse_export_halt(exp);
> >> +        }
> >> +    } else {
> >> +        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes,
> >> +                                   cqe_handler);
> >> +        /* Decremented by co_fuse_uring_queue_handle_cqes() */
> >> +        fuse_inc_in_flight(q->exp);
> >
> > Can this be moved inside co_fuse_uring_queue_handle_cqes() to avoid
> > calling inc/dec from different functions? That would make the code
> > easier to understand and more robust against future bugs.
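Combining this with the opaque suggestion above, the result would look
roughly like the following (untested sketch):

static void coroutine_fn co_fuse_uring_queue_handle_cqes(void *opaque)
{
    FuseRingEnt *ent = opaque;
    FuseExport *exp = ent->q->exp;

    /*
     * This runs synchronously upon coroutine entry, before any yield
     * point, so the inc/dec pair lives in a single function.
     */
    fuse_inc_in_flight(exp);
    fuse_uring_co_process_request(ent);
    fuse_dec_in_flight(exp);
}

and in fuse_uring_cqe_handler():

        co = qemu_coroutine_create(co_fuse_uring_queue_handle_cqes, ent);
        qemu_coroutine_enter(co);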
> >
> >> +        qemu_coroutine_enter(co);
> >> +    }
> >> +}
> >> +
> >> +static void fuse_uring_sqe_set_req_data(struct fuse_uring_cmd_req *req,
> >> +                                        const unsigned int qid,
> >> +                                        const unsigned int commit_id)
> >> +{
> >> +    req->qid = qid;
> >> +    req->commit_id = commit_id;
> >> +    req->flags = 0;
> >> +}
> >> +
> >> +static void fuse_uring_sqe_prepare(struct io_uring_sqe *sqe, FuseRingEnt *ent,
> >> +                                   __u32 cmd_op)
> >> +{
> >> +    sqe->opcode = IORING_OP_URING_CMD;
> >> +
> >> +    sqe->fd = ent->q->fuse_fd;
> >> +    sqe->rw_flags = 0;
> >> +    sqe->ioprio = 0;
> >> +    sqe->off = 0;
> >> +
> >> +    sqe->cmd_op = cmd_op;
> >> +    sqe->__pad1 = 0;
> >> +}
> >> +
> >> +static void fuse_uring_prep_sqe_register(struct io_uring_sqe *sqe, void *opaque)
> >> +{
> >> +    FuseQueue *q = opaque;
> >> +    struct fuse_uring_cmd_req *req = (void *)&sqe->cmd[0];
> >> +
> >> +    fuse_uring_sqe_prepare(sqe, &q->ent, FUSE_IO_URING_CMD_REGISTER);
> >> +
> >> +    sqe->addr = (uint64_t)(q->ent.iov);
> >> +    sqe->len = 2;
> >> +
> >> +    fuse_uring_sqe_set_req_data(req, q->qid, 0);
> >> +}
> >> +
> >> +static void fuse_uring_start(FuseExport *exp, struct fuse_init_out *out)
> >> +{
> >> +    /*
> >> +     * Since we didn't enable the FUSE_MAX_PAGES feature, the value of
> >> +     * fc->max_pages should be FUSE_DEFAULT_MAX_PAGES_PER_REQ, which is
> >> +     * set by the kernel by default. Also, max_write should not exceed
> >> +     * FUSE_DEFAULT_MAX_PAGES_PER_REQ * PAGE_SIZE.
> >> +     */
> >> +    size_t bufsize = out->max_write + FUSE_BUFFER_HEADER_SIZE;
> >> +
> >> +    if (!(out->flags & FUSE_MAX_PAGES)) {
> >> +        /*
> >> +         * bufsize = MIN(FUSE_DEFAULT_MAX_PAGES_PER_REQ *
> >> +         *     qemu_real_host_page_size() + FUSE_BUFFER_HEADER_SIZE, bufsize);
> >> +         */
> >> +        bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * qemu_real_host_page_size()
> >> +                  + FUSE_BUFFER_HEADER_SIZE;
> >> +    }
> >> +
> >> +    for (int i = 0; i < exp->num_queues; i++) {
> >> +        FuseQueue *q = &exp->queues[i];
> >> +
> >> +        q->ent.q = q;
> >> +
> >> +        q->ent.req_header = g_malloc0(sizeof(struct fuse_uring_req_header));
> >
> > It's probably easier to embed the header as a FuseRingEnt field instead
> > of heap allocating it.
>
> Hmm well. So we have two additional patches in the DDN branch that I
> didn't have time to upstream yet. These patches allow pinning these
> buffers/pages, and with that the application doing I/O can write
> directly into the buffer - that saves context switches. The initial
> RFC kernel patches used mmapped buffers, and when I had to switch to
> userspace buffers, performance went down badly. I didn't run real
> benchmarks, just xfstests - with mmapped buffers it ran about 3 times
> faster than legacy FUSE, and that advantage was lost with normal
> buffers that get mapped per request. Switching to pinned buffers
> brought back the fast xfstest runs.
> The issue is that the buffer needs to be page aligned - which is why
> libfuse does an extra allocation here.
> In libfuse I should probably make this optional, as pinned buffers
> will mostly only work for root (they need locked memory).
>
> In principle I need to document these details somewhere; I should
> probably create a blog post or so.
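Regarding the page alignment requirement: on the QEMU side I imagine
that would mean making the payload allocation page-aligned rather than
using g_malloc0(), e.g. with QEMU's existing qemu_memalign() helper. An
untested sketch (field names taken from this patch; the actual payload
allocation site isn't visible in the hunk above):

    q->ent.op_payload = qemu_memalign(qemu_real_host_page_size(),
                                      q->ent.req_payload_sz);
    memset(q->ent.op_payload, 0, q->ent.req_payload_sz);

    ...

    /* paired with qemu_vfree(), not g_free() */
    qemu_vfree(q->ent.op_payload);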
That sounds like something we'd like to try out, although we try to
minimize the privileges that qemu-storage-daemon needs. A question about
pinning and resource limits: qemu-storage-daemon uses O_DIRECT for I/O
as non-root without hitting mlock resource limits. Is there a difference
between pin_user_pages(FOLL_PIN) (as used for direct I/O) and
pin_user_pages(FOLL_PIN | FOLL_LONGTERM) (possibly what
FUSE-over-io_uring pinning would use) in terms of resource limit
behavior?

Thanks,
Stefan