On Thu, May 29, 2025 at 04:11:21PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:16PM -0400, Stefan Hajnoczi wrote:
> > AioContext has its own io_uring instance for file descriptor monitoring.
> > The disk I/O io_uring code was developed separately. Originally I
> > thought the characteristics of file descriptor monitoring and disk I/O
> > were too different, requiring separate io_uring instances.
> >
> > Now it has become clear to me that it's feasible to share a single
> > io_uring instance for file descriptor monitoring and disk I/O. We're not
> > using io_uring's IOPOLL feature or anything else that would require a
> > separate instance.
> >
> > Unify block/io_uring.c and util/fdmon-io_uring.c using the new
> > aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
> > block/io_uring.c just needs to submit readv/writev/fsync and most of the
> > io_uring-specific logic is handled by fdmon-io_uring.c.
> >
> > There are two immediate advantages:
> > 1. Fewer system calls. There is no need to monitor the disk I/O io_uring
> >    ring fd from the file descriptor monitoring io_uring instance. Disk
> >    I/O completions are now picked up directly. Also, sqes are
> >    accumulated in the sq ring until the end of the event loop iteration
> >    and there are fewer io_uring_enter(2) syscalls.
> > 2. Less code duplication.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
>
> Comments below, but looks sane to me.
>
> Reviewed-by: Eric Blake <[email protected]>
>
> >  include/block/aio.h     |   7 -
> >  include/block/raw-aio.h |   5 -
> >  block/file-posix.c      |  38 ++--
> >  block/io_uring.c        | 489 ++++++++++------------------------
> >  stubs/io_uring.c        |  32 ---
> >  util/async.c            |  35 ---
> >  util/fdmon-io_uring.c   |   6 +
> >  block/trace-events      |  12 +-
> >  stubs/meson.build       |   3 -
> >  util/trace-events       |   4 +
> >  10 files changed, 139 insertions(+), 492 deletions(-)
> >  delete mode 100644 stubs/io_uring.c
> >
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index 95beef28c3..fbb45cca74 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -291,8 +291,6 @@ struct AioContext {
> >      struct LinuxAioState *linux_aio;
> >  #endif
> >  #ifdef CONFIG_LINUX_IO_URING
> > -    LuringState *linux_io_uring;
> > -
> >      /* State for file descriptor monitoring using Linux io_uring */
> >      struct io_uring fdmon_io_uring;
> >      AioHandlerSList submit_list;
> > @@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
> >  /* Return the LinuxAioState bound to this AioContext */
> >  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
> >
> > -/* Setup the LuringState bound to this AioContext */
> > -LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> > -
> > -/* Return the LuringState bound to this AioContext */
> > -LuringState *aio_get_linux_io_uring(AioContext *ctx);
> >  /**
> >   * aio_timer_new_with_attrs:
> >   * @ctx: the aio context
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index 6570244496..30e5fc9a9f 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
> >  #endif
> >  /* io_uring.c - Linux io_uring implementation */
> >  #ifdef CONFIG_LINUX_IO_URING
> > -LuringState *luring_init(Error **errp);
> > -void luring_cleanup(LuringState *s);
> > -
> >  /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
> >  int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
> >                                    QEMUIOVector *qiov, int type,
> >                                    BdrvRequestFlags flags);
> > -void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> > -void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> >  bool luring_has_fua(void);
> >  #else
> >  static inline bool luring_has_fua(void)
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 9b5f08ccb2..d1f1fc3a77 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >      }
> >  #endif /* !defined(CONFIG_LINUX_AIO) */
> >
> > -#ifndef CONFIG_LINUX_IO_URING
> >      if (s->use_linux_io_uring) {
> > +#ifdef CONFIG_LINUX_IO_URING
> > +        if (!aio_has_io_uring()) {
>
> Compared to the old code... [1]
>
> > +            error_setg(errp, "aio=io_uring was specified, but is not "
> > +                       "available (disabled via io_uring_disabled "
> > +                       "sysctl or blocked by container runtime "
> > +                       "seccomp policy?)");
> > +            ret = -EINVAL;
> > +            goto fail;
> > +        }
> > +#else
> >          error_setg(errp, "aio=io_uring was specified, but is not supported "
> >                     "in this build.");
>
> While here, let's get rid of the trailing '.' in the error_setg call.
Sounds good.
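Something along these lines for v2, then (same message, just without the
trailing period; exact wording still to be finalized):

    error_setg(errp, "aio=io_uring was specified, but is not supported "
               "in this build");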
>
>
> > ret = -EINVAL;
> > goto fail;
> > - }
> > #endif /* !defined(CONFIG_LINUX_IO_URING) */
> > + }
> >
> > s->has_discard = true;
> > s->has_write_zeroes = true;
> > @@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
> > return true;
> > }
> >
> > -#ifdef CONFIG_LINUX_IO_URING
> > -static inline bool raw_check_linux_io_uring(BDRVRawState *s)
> > -{
> > - Error *local_err = NULL;
> > - AioContext *ctx;
> > -
> > - if (!s->use_linux_io_uring) {
> > - return false;
> > - }
> > -
> > - ctx = qemu_get_current_aio_context();
> > - if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
>
> [1]... is there a reason you dropped the unlikely() wrapper?
raw_check_linux_io_uring() was called from the I/O code path where a
branch prediction hint might make sense for performance reasons. The
raw_open_common() you mentioned above is not in the I/O code path.
Also, the new code in raw_open_common() is not the same as
raw_check_linux_io_uring(). aio_setup_linux_io_uring() no longer exists
in the new code. The patch removes this code entirely; it doesn't just
drop unlikely().
>
> > - error_reportf_err(local_err, "Unable to use linux io_uring, "
> > - "falling back to thread pool: ");
> > - s->use_linux_io_uring = false;
> > - return false;
> > - }
> > - return true;
> > -}
> > -#endif
> > -
> > #ifdef CONFIG_LINUX_AIO
> > static inline bool raw_check_linux_aio(BDRVRawState *s)
> > {
> > @@ -2595,7 +2583,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
> > if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
> > type |= QEMU_AIO_MISALIGNED;
> > #ifdef CONFIG_LINUX_IO_URING
> > - } else if (raw_check_linux_io_uring(s)) {
> > + } else if (s->use_linux_io_uring) {
> > assert(qiov->size == bytes);
> > ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
> > goto out;
> > @@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
> > };
> >
> > #ifdef CONFIG_LINUX_IO_URING
> > - if (raw_check_linux_io_uring(s)) {
> > + if (s->use_linux_io_uring) {
> > return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> > }
> > #endif
> > diff --git a/block/io_uring.c b/block/io_uring.c
> > index dd4f304910..dd930ee57e 100644
> > --- a/block/io_uring.c
> > +++ b/block/io_uring.c
> > @@ -11,28 +11,20 @@
> > #include "qemu/osdep.h"
> > #include <liburing.h>
> > #include "block/aio.h"
> > -#include "qemu/queue.h"
> > #include "block/block.h"
> > #include "block/raw-aio.h"
> > #include "qemu/coroutine.h"
> > -#include "qemu/defer-call.h"
> > -#include "qapi/error.h"
> > #include "system/block-backend.h"
> > #include "trace.h"
> >
> > -/* Only used for assertions. */
> > -#include "qemu/coroutine_int.h"
> > -
> > -/* io_uring ring size */
> > -#define MAX_ENTRIES 128
> > -
> > -typedef struct LuringAIOCB {
> > +typedef struct {
> > Coroutine *co;
> > - struct io_uring_sqe sqeq;
> > - ssize_t ret;
> > QEMUIOVector *qiov;
> > - bool is_read;
> > - QSIMPLEQ_ENTRY(LuringAIOCB) next;
> > + uint64_t offset;
> > + ssize_t ret;
> > + int type;
> > + int fd;
> > + BdrvRequestFlags flags;
> >
> > /*
> > * Buffered reads may require resubmission, see
> > @@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
> > */
> > int total_read;
> > QEMUIOVector resubmit_qiov;
> > -} LuringAIOCB;
> >
> > -typedef struct LuringQueue {
> > - unsigned int in_queue;
> > - unsigned int in_flight;
> > - bool blocked;
> > - QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
> > -} LuringQueue;
> > + CqeHandler cqe_handler;
> > +} LuringRequest;
> >
> > -struct LuringState {
> > - AioContext *aio_context;
> > -
> > - struct io_uring ring;
> > -
> > - /* No locking required, only accessed from AioContext home thread */
> > - LuringQueue io_q;
> > -
> > - QEMUBH *completion_bh;
> > -};
> > -
> > -/**
> > - * luring_resubmit:
> > - *
> > - * Resubmit a request by appending it to submit_queue. The caller must ensure
> > - * that ioq_submit() is called later so that submit_queue requests are started.
> > - */
> > -static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
> > +static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
> > {
> > - QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
> > - s->io_q.in_queue++;
> > + LuringRequest *req = opaque;
> > + QEMUIOVector *qiov = req->qiov;
> > + uint64_t offset = req->offset;
> > + int fd = req->fd;
> > + BdrvRequestFlags flags = req->flags;
> > +
> > + switch (req->type) {
> > + case QEMU_AIO_WRITE:
> > +#ifdef HAVE_IO_URING_PREP_WRITEV2
> > + {
> > + int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> > + io_uring_prep_writev2(sqe, fd, qiov->iov,
> > + qiov->niov, offset, luring_flags);
> > + }
> > +#else
> > + assert(flags == 0);
> > + io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
>
> Hmm. 'man io_uring_prep_writev2' states:
>
>        Unless an application explicitly needs to pass in more than one iovec,
>        it is more efficient to use io_uring_prep_write(3) rather than this
>        function, as no state has to be maintained for a non-vectored IO
>        request.
>
> Obviously, if we want luring_flags of RWF_DSYNC to be set, we have to
> use the newer interface; but if that flag is not present, should we be
> conditionally falling back to the simpler interface when qiov->niov ==
> 1?
Thanks for the suggestion. I will try it out and benchmark it for v2.
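Roughly what I have in mind for that experiment is the following untested
sketch, using the same locals as luring_prep_sqe() above (whether it is worth
keeping depends on the benchmark results):

    case QEMU_AIO_WRITE:
#ifdef HAVE_IO_URING_PREP_WRITEV2
    {
        int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;

        if (luring_flags == 0 && qiov->niov == 1) {
            /* Single iovec and no flags: use the non-vectored variant so
             * liburing doesn't have to keep state for a vectored request. */
            io_uring_prep_write(sqe, fd, qiov->iov[0].iov_base,
                                qiov->iov[0].iov_len, offset);
        } else {
            io_uring_prep_writev2(sqe, fd, qiov->iov,
                                  qiov->niov, offset, luring_flags);
        }
    }
#else
    assert(flags == 0);
    io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
#endif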
>
> In fact, even when qiov->niov > 1, can we unvector it ourselves into
> multiple io_uring_prep_write() calls, since the whole point of the
> uring is that we aren't making syscalls, so more ops on the uring
> should still be cheaper? But that's a question for a followup patch
> (still, given that your series is RFC for performance reasons, it may
> be worth investigating).
That's more complicated. I won't pursue that for the time being.
