On Fri, Oct 10, 2025 at 06:20:33PM +0200, Kevin Wolf wrote:
> Am 10.10.2025 um 17:23 hat Kevin Wolf geschrieben:
> > Am 10.09.2025 um 19:57 hat Stefan Hajnoczi geschrieben:
> > > Introduce the aio_add_sqe() API for submitting io_uring requests in the
> > > current AioContext. This allows other components in QEMU, like the block
> > > layer, to take advantage of io_uring features without creating their own
> > > io_uring context.
> > > 
> > > This API supports nested event loops just like file descriptor
> > > monitoring and BHs do. This comes at a complexity cost: a BH is required
> > > to dispatch CQE callbacks and they are placed on a list so that a nested
> > > event loop can invoke its parent's pending CQE callbacks. If you're
> > > wondering why CqeHandler exists instead of just a callback function
> > > pointer, this is why.
> > 
> > This is a mechanism that we know from other places in the code like the
> > Linux AIO or indeed the old io_uring block driver code, because a BH is
> > the only thing that makes sure that the main loop will call into the
> > code again later.
> > 
> > Do we really need it here, though? This _is_ literally the main loop
> > implementation, we don't have to make the main loop call us.
> > .need_wait() checks io_uring_cq_ready(), so as long as there are
> > unprocessed completions, we know that .wait() will be called in nested
> > event loops. For this to work we just can't take more than one
> > completion out of the queue to process later; we have to process them
> > one by one as we get them from the ring. But that's what we already do.
> > 
> > Am I missing something?
> 
> I think what I missed is that we probably don't want to call arbitrary
> callbacks from .wait(), but only in the dispatching phase. At the same

Yes, exactly. The polling/wait phase of aio_poll() is sensitive since it
measures time, holds ctx->list_lock, and has ctx->notify_me enabled. In
that state we cannot call arbitrary code.
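To make the constraint concrete, here is a minimal self-contained sketch
(not QEMU code; Handler, wait_phase, dispatch_phase and the phase
constants are invented for illustration) of the two-phase structure:
the wait phase, which in QEMU would hold ctx->list_lock and have
ctx->notify_me set, only collects ready handlers, and user callbacks
run later in the dispatch phase once that sensitive state is gone:

```c
/* Illustrative sketch only: a minimal two-phase loop modelled on the
 * shape of aio_poll(), with invented names. The wait phase must not
 * call h->cb() because in QEMU it would run with ctx->list_lock held
 * and ctx->notify_me set; it only records which handlers are ready. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Handler {
    void (*cb)(struct Handler *h);
    bool ready;
    int invoked_in_phase;   /* records in which phase cb actually ran */
} Handler;

enum { PHASE_WAIT = 1, PHASE_DISPATCH = 2 };
static int current_phase;

/* Wait phase: sensitive section, collect ready handlers only. */
static void wait_phase(Handler **ready_list, size_t *n_ready,
                       Handler *handlers, size_t n)
{
    current_phase = PHASE_WAIT;
    for (size_t i = 0; i < n; i++) {
        if (handlers[i].ready) {
            ready_list[(*n_ready)++] = &handlers[i]; /* queue, don't call */
        }
    }
}

/* Dispatch phase: now it is safe to run arbitrary user callbacks. */
static void dispatch_phase(Handler **ready_list, size_t n_ready)
{
    current_phase = PHASE_DISPATCH;
    for (size_t i = 0; i < n_ready; i++) {
        ready_list[i]->cb(ready_list[i]);
    }
}

static void record_cb(Handler *h)
{
    h->invoked_in_phase = current_phase;
}
```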

> time, we need to fill the ready_list during .wait() and can't do that
> later, so we do have to go through all CQEs here. The only way I can see
> to get rid of the extra list - which I would really like to see - would
> be processing the CQEs twice (once during .wait() for the internal ones
> and once during the dispatch phase for the add_sqe() ones). That's a bit
> annoying.
> 
> Either way, even if we keep the list, scheduling and cancelling BHs from
> the fdmon still doesn't feel quite right to me. I wonder if we shouldn't
> introduce a .dispatch() callback in FDMonOps that could run the
> cqe_handler_ready_list for fdmon-io_uring. That would also make the
> interface more consistent with the set of callbacks we have for GSource,
> and maybe eventually simplify deduplicating them.
> 
> Then you also don't need the ugly optimisation in the next patch that
> fixes the slowness of scheduling BHs in .wait() by moving io_uring code
> to the AioContext core.

You're right, cqe_handler_bh is ugly. Either every dispatched callback
in the event loop should be its own BH (and BHs are the one and only
list of callbacks) or I can add the .dispatch() callback you suggested.
I need to play with the code a bit to compare the approaches. I'll
choose an approach for the next revision of this series.
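For discussion, here is roughly the shape I have in mind for the
.dispatch() direction. None of this is real QEMU code: FdmonOps and Ctx
below are stand-ins with invented fields, and the CqeHandler ready list
is simplified to an array. The idea is that .wait() drains the ring and
only queues aio_add_sqe() completions, and the event loop later calls
.dispatch() to run the queued CQE handlers, instead of the fdmon
scheduling a BH from inside .wait():

```c
/* Hedged sketch of a possible FDMonOps-style .dispatch() split; all
 * names are invented stand-ins, not the real QEMU interface. */
#include <assert.h>
#include <stddef.h>

typedef struct CqeHandler {
    void (*cb)(struct CqeHandler *h, int res);
    int res;     /* simulated CQE result */
    int done;    /* set by cb, for the example */
} CqeHandler;

#define MAX_READY 8

typedef struct {
    CqeHandler *ready[MAX_READY];  /* cqe_handler_ready_list stand-in */
    size_t n_ready;
} Ctx;

typedef struct {
    /* wait(): drain completions; external handlers are only queued */
    void (*wait)(Ctx *ctx, CqeHandler **pending, size_t n);
    /* dispatch(): run queued handlers in the dispatch phase */
    void (*dispatch)(Ctx *ctx);
} FdmonOps;

static void fdmon_wait(Ctx *ctx, CqeHandler **pending, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        ctx->ready[ctx->n_ready++] = pending[i];  /* queue, don't call */
    }
}

static void fdmon_dispatch(Ctx *ctx)
{
    while (ctx->n_ready > 0) {
        CqeHandler *h = ctx->ready[--ctx->n_ready];
        h->cb(h, h->res);
    }
}

static const FdmonOps fdmon_sketch = {
    .wait = fdmon_wait,
    .dispatch = fdmon_dispatch,
};

static void mark_done(CqeHandler *h, int res)
{
    h->done = res;
}
```

This would mirror the GSource prepare/check/dispatch split and keep BH
scheduling out of the fdmon entirely.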
