On Fri, Oct 10, 2025 at 06:20:33PM +0200, Kevin Wolf wrote:
> Am 10.10.2025 um 17:23 hat Kevin Wolf geschrieben:
> > Am 10.09.2025 um 19:57 hat Stefan Hajnoczi geschrieben:
> > > Introduce the aio_add_sqe() API for submitting io_uring requests in the
> > > current AioContext. This allows other components in QEMU, like the block
> > > layer, to take advantage of io_uring features without creating their own
> > > io_uring context.
> > >
> > > This API supports nested event loops just like file descriptor
> > > monitoring and BHs do. This comes at a complexity cost: a BH is required
> > > to dispatch CQE callbacks and they are placed on a list so that a nested
> > > event loop can invoke its parent's pending CQE callbacks. If you're
> > > wondering why CqeHandler exists instead of just a callback function
> > > pointer, this is why.
> >
> > This is a mechanism that we know from other places in the code like the
> > Linux AIO or indeed the old io_uring block driver code, because a BH is
> > the only thing that makes sure that the main loop will call into the
> > code again later.
> >
> > Do we really need it here, though? This _is_ literally the main loop
> > implementation, we don't have to make the main loop call us.
> > .need_wait() checks io_uring_cq_ready(), so as long as there are
> > unprocessed completions, we know that .wait() will be called in nested
> > event loops. We just can't take more than one completion out of the
> > queue to process them later for this to work, but have to process them
> > one by one as we get them from the ring. But that's what we already do.
> >
> > Am I missing something?
>
> I think what I missed is that we probably don't want to call arbitrary
> callbacks from .wait(), but only in the dispatching phase. At the same
Yes, exactly. The polling/wait phase of aio_poll() is sensitive since it
measures time, holds ctx->list_lock, and has ctx->notify_me enabled. In
that state we cannot call arbitrary code.

> time, we need to fill the ready_list during .wait() and can't do that
> later, so we do have to go through all CQEs here. The only way I can see
> to get rid of the extra list - which I would really like to see - would
> be processing the CQEs twice (once during .wait() for the internal ones
> and once during the dispatch phase for the add_sqe() ones). That's a bit
> annoying.
>
> Either way, even if we keep the list, scheduling and cancelling BHs from
> the fdmon still doesn't feel quite right to me. I wonder if we shouldn't
> introduce a .dispatch() callback in FDMonOps that could run the
> cqe_handler_ready_list for fdmon-io_uring. That would also make the
> interface more consistent with the set of callbacks we have for GSource,
> and maybe eventually simplify deduplicating them.
>
> Then you also don't need the ugly optimisation in the next patch that
> fixes the slowness of scheduling BHs in .wait() by moving io_uring code
> to the AioContext core.

You're right, cqe_handler_bh is ugly. Either every dispatched callback
in the event loop should be its own BH (and BHs are the one and only
list of callbacks) or I can add the .dispatch() callback you suggested.
I need to play with the code a bit to compare the approaches. I'll
choose an approach for the next revision of this series.
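Roughly what the .dispatch() approach could look like (a toy model only,
not QEMU code: CqeHandler, FDMonOps, and cqe_handler_ready_list are named
in this thread, but the struct layouts and the helpers
cqe_handler_set_ready()/fdmon_io_uring_dispatch() below are invented for
illustration). The point is that .wait() only queues completions on the
ready list, while the callbacks run later in the dispatching phase:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Toy model only, not QEMU code. The CqeHandler is a struct rather than
 * a bare function pointer so that completed-but-undispatched entries can
 * be linked onto a ready list for the dispatching phase. */
typedef struct CqeHandler CqeHandler;
struct CqeHandler {
    void (*cb)(CqeHandler *h);       /* completion callback */
    int cqe_res;                     /* result copied out of the CQE */
    STAILQ_ENTRY(CqeHandler) next;   /* link on the ready list */
};

/* Invented stand-in for fdmon-io_uring's per-context state. */
typedef struct {
    STAILQ_HEAD(, CqeHandler) cqe_handler_ready_list;
} FdmonIoUringState;

/* The proposed FDMonOps extension: .dispatch() would be called from the
 * dispatching phase of aio_poll(), where arbitrary callbacks are safe
 * (no ctx->list_lock held, ctx->notify_me disabled). */
typedef struct {
    /* ... existing callbacks: update(), wait(), need_wait(), ... */
    void (*dispatch)(FdmonIoUringState *s);
} FDMonOps;

/* During .wait(): record the CQE result and queue the handler; do not
 * invoke the callback here. */
static void cqe_handler_set_ready(FdmonIoUringState *s, CqeHandler *h, int res)
{
    h->cqe_res = res;
    STAILQ_INSERT_TAIL(&s->cqe_handler_ready_list, h, next);
}

/* During dispatch: drain the ready list and run the callbacks. */
static void fdmon_io_uring_dispatch(FdmonIoUringState *s)
{
    CqeHandler *h;
    while ((h = STAILQ_FIRST(&s->cqe_handler_ready_list)) != NULL) {
        STAILQ_REMOVE_HEAD(&s->cqe_handler_ready_list, next);
        h->cb(h);
    }
}
```

A nested event loop draining the same ready list would naturally run its
parent's pending handlers too, which is the behavior the series wants.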
