Fiona Ebner <[email protected]> writes:

> The fact that the snapshot_save_job_bh() is scheduled in the main
> loop's qemu_aio_context AioContext means that it might get executed
> during a vCPU thread's aio_poll(). But saving of the VM state cannot
> happen while the guest or devices are active and can lead to assertion
> failures. See issue #2111 for two examples. Avoid the problem by
> scheduling the snapshot_save_job_bh() in the iohandler AioContext,
> which is not polled by vCPU threads.
>
> Solves Issue #2111.
>
> This change also solves the following issue:
>
> Since commit effd60c878 ("monitor: only run coroutine commands in
> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> right after starting the job anymore, but only after the job finished,
> which can take a long time. The reason is, because after commit
> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> coroutine cannot be entered immediately anymore, but needs to be
> scheduled to the main loop's qemu_aio_context AioContext. But
> snapshot_save_job_bh() was scheduled first to the same AioContext and
> thus gets executed first.
>
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> Signed-off-by: Fiona Ebner <[email protected]>
> ---
>
> While initial smoke testing seems fine, I'm not familiar enough with
> this to rule out any pitfalls with the approach. Any reason why
> scheduling to the iohandler AioContext could be wrong here?
FWIW, I couldn't find any reason why this would not work. But let's
see what Stefan and Paolo have to say.

> Should the same be done for the snapshot_load_job_bh and
> snapshot_delete_job_bh to keep it consistent?

Yep, I think it makes sense.

>
>  migration/savevm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c621f2359b..0086b76ab0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job,
> Error **errp)
>      SnapshotJob *s = container_of(job, SnapshotJob, common);
>      s->errp = errp;
>      s->co = qemu_coroutine_self();
> -    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +    aio_bh_schedule_oneshot(iohandler_get_aio_context(),
>                              snapshot_save_job_bh, job);
>      qemu_coroutine_yield();
>      return s->ret ? 0 : -1;
