On 09.10.25 22:16, Raphael Norwitz wrote:
> My apologies for the late review here. I appreciate the need to work
> around these issues, but I do feel the approach complicates QEMU
> significantly, and it may be possible to achieve similar results by
> managing state inside the backend. More comments inline.
>
> I like a lot of the cleanups here - maybe consider breaking out a
> series with some of the cleanups?

Of course, I thought about that too.
> On Wed, Aug 13, 2025 at 12:56 PM Vladimir Sementsov-Ogievskiy
> <[email protected]> wrote:
>> Hi all!
>>
>> Local migration of vhost-user-blk requires non-trivial actions from
>> the management layer: it has to provide a new connection for the new
>> QEMU process and handle moving disk operations from one connection
>> to the other.
>>
>> Such switching, including reinitialization of the vhost-user
>> connection, draining disk requests, etc., adds significantly to
>> local migration downtime.
> I see how draining IO requests adds downtime and is impactful. That
> said, we need to start-stop the device anyways
No, with this series and the new feature enabled we don't have this
drain; see

    if (dev->backend_transfer) {
        return 0;
    }

at the start of do_vhost_virtqueue_stop().
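For context, here is a minimal sketch of what that early return
short-circuits. Apart from dev->backend_transfer and the function
name, the signature and the "normal path" description are my
assumptions, modelled on the existing vhost_virtqueue_stop():

    /*
     * Sketch, not the actual patch: only the backend_transfer check is
     * taken from the series; the rest is assumed context.
     */
    static int do_vhost_virtqueue_stop(struct vhost_dev *dev,
                                       struct VirtIODevice *vdev,
                                       struct vhost_virtqueue *vq,
                                       unsigned idx)
    {
        if (dev->backend_transfer) {
            /*
             * The live vhost-user connection is handed over to the new
             * QEMU process untouched, so we skip the stop sequence.  In
             * particular, VHOST_USER_GET_VRING_BASE is never sent, and
             * it is that request which normally tells the backend to
             * stop the ring and finish in-flight requests.
             */
            return 0;
        }

        /*
         * Normal path: vhost_get_vring_base() reads the ring state back
         * from the backend (this is where the backend quiesces the
         * queue), then the vring regions are unmapped.
         */
        return 0;
    }
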
> so I'm not convinced
> that setting up mappings and sending messages back and forth are
> impactful enough to warrant adding a whole new migration mode. Am I
> missing anything here?
In the management layer we have to manage two endpoints for the remote
disk and orchestrate a safe switch from one to the other. That's a
complicated and often long procedure, which contributes an average
delay of 0.6 seconds, and (which is worse) ~2.4 seconds at p99.

Of course, you may say "just rewrite your management layer to work
better" :) But that's not simple, and we came to the idea that we can
do the whole local migration on the QEMU side, without touching the
backend at all.

The main benefit is fewer participants: we don't rely on the management
layer and the vhost-user server to do the right things for migration.
The backend doesn't even know that QEMU was updated. This makes the
whole process simpler and therefore safer.

The disk service may also be temporarily down at some point, which of
course hurts live migration and its freeze time. My series avoids this
issue, as we don't communicate with the backend in any way during
migration, and the disk service doesn't have to manage any endpoint
switching.
Note also that my series does not set a precedent in QEMU and is not a
totally new mode. Steve Sistare has been working on the idea of passing
backends through a UNIX socket; it is now merged as the cpr-transfer
and cpr-exec migration modes and supports VFIO devices. So my work
applies this existing concept to vhost-user-blk and virtio-net, and it
may be used as part of cpr-transfer / cpr-exec, or separately.
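For background (this is not code from the series or from cpr): the
OS-level mechanism such fd passing relies on is plain SCM_RIGHTS
ancillary data on a UNIX socket, which lets one process hand an open
file descriptor to another, roughly like this:

    /*
     * Standalone illustration of SCM_RIGHTS fd passing, not code from
     * the series: just the kernel mechanism that lets the old QEMU hand
     * its open vhost-user socket (and other fds) to the new QEMU.
     */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_fd(int unix_sock, int fd_to_pass)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;   /* ensure cmsg alignment */
        } control;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = control.buf,
            .msg_controllen = sizeof(control.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        /* The receiver gets its own reference to the open file, which
         * stays valid even after the sending process exits. */
        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
    }

The new QEMU then uses the received fds exactly as the old one did, so
from the backend's point of view nothing has happened.
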
>> This all leads to an idea: why not just pass everything we need from
>> the old QEMU process to the new one (including open file
>> descriptors) and not touch the backend at all? This way, the
>> vhost-user backend server will not even know that the QEMU process
>> has changed, as the live vhost-user connection is migrated.
> Alternatively, if it really is about avoiding IO draining, what if
> QEMU advertised a new vhost-user protocol feature which would query
> whether the backend already has state for the device? Then, if the
> backend indicates that it does, QEMU and the backend can take a
> different path in vhost-user, exchanging relevant information,
> including the descriptor indexes for the VQs, such that draining can
> be avoided. I expect that could be implemented to cut down a lot of
> the other vhost-user overhead anyway (i.e. you could skip setting the
> memory table). If nothing else it would probably help other device
> types take advantage of this without adding more options to QEMU.
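If I read the suggestion right, the negotiation would look roughly
like the sketch below. To be clear, the feature bit and the state
layout are hypothetical; nothing like this exists in the vhost-user
spec today:

    /*
     * Hypothetical only: neither this protocol feature bit nor the
     * state exchange exists in the vhost-user specification; it just
     * sketches the alternative described above.
     */
    #include <stdint.h>

    #define VHOST_USER_PROTOCOL_F_BACKEND_HAS_STATE  42  /* made-up bit */

    typedef struct BackendVqState {
        uint32_t vq_index;
        uint16_t last_avail_idx;   /* where the backend left off */
        uint16_t used_idx;
    } BackendVqState;

    /*
     * Front-end flow (sketch): if the feature bit is negotiated and the
     * backend reports existing state for the device, skip SET_MEM_TABLE
     * and ring setup, fetch a BackendVqState per queue instead of
     * draining via GET_VRING_BASE, and resume from those indexes.
     */
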
Hmm, if we're talking only about draining, then as I understand it the
only thing we need is support for migrating the "inflight region".
That is done in this series, and we are also preparing a separate
feature to support migrating the inflight region for remote migration.

But for local migration we want more: to remove the disk service from
the process entirely, so that live updates have a guaranteed small
downtime, independent of any problems that may occur on the disk
service side.

Why is freeze time more sensitive for live updates than for remote
migration? Because we have to run a lot of live-update operations: we
simply update all the VMs in the cloud to a new version. Remote
migration happens much less frequently: when we need to move all VMs
off a physical server to reboot it (or repair it, service it, etc.).

So I still believe that migrating backend state through the QEMU
migration stream makes sense in general, and for vhost-user-blk it
works well too.
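To make "migrating backend state through the QEMU migration stream" a
bit more concrete, here is a purely hypothetical vmstate sketch of the
kind of per-device metadata involved. The struct and field names are
mine, not the series'; the fds themselves (vhost-user socket, inflight
region, ...) obviously cannot go through vmstate and have to travel
over the migration channel's UNIX socket:

    /*
     * Hypothetical sketch, not the series' code: all names below are
     * invented for illustration.
     */
    #include "qemu/osdep.h"
    #include "migration/vmstate.h"

    typedef struct VhostUserBlkTransferState {
        uint64_t inflight_size;    /* size of the shared inflight region */
        uint16_t inflight_queues;  /* number of queues it covers */
        uint64_t acked_features;   /* features negotiated with the backend */
        bool started;              /* whether the backend was running */
    } VhostUserBlkTransferState;

    static const VMStateDescription vmstate_vhost_user_blk_transfer = {
        .name = "vhost-user-blk/backend-transfer",
        .version_id = 1,
        .minimum_version_id = 1,
        .fields = (const VMStateField[]) {
            VMSTATE_UINT64(inflight_size, VhostUserBlkTransferState),
            VMSTATE_UINT16(inflight_queues, VhostUserBlkTransferState),
            VMSTATE_UINT64(acked_features, VhostUserBlkTransferState),
            VMSTATE_BOOL(started, VhostUserBlkTransferState),
            VMSTATE_END_OF_LIST()
        },
    };
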
--
Best regards,
Vladimir