On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> From: Juraj Marcin <[email protected]>
> 
> Currently, when postcopy starts, the source VM starts switchover and
> sends a package containing the state of all non-postcopiable devices.
> When the destination loads this package, the switchover is complete and
> the destination VM starts. However, if the device state load fails or
> the destination side crashes, the source side is already in
> POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> up-to-date machine state as the destination has not yet started.
> 
> This patch introduces a new POSTCOPY_DEVICE state which is active
> while the destination machine is loading the device state, is not yet
> running, and the source side can be resumed in case of a migration
> failure.
> 
> To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> side uses a PONG message that is a response to a PING message processed
> just before the POSTCOPY_RUN command that starts the destination VM.
> Thus, this change does not require any changes on the destination side
> and is effective even with older destination versions.

Thanks, this will help libvirt. We assume that the migration can be
safely aborted unless we successfully called "cont", and thus on failure
we just kill QEMU on the destination. But since QEMU on the source already
entered postcopy-active, we can't cancel the migration and the result is
a paused VM with no way of recovering it.

This series will make the situation better as the source will stay in
postcopy-device until the destination successfully loads device data.
There's still room for some enhancement though. Depending on how fast
this loading is, libvirt may issue cont before device data is loaded (the
destination is already in postcopy-active at this point), which always
succeeds as it only marks the domain to be autostarted, but the actual
start may fail later. When discussing this with Juraj we agreed on
introducing the new postcopy-device state on the destination as well to
make sure libvirt will only call cont once device data was successfully
loaded so that we always get a proper result when running cont. But it
may still fail when locking disks fails (not sure if this is the only
way cont may fail). In this case we cannot cancel the migration on the
source as it is already in postcopy-active and we can't recover
migration either as the CPUs are not running on the destination. Ideally
we'd have a way of canceling the migration in postcopy-active if we are
sure CPUs were not started yet. Alternatively a possibility to recover
migration would work as well.
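
To make the source-side states discussed above concrete, here is a rough
sketch of the transitions as I understand them from the patch description.
It is only an illustration, not QEMU or libvirt code: the event names
("postcopy-start", "pong", "destination-failed"), the FAILED state and the
can_cancel() helper are all made up for this example.

```python
from enum import Enum


class SourceState(Enum):
    ACTIVE = "active"                    # precopy phase
    POSTCOPY_DEVICE = "postcopy-device"  # destination is loading device state
    POSTCOPY_ACTIVE = "postcopy-active"  # PONG received, destination may run
    FAILED = "failed"                    # hypothetical: aborted, source resumable


def on_event(state, event):
    """Return the next source state; event names are hypothetical."""
    if state is SourceState.ACTIVE and event == "postcopy-start":
        return SourceState.POSTCOPY_DEVICE
    if state is SourceState.POSTCOPY_DEVICE and event == "pong":
        # The PONG answers the PING processed just before POSTCOPY_RUN.
        return SourceState.POSTCOPY_ACTIVE
    if state is SourceState.POSTCOPY_DEVICE and event == "destination-failed":
        return SourceState.FAILED
    return state


def can_cancel(state):
    """Cancelling is safe only while the source still owns the newest state."""
    return state in (SourceState.ACTIVE, SourceState.POSTCOPY_DEVICE)


state = on_event(SourceState.ACTIVE, "postcopy-start")
print(state.value, can_cancel(state))
state = on_event(state, "pong")
print(state.value, can_cancel(state))
```

The remaining gap is the last step: once the source is in postcopy-active,
can_cancel() is false even if the CPUs never started on the destination.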

Jirka

