On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> From: Juraj Marcin <[email protected]>
>
> Currently, when postcopy starts, the source VM starts switchover and
> sends a package containing the state of all non-postcopiable devices.
> When the destination loads this package, the switchover is complete and
> the destination VM starts. However, if the device state load fails or
> the destination side crashes, the source side is already in
> POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
> up-to-date machine state as the destination has not yet started.
>
> This patch introduces a new POSTCOPY_DEVICE state which is active
> while the destination machine is loading the device state, is not yet
> running, and the source side can be resumed in case of a migration
> failure.
>
> To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
> side uses a PONG message that is a response to a PING message processed
> just before the POSTCOPY_RUN command that starts the destination VM.
> Thus, this change does not require any changes on the destination side
> and is effective even with older destination versions.

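For reference, the source-side flow described in the quoted text can be
sketched as a small state machine. This is only a toy Python model under
my reading of the description, not QEMU code: the event names
"start_postcopy" and "pong" and the helper source_can_resume() are made
up for illustration.

```python
from enum import Enum


class SourceState(Enum):
    ACTIVE = "active"                    # precopy phase, source VM running
    POSTCOPY_DEVICE = "postcopy-device"  # destination is loading device state
    POSTCOPY_ACTIVE = "postcopy-active"  # destination VM has been started


# Hypothetical event names, used only in this sketch.
TRANSITIONS = {
    (SourceState.ACTIVE, "start_postcopy"): SourceState.POSTCOPY_DEVICE,
    # The PONG answers a PING processed just before POSTCOPY_RUN.
    (SourceState.POSTCOPY_DEVICE, "pong"): SourceState.POSTCOPY_ACTIVE,
}


def step(state, event):
    """Return the next source-side state, or raise on an unexpected event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected {event!r} in {state.value}") from None


def source_can_resume(state):
    """Assumption: the source may resume until it reaches postcopy-active."""
    return state is not SourceState.POSTCOPY_ACTIVE
```

A failure seen while the model is in POSTCOPY_DEVICE leaves
source_can_resume() true, which is the window the patch adds.
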
Thanks, this will help libvirt. We assume that a migration can be safely
aborted unless we successfully called "cont", and thus we just kill QEMU
on the destination. But since QEMU on the source has already entered
postcopy-active, we can't cancel the migration and the result is a
paused VM with no way of recovering it.

This series will make the situation better as the source will stay in
postcopy-device until the destination successfully loads device data.
There's still room for some enhancement though. Depending on how fast
this loading is, libvirt may issue cont before device data is loaded
(the destination is already in postcopy-active at this point). That
always succeeds as it only marks the domain to be autostarted, but the
actual start may fail later.

When discussing this with Juraj we agreed on introducing the new
postcopy-device state on the destination as well, to make sure libvirt
will only call cont once device data was successfully loaded so that we
always get a proper result when running cont. But it may still fail when
locking disks fails (not sure if this is the only way cont may fail). In
this case we cannot cancel the migration on the source as it is already
in postcopy-active, and we can't recover the migration either as the
CPUs are not running on the destination.

Ideally we'd have a way of canceling the migration in postcopy-active if
we are sure CPUs were not started yet. Alternatively, a possibility to
recover the migration would work as well.

Jirka
