Peter Xu <[email protected]> writes:

> Devices may opt in to migration FAILED notifiers, to be invoked when
> migration fails. Currently, the notifications happen in
> migration_cleanup(). That is normally fine, but maybe not ideal if
> there is a dependency between the fallback and the VM starting.
>
> This patch moves the FAILED notification earlier, so that if the
> failure happened during switchover, it will notify before the VM
> restarts.
>
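For anyone who wants the context: devices subscribe to these events with
migration_add_notifier(), and the handler receives a MigrationEvent whose
->type is one of MIG_EVENT_PRECOPY_SETUP / _DONE / _FAILED. A minimal
registration sketch (FooDevice and the foo_* names are made up for
illustration):

    #include "qemu/osdep.h"
    #include "qemu/notify.h"        /* NotifierWithReturn */
    #include "migration/misc.h"     /* migration_add_notifier() */

    typedef struct FooDevice {
        NotifierWithReturn migration_state;
        /* ... */
    } FooDevice;

    static int foo_migration_state_notifier(NotifierWithReturn *notifier,
                                            MigrationEvent *e, Error **errp)
    {
        /* e->type is MIG_EVENT_PRECOPY_SETUP, _DONE or _FAILED */
        return 0;
    }

    static void foo_realize(FooDevice *foo)
    {
        migration_add_notifier(&foo->migration_state,
                               foo_migration_state_notifier);
    }
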
The change to FAILED in patch 2 should come into this patch, to avoid
having a window where the notification only happens at the end.

> After walking over all existing FAILED notifier users, I reached the
> conclusion that this is also a cleaner approach, at least from a
> design POV.
>
> We have these notifier users, where the first two do not need to trap
> FAILED:
>
> |----------------------------+-------------------------------------+---------------------|
> | device                     | handler                             | events needed       |
> |----------------------------+-------------------------------------+---------------------|
> | gicv3                      | kvm_arm_gicv3_notifier              | DONE                |
> | vfio_iommufd / vfio_legacy | vfio_cpr_reboot_notifier            | SETUP               |
> | cpr-exec                   | cpr_exec_notifier                   | FAILED, DONE        |
> | virtio-net                 | virtio_net_migration_state_notifier | SETUP, FAILED       |
> | vfio                       | vfio_migration_state_notifier       | FAILED              |
> | vdpa                       | vdpa_net_migration_state_notifier   | SETUP, FAILED       |
> | spice [*]                  | migration_state_notifier            | SETUP, FAILED, DONE |
> |----------------------------+-------------------------------------+---------------------|
>
> For cpr-exec, it tries to clean up some cpr-exec specific fds and
> environment variables. This should be fine either way, as long as it
> happens before migration_cleanup().
>
> For virtio-net, we need to re-plug the primary device back into the
> guest in failover mode. Likely benign.
>
> VFIO needs to restart the device if FAILED. IIUC it should do that
> before vm_start(): if the VFIO device can be put into a STOP state due
> to migration, we should logically make it running again before the
> vCPUs run.
>
> VDPA will disable SVQ when migration has FAILED. Likely benign too,
> but it looks better if we can do it before resuming vCPUs.
>
> For spice, we should rely on "spice_server_migrate_end(false)" to
> retake ownership. Benign, but it looks more reasonable if the spice
> client does it before the VM runs again.
>
> Note that this change may introduce slightly more downtime if the
> migration fails exactly at the switchover phase. But that is very
> rare, and even if it happens, none of the above expects a long delay,
> only a short one that will likely be buried in the total downtime of
> the failed migration.
>
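To make the vfio/virtio-net/vdpa pattern above concrete: each of them
reacts to FAILED by undoing whatever migration did to the device, and the
point of this patch is that this now runs before vm_start(). Roughly (a
simplified sketch, not the actual upstream handlers; BarDevice and
bar_device_resume() are hypothetical):

    typedef struct BarDevice {
        NotifierWithReturn migration_state;
        /* ... */
    } BarDevice;

    static int bar_migration_state_notifier(NotifierWithReturn *notifier,
                                            MigrationEvent *e, Error **errp)
    {
        BarDevice *bar = container_of(notifier, BarDevice, migration_state);

        if (e->type == MIG_EVENT_PRECOPY_FAILED) {
            /*
             * Fallback: put the device back into a runnable state.  With
             * this patch this is guaranteed to run before vm_start(),
             * i.e. before vCPUs can touch the device again.
             */
            bar_device_resume(bar);
        }
        return 0;
    }
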
> Cc: Cédric Le Goater <[email protected]>
> Cc: Marc-André Lureau <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
>  migration/migration.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 91775f8472..1d9a2fc068 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1481,7 +1481,6 @@ static void migration_cleanup_json_writer(MigrationState *s)
>
>  static void migration_cleanup(MigrationState *s)
>  {
> -    MigrationEventType type;
>      QEMUFile *tmp = NULL;
>
>      trace_migration_cleanup();
> @@ -1535,9 +1534,15 @@ static void migration_cleanup(MigrationState *s)
>          /* It is used on info migrate. We can't free it */
>          error_report_err(error_copy(s->error));
>      }
> -    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
> -                                     MIG_EVENT_PRECOPY_DONE;
> -    migration_call_notifiers(s, type, NULL);
> +
> +    /*
> +     * FAILED notification should have already happened. Notify DONE if
> +     * migration completed successfully.
> +     */
> +    if (!migration_has_failed(s)) {
> +        migration_call_notifiers(s, MIG_EVENT_PRECOPY_DONE, NULL);
> +    }
> +
>      yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>  }
>
> @@ -3589,6 +3594,13 @@ static void migration_iteration_finish(MigrationState *s)
>          error_free(local_err);
>          break;
>      }
> +
> +    /*
> +     * Notify FAILED before starting VM, so that devices can invoke
> +     * necessary fallbacks before vCPUs run again.
> +     */
> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> +
>      if (runstate_is_live(s->vm_old_state)) {
>          if (!runstate_check(RUN_STATE_SHUTDOWN)) {
>              vm_start();
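FWIW, the resulting ordering is easier to review as a flattened call
flow; this is just the two new hunks above stitched together (simplified,
other failure paths omitted):

    /* migration_iteration_finish(), failure case: */
    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL); /* fallbacks first */
    if (runstate_is_live(s->vm_old_state) &&
        !runstate_check(RUN_STATE_SHUTDOWN)) {
        vm_start();                                   /* vCPUs resume afterwards */
    }

    /* ...and later, migration_cleanup(): */
    if (!migration_has_failed(s)) {
        migration_call_notifiers(s, MIG_EVENT_PRECOPY_DONE, NULL); /* success only */
    }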
