Peter Xu <pet...@redhat.com> writes:

> This patch proposes a flag to maintain disk activation status globally. It
> mostly rewrites disk activation mgmt for QEMU, including COLO and QMP
> command xen_save_devices_state.
>
> Backgrounds
> ===========
>
> We have two problems on disk activations, one resolved, one not.
>
> Problem 1: disk activation recover (for switchover interruptions)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> When migration is either cancelled or failed during switchover, especially
> when after the disks are inactivated, QEMU needs to remember re-activate
> the disks again before vm starts.
>
> It used to be done separately in two paths: one in qmp_migrate_cancel(),
> the other one in the failure path of migration_completion().
>
> It used to be fixed in different commits, all over the places in QEMU. So
> these are the relevant changes I saw, I'm not sure if it's complete list:
>
> - In 2016, commit fe904ea824 ("migration: regain control of images when
>   migration fails to complete")
>
> - In 2017, commit 1d2acc3162 ("migration: re-active images while migration
>   been canceled after inactive them")
>
> - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
>   more failure scenarios")
>
> Now since we have a slightly better picture maybe we can unify the
> reactivation in a single path.
>
> One side benefit of doing so is, we can move the disk operation outside QMP
> command "migrate_cancel". It's possible that in the future we may want to
> make "migrate_cancel" be OOB-compatible, while that requires the command
> doesn't need BQL in the first place. This will already do that and make
> migrate_cancel command lightweight.
>
> Problem 2: disk invalidation on top of invalidated disks
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This is an unresolved bug for current QEMU. Link in "Resolves:" at the
> end. It turns out besides the src switchover phase (problem 1 above), QEMU
> also needs to remember block activation on destination.
>
> Consider two continuous migration in a row, where the VM was always paused.
> In that scenario, the disks are not activated even until migration
> completed in the 1st round. When the 2nd round starts, if QEMU doesn't
> know the status of the disks, it needs to try inactivate the disk again.
>
> Here the issue is the block layer API bdrv_inactivate_all() will crash a
> QEMU if invoked on already inactive disks for the 2nd migration. For
> detail, see the bug link at the end.
>
> Implementation
> ==============
>
> This patch proposes to maintain disk activation with a global flag, so we
> know:
>
> - If we used to inactivate disks for migration, but migration got
>   cancelled, or failed, QEMU will know it should reactivate the disks.
>
> - On incoming side, if the disks are never activated but then another
>   migration is triggered, QEMU should be able to tell that inactivate is
>   not needed for the 2nd migration.
>
> We used to have disk_inactive, but it only solves the 1st issue, not the
> 2nd. Also, it's done in completely separate paths so it's extremely hard
> to follow either how the flag changes, or the duration that the flag is
> valid, and when we will reactivate the disks.
>
> Convert the existing disk_inactive flag into that global flag (also invert
> its naming), and maintain the disk activation status for the whole
> lifecycle of qemu. That includes the incoming QEMU.
>
> Put both of the error cases of source migration (failure, cancelled)
> together into migration_iteration_finish(), which will be invoked for
> either of the scenario. So from that part QEMU should behave the same as
> before. However with such global maintenance on disk activation status, we
> not only cleanup quite a few temporary paths that we try to maintain the
> disk activation status (e.g. in postcopy code), meanwhile it fixes the
> crash for problem 2 in one shot.
>
> For freshly started QEMU, the flag is initialized to TRUE showing that the
> QEMU owns the disks by default.
>
> For incoming migrated QEMU, the flag will be initialized to FALSE once and
> for all showing that the dest QEMU doesn't own the disks until switchover.
> That is guaranteed by the "once" variable.
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
> Signed-off-by: Peter Xu <pet...@redhat.com>
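
Just to restate my understanding of the flag semantics described above,
here is a tiny standalone model of the idempotent activate/inactivate
tracking (not QEMU code; the names, the stubbed-out bdrv_* calls and the
printouts are purely illustrative):

  #include <stdbool.h>
  #include <stdio.h>

  static bool block_active = true;   /* fresh QEMU owns its disks */

  static bool model_inactivate(void)
  {
      if (!block_active) {
          puts("inactivate: skipped (already inactive)");
          return true;               /* no second bdrv_inactivate_all() */
      }
      puts("inactivate: would call bdrv_inactivate_all()");
      block_active = false;
      return true;
  }

  static bool model_activate(void)
  {
      if (block_active) {
          puts("activate: skipped (already active)");
          return true;
      }
      puts("activate: would call bdrv_activate_all()");
      block_active = true;
      return true;
  }

  int main(void)
  {
      /*
       * Two switchovers of a paused VM in a row (the issue 2395
       * scenario): the second inactivation becomes a no-op instead of
       * a crash, and a later cancel/failure recovery re-activates once.
       */
      model_inactivate();   /* 1st migration switchover */
      model_inactivate();   /* 2nd migration: skipped */
      model_activate();     /* cancel/failure recovery */
      return 0;
  }
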
Reviewed-by: Fabiano Rosas <faro...@suse.de>

Just a note about errp below and a comment about a pre-existing
condition, but nothing that requires action.

> ---
>  include/migration/misc.h |  4 ++
>  migration/migration.h    |  6 +--
>  migration/block-active.c | 94 ++++++++++++++++++++++++++++++++++++++++
>  migration/colo.c         |  2 +-
>  migration/migration.c    | 80 ++++++++--------------------
>  migration/savevm.c       | 33 ++++++--------
>  monitor/qmp-cmds.c       |  8 +---
>  migration/meson.build    |  1 +
>  migration/trace-events   |  3 ++
>  9 files changed, 140 insertions(+), 91 deletions(-)
>  create mode 100644 migration/block-active.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 804eb23c06..e68a473feb 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -106,4 +106,8 @@ bool migration_incoming_postcopy_advised(void);
>  /* True if background snapshot is active */
>  bool migration_in_bg_snapshot(void);
>
> +/* Wrapper for block active/inactive operations */
> +bool migration_block_activate(Error **errp);
> +bool migration_block_inactivate(void);
> +
>  #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 3857905c0e..fab3cad2b9 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -370,9 +370,6 @@ struct MigrationState {
>      /* Flag set once the migration thread is running (and needs joining) */
>      bool migration_thread_running;
>
> -    /* Flag set once the migration thread called bdrv_inactivate_all */
> -    bool block_inactive;
> -
>      /* Migration is waiting for guest to unplug device */
>      QemuSemaphore wait_unplug_sem;
>
> @@ -556,4 +553,7 @@ void migration_bitmap_sync_precopy(bool last_stage);
>  /* migration/block-dirty-bitmap.c */
>  void dirty_bitmap_mig_init(void);
>
> +/* migration/block-active.c */
> +void migration_block_active_setup(bool active);
> +
>  #endif
> diff --git a/migration/block-active.c b/migration/block-active.c
> new file mode 100644
> index 0000000000..d477cf8182
> --- /dev/null
> +++ b/migration/block-active.c
> @@ -0,0 +1,94 @@
> +/*
> + * Block activation tracking for migration purpose
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2024 Red Hat, Inc.
> + */
> +#include "qemu/osdep.h"
> +#include "block/block.h"
> +#include "qapi/error.h"
> +#include "migration/migration.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +/*
> + * Migration-only cache to remember the block layer activation status.
> + * Protected by BQL.
> + *
> + * We need this because..
> + *
> + * - Migration can fail after block devices are invalidated (during
> + *   switchover phase). When that happens, we need to be able to recover
> + *   the block drive status by re-activating them.
> + *
> + * - Currently bdrv_inactivate_all() is not safe to be invoked on top of
> + *   invalidated drives (even if bdrv_activate_all() is actually safe to be
> + *   called any time!). It means remembering this could help migration to
> + *   make sure it won't invalidate twice in a row, crashing QEMU. It can
> + *   happen when we migrate a PAUSED VM from host1 to host2, then migrate
> + *   again to host3 without starting it. TODO: a cleaner solution is to
> + *   allow safe invoke of bdrv_inactivate_all() at anytime, like
> + *   bdrv_activate_all().
> + *
> + * For freshly started QEMU, the flag is initialized to TRUE reflecting the
> + * scenario where QEMU owns block device ownerships.
> + *
> + * For incoming QEMU taking a migration stream, the flag is initialized to
> + * FALSE reflecting that the incoming side doesn't own the block devices,
> + * not until switchover happens.
> + */
> +static bool migration_block_active;
> +
> +/* Setup the disk activation status */
> +void migration_block_active_setup(bool active)
> +{
> +    migration_block_active = active;
> +}
> +
> +bool migration_block_activate(Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    assert(bql_locked());
> +
> +    if (migration_block_active) {
> +        trace_migration_block_activation("active-skipped");
> +        return true;
> +    }
> +
> +    trace_migration_block_activation("active");
> +
> +    bdrv_activate_all(errp);
> +    if (*errp) {
> +        error_report_err(error_copy(*errp));
> +        return false;
> +    }
> +
> +    migration_block_active = true;
> +    return true;
> +}
> +
> +bool migration_block_inactivate(void)
> +{
> +    int ret;
> +
> +    assert(bql_locked());
> +
> +    if (!migration_block_active) {
> +        trace_migration_block_activation("inactive-skipped");
> +        return true;
> +    }
> +
> +    trace_migration_block_activation("inactive");
> +
> +    ret = bdrv_inactivate_all();
> +    if (ret) {
> +        error_report("%s: bdrv_inactivate_all() failed: %d",
> +                     __func__, ret);
> +        return false;
> +    }
> +
> +    migration_block_active = false;
> +    return true;
> +}
> diff --git a/migration/colo.c b/migration/colo.c
> index 9590f281d0..ae3387a7a4 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -836,7 +836,7 @@ static void *colo_process_incoming_thread(void *opaque)
>
>      /* Make sure all file formats throw away their mutable metadata */
>      bql_lock();
> -    bdrv_activate_all(&local_err);
> +    migration_block_activate(&local_err);
>      bql_unlock();
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/migration/migration.c b/migration/migration.c
> index ba5deec5bc..d755ccb03d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -738,7 +738,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>
>  static void process_incoming_migration_bh(void *opaque)
>  {
> -    Error *local_err = NULL;
>      MigrationIncomingState *mis = opaque;
>
>      trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter");
> @@ -769,11 +768,7 @@ static void process_incoming_migration_bh(void *opaque)
>           * Make sure all file formats throw away their mutable
>           * metadata. If error, don't restart the VM yet.
>           */
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -            local_err = NULL;
> -        } else {
> +        if (migration_block_activate(NULL)) {
>              vm_start();
>          }
>      } else {
> @@ -1552,16 +1547,6 @@ static void migrate_fd_cancel(MigrationState *s)
>              }
>          }
>      }
> -    if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) {
> -        Error *local_err = NULL;
> -
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        } else {
> -            s->block_inactive = false;
> -        }
> -    }
>  }
>
>  void migration_add_notifier_mode(NotifierWithReturn *notify,
> @@ -1860,6 +1845,12 @@ void qmp_migrate_incoming(const char *uri, bool has_channels,
>          return;
>      }
>
> +    /*
> +     * Newly setup incoming QEMU. Mark the block active state to reflect
> +     * that the src currently owns the disks.
> +     */
> +    migration_block_active_setup(false);
> +
>      once = false;
>  }
>
> @@ -2512,7 +2503,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>      QIOChannelBuffer *bioc;
>      QEMUFile *fb;
>      uint64_t bandwidth = migrate_max_postcopy_bandwidth();
> -    bool restart_block = false;
>      int cur_state = MIGRATION_STATUS_ACTIVE;
>
>      if (migrate_postcopy_preempt()) {
> @@ -2548,13 +2538,10 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>          goto fail;
>      }
>
> -    ret = bdrv_inactivate_all();
> -    if (ret < 0) {
> -        error_setg_errno(errp, -ret, "%s: Failed in bdrv_inactivate_all()",
> -                         __func__);
> +    if (!migration_block_inactivate()) {
> +        error_setg(errp, "%s: Failed in bdrv_inactivate_all()", __func__);
>          goto fail;
>      }
> -    restart_block = true;
>
>      /*
>       * Cause any non-postcopiable, but iterative devices to
> @@ -2624,8 +2611,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>          goto fail_closefb;
>      }
>
> -    restart_block = false;
> -
>      /* Now send that blob */
>      if (qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage)) {
>          error_setg(errp, "%s: Failed to send packaged data", __func__);
> @@ -2670,17 +2655,7 @@ fail_closefb:
>  fail:
>      migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                        MIGRATION_STATUS_FAILED);
> -    if (restart_block) {
> -        /* A failure happened early enough that we know the destination hasn't
> -         * accessed block devices, so we're safe to recover.
> -         */
> -        Error *local_err = NULL;
> -
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        }
> -    }
> +    migration_block_activate(NULL);
>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>      bql_unlock();
>      return -1;
> @@ -2778,31 +2753,6 @@ static void migration_completion_postcopy(MigrationState *s)
>      trace_migration_completion_postcopy_end_after_complete();
>  }
>
> -static void migration_completion_failed(MigrationState *s,
> -                                        int current_active_state)
> -{
> -    if (s->block_inactive && (s->state == MIGRATION_STATUS_ACTIVE ||
> -                              s->state == MIGRATION_STATUS_DEVICE)) {
> -        /*
> -         * If not doing postcopy, vm_start() will be called: let's
> -         * regain control on images.
> -         */
> -        Error *local_err = NULL;
> -
> -        bql_lock();
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        } else {
> -            s->block_inactive = false;
> -        }
> -        bql_unlock();
> -    }
> -
> -    migrate_set_state(&s->state, current_active_state,
> -                      MIGRATION_STATUS_FAILED);
> -}
> -
>  /**
>   * migration_completion: Used by migration_thread when there's not much left.
>   * The caller 'breaks' the loop when this returns.
> @@ -2856,7 +2806,8 @@ fail:
>          error_free(local_err);
>      }
>
> -    migration_completion_failed(s, current_active_state);
> +    migrate_set_state(&s->state, current_active_state,
> +                      MIGRATION_STATUS_FAILED);
>  }
>
>  /**
> @@ -3286,6 +3237,11 @@ static void migration_iteration_finish(MigrationState *s)
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_CANCELLING:

Pre-existing, but can we even reach here with CANCELLED? If we can start
the VM with both CANCELLED and CANCELLING, that means the
MIG_EVENT_PRECOPY_FAILED notifier is not being consistently called. So I
think CANCELLED here must be unreachable...

> +        /*
> +         * Re-activate the block drives if they're inactivated. Note, COLO
> +         * shouldn't use block_active at all, so it should be no-op there.
> +         */
> +        migration_block_activate(NULL);
>          if (runstate_is_live(s->vm_old_state)) {
>              if (!runstate_check(RUN_STATE_SHUTDOWN)) {
>                  vm_start();
> @@ -3858,6 +3814,8 @@ static void migration_instance_init(Object *obj)
>      ms->state = MIGRATION_STATUS_NONE;
>      ms->mbps = -1;
>      ms->pages_per_second = -1;
> +    /* Freshly started QEMU owns all the block devices */
> +    migration_block_active_setup(true);
>      qemu_sem_init(&ms->pause_sem, 0);
>      qemu_mutex_init(&ms->error_mutex);
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 706b77ffab..969a994a85 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1547,19 +1547,18 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>      }
>
>      if (inactivate_disks) {
> -        /* Inactivate before sending QEMU_VM_EOF so that the
> -         * bdrv_activate_all() on the other end won't fail. */
> -        ret = bdrv_inactivate_all();
> -        if (ret) {
> -            error_setg(&local_err, "%s: bdrv_inactivate_all() failed (%d)",
> -                       __func__, ret);
> +        /*
> +         * Inactivate before sending QEMU_VM_EOF so that the
> +         * bdrv_activate_all() on the other end won't fail.
> +         */
> +        if (!migration_block_inactivate()) {
> +            error_setg(&local_err, "%s: bdrv_inactivate_all() failed",
> +                       __func__);
>              migrate_set_error(ms, local_err);
>              error_report_err(local_err);
> -            qemu_file_set_error(f, ret);
> +            qemu_file_set_error(f, -EFAULT);
>              return ret;
>          }
> -        /* Remember that we did this */
> -        s->block_inactive = true;
>      }
>      if (!in_postcopy) {
>          /* Postcopy stream will still be going */
> @@ -2123,7 +2122,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>
>  static void loadvm_postcopy_handle_run_bh(void *opaque)
>  {
> -    Error *local_err = NULL;
>      MigrationIncomingState *mis = opaque;
>
>      trace_vmstate_downtime_checkpoint("dst-postcopy-bh-enter");
> @@ -2146,12 +2144,11 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
>           * Make sure all file formats throw away their mutable metadata.
>           * If we get an error here, just don't restart the VM yet.
>           */
> -        bdrv_activate_all(&local_err);
> +        bool success = migration_block_activate(NULL);
> +
>          trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated");
> -        if (local_err) {
> -            error_report_err(local_err);
> -            local_err = NULL;
> -        } else {
> +
> +        if (success) {
>              vm_start();
>          }
>      } else {
> @@ -3193,11 +3190,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>       * side of the migration take control of the images.
>       */
>      if (live && !saved_vm_running) {
> -        ret = bdrv_inactivate_all();
> -        if (ret) {
> -            error_setg(errp, "%s: bdrv_inactivate_all() failed (%d)",
> -                       __func__, ret);
> -        }
> +        migration_block_inactivate();
>      }
>  }
>
> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
> index 76f21e8af3..6f76d9beaf 100644
> --- a/monitor/qmp-cmds.c
> +++ b/monitor/qmp-cmds.c
> @@ -31,6 +31,7 @@
>  #include "qapi/type-helpers.h"
>  #include "hw/mem/memory-device.h"
>  #include "hw/intc/intc.h"
> +#include "migration/misc.h"
>
>  NameInfo *qmp_query_name(Error **errp)
>  {
> @@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
>           * Continuing after completed migration. Images have been
>           * inactivated to allow the destination to take control. Need to
>           * get control back now.
> -         *
> -         * If there are no inactive block nodes (e.g. because the VM was
> -         * just paused rather than completing a migration),
> -         * bdrv_inactivate_all() simply doesn't do anything.
>           */
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> +        if (!migration_block_activate(&local_err)) {
>              error_propagate(errp, local_err);

Could use errp directly here.

>              return;
>          }
> diff --git a/migration/meson.build b/migration/meson.build
> index d53cf3417a..dac687ee3a 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -11,6 +11,7 @@ migration_files = files(
>
>  system_ss.add(files(
>    'block-dirty-bitmap.c',
> +  'block-active.c',
>    'channel.c',
>    'channel-block.c',
>    'cpu-throttle.c',
> diff --git a/migration/trace-events b/migration/trace-events
> index bb0e0cc6dc..b82a1c5e40 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -383,3 +383,6 @@ migration_pagecache_insert(void) "Error allocating page"
>  # cpu-throttle.c
>  cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%"
>  cpu_throttle_dirty_sync(void) ""
> +
> +# block-active.c
> +migration_block_activation(const char *name) "%s"
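
To expand on the errp note in qmp_cont() above: since
migration_block_activate() already takes an Error **, the tail could just
pass errp straight through, something like this (untested sketch, not part
of the patch; local_err and the error_propagate() call would then go away):

      /* Let migration_block_activate() fill errp on failure directly */
      if (!migration_block_activate(errp)) {
          return;
      }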