Hello Avihai,

On 10/20/24 15:01, Avihai Horon wrote:
When the VM is shut down vfio_vmstate_change/_prepare() are called to
transition the VFIO device state to STOP. They are called after
migration_shutdown() and thus, by this time, the migration object is
already freed (more specifically, MigrationState->qemu_file_lock is
already destroyed).

In this case, if there is an error in vfio_vmstate_change/_prepare(), it
calls migration_file_set_error() which tries to lock the already
destroyed MigrationState->qemu_file_lock, leading to the following
assert:

   qemu-system-x86_64: ../util/qemu-thread-posix.c:92: qemu_mutex_lock_impl: 
Assertion `mutex->initialized' failed.

Fix this by not setting migration file error in the shut down flow.

Fixes: 20c64c8a51a4 ("migration: migration_file_set_error")
Signed-off-by: Avihai Horon <avih...@nvidia.com>
---
  hw/vfio/migration.c | 31 +++++++++++++++++++++----------
  1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 992dc3b102..1c44b036ea 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -783,6 +783,25 @@ static const SaveVMHandlers savevm_vfio_handlers = {
/* ---------------------------------------------------------------------- */ +static void vfio_vmstate_change_error_report(int ret, Error *err,
+                                             RunState state)
+{
+    if (state == RUN_STATE_SHUTDOWN) {
+        /*
+         * If VM is being shut down, migration object might have already been
+         * freed, so just report the error.
+         */
+        error_report_err(err);
+        return;
+    }
+
+    /*
+     * Migration should be aborted in this case, but vm_state_notify()
+     * currently does not support reporting failures.
+     */
+    migration_file_set_error(ret, err);
+}

This seems correct, but testing the machine's runtime state to
work around the fact that 'current_migration' is not safe to
use reveals a flaw.

In vfio, migration_is_setup_or_active() is unsafe. So are the
calls to vfio_set_migration_error().


This comment seems in contradiction with the migration code :

/* When we add fault tolerance, we could have several
   migrations at once.  For now we don't need to add
   dynamic creation of migration */

Why is 'current_migration' allocated and destroyed today ?

Could we replace it with a singleton and switch the state to
MIGRATION_STATUS_NONE instead of destroying it ?

If not, should 'current_migration' be set to NULL in
migration_shutdown() and its value tested before accessing
to any of its fields, in a thread safe manner if necessary.

Thanks,

C.



Reply via email to