On 15/12/2025 17.12, Peter Xu wrote:
On Mon, Dec 15, 2025 at 04:28:39PM +0100, Kevin Wolf wrote:
Am 15.12.2025 um 16:11 hat Thomas Huth geschrieben:
On 15/12/2025 14.42, Kevin Wolf wrote:
Am 12.12.2025 um 22:26 hat Fabiano Rosas geschrieben:
Thomas Huth <[email protected]> writes:

On 08/12/2025 16.26, Fabiano Rosas wrote:
Stefan Hajnoczi <[email protected]> writes:

On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
From: Thomas Huth <[email protected]>

When shutting down a guest that is currently in progress of being
migrated, there is a chance that QEMU might crash during bdrv_delete().
The backtrace looks like this:

    Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.

    [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
    0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at 
../../devel/qemu/block.c:5560
    5560                QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
    (gdb) bt
    #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at 
../../devel/qemu/block.c:5560
    #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
    Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0


How does the migration thread reaches here? Is this from
migration_block_inactivate()?

Unfortunately, gdb was not very helpful here (claiming that it cannot access
the memory and stack anymore), so I had to do some printf debugging. This is
what seems to happen:

Main thread: qemu_cleanup() calls  migration_shutdown() -->
migration_cancel() which signals the migration thread to cancel the migration.

Migration thread: migration_thread() got kicked out the loop and calls
migration_iteration_finish(), which tries to get the BQL via bql_lock() but
that is currently held by another thread, so the migration thread is blocked
here.

Main thread: qemu_cleanup() advances to bdrv_close_all() that uses
blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name
'libvirt-1-storage' gets deleted via bdrv_delete() that way.


Has qmp_blockdev_del() ever been called to remove the BDS from the
monitor_bdrv_states list? Otherwise your debugging seems to indicate
blockdev_close_all_bdrv_states() is dropping the last reference to bs,
but it's still accessible from bdrv_next() via
bdrv_next_monitor_owned().

The reference that blockdev_close_all_bdrv_states() drops is the monitor
reference. So is this the right fix (completely untested, but matches
what qmp_blockdev_del() does)?

Kevin

diff --git a/blockdev.c b/blockdev.c
index dbd1d4d3e80..6e86c6262f9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -686,6 +686,7 @@ void blockdev_close_all_bdrv_states(void)

       GLOBAL_STATE_CODE();
       QTAILQ_FOREACH_SAFE(bs, &monitor_bdrv_states, monitor_list, next_bs) {
+        QTAILQ_REMOVE(&monitor_bdrv_states, bs, monitor_list);
           bdrv_unref(bs);
       }
   }

Thanks a lot, Kevin! This looks like the right fix for me - I gave it
a try and it fixes the crash indeed!

Good. I think something like your patch would still be good for 11.0.
Having undefined order in shutdown is just asking for trouble. So it
would be good if we could be sure that migration is out of the way when
migration_shutdown() returns.

I sent the above as a proper patch to fix the immediate problem for
10.2.

Thanks all!

Does that completely fix the problem?

Yes, Kevin's patch fixes the crash!

 If so, IMHO we don't need the
migration change anymore until later.  As replied in the other email, it
was at least intentional a few years ago (when introducing
migration_shutdown()) to not join() the migration thread here.

Fine for me.

 Thomas


Reply via email to