On Wed, Oct 22, 2025 at 1:57 AM Vladimir Sementsov-Ogievskiy
<[email protected]> wrote:
>
> If my assumptions match, they should be mentioned in commit message.
This is my understanding of the problem. I'll try to be more clear in
the next submission.
> Ok, so, the commit make a work-around: avoid listing the invalid node
> in query-named-block-nodes by removing it earlier (after the commit,
> the node will be removed on final bdrv_unref() in bdrv_cor_filter_drop().
>
> This doesn't seem safe: what if someone else have another reference on
> the filter node? Or will have in future? The bug will raise again.
>
> More correct would be fix the core problem: bdrv_refresh_filename()
> crashes for detached implicit filters.
>
> Could you check, will something like this fix the reproducer:
I tested this patch last week and again today:
diff --git a/block.c b/block.c
index 8848e9a7ed..a0246a65a5 100644
--- a/block.c
+++ b/block.c
@@ -8067,6 +8067,12 @@ void bdrv_refresh_filename(BlockDriverState *bs)
}
if (bs->implicit) {
+ /* Avoid races where an implicit node is removed from the graph but
+ * not yet unrefed/closed */
+ if (QLIST_EMPTY(&bs->children)) {
+ return;
+ }
+
/* For implicit nodes, just copy everything from the single child */
child = QLIST_FIRST(&bs->children);
assert(QLIST_NEXT(child, next) == NULL);
This resolves the first segfault, but it produces another one in
bdrv_co_get_allocated_file_size. Here's the trace from
query-named-block-nodes:
qmp_query_named_block_nodes
bdrv_named_nodes_list
bdrv_block_device_info (block/qapi.c L44)
bdrv_query_image_info (block/qapi.c L336)
bdrv_do_query_node_info (block/qapi.c L238)
bdrv_get_allocated_file_size (include/block/block-io.h L90)
bdrv_co_get_allocated_file_size (block.c L6011) (called through
co_wrapper_bdrv_lock)
And the bt from gdb:
#0 bdrv_co_get_allocated_file_size (bs=<optimized out>)
at /usr/src/qemu-1:10.1.0+ds-5ubuntu2+test8/b/qemu/block.c:6018
#1 0x0000631d078522be in bdrv_co_get_allocated_file_size_entry
(opaque=0x7ffd375c5dd0) at block/block-gen.c:288
#2 0x0000631d07929ec2 in coroutine_trampoline
(i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:175
#3 0x0000772778862f60 in ?? ()
at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66 from
target:/lib/x86_64-linux-gnu/libc.so.6
#4 0x00007ffd375c5ae0 in ?? ()
#5 0x0000000000000000 in ?? ()
I need to spend a little more time getting this reproduced properly
using a debug build; it's not clear to me (yet) what is actually
causing the segfault here. I'll work on putting together another patch
to fix this too.
> Other ideas:
>
> Maybe, when removing the filter child, we should set its bs->drv to NULL,
> as filter node becomes nonfunctional.
Based on a quick look through of the code, this would cause
query-named-block-nodes to either return nothing or fail outright; it doesn't
seem like the right solution.
> Or, set bs->implicit to false, which will activate generic path in
> bdrv_refresh_filename (I'm not sure it will work good for the detached
> filter node)
This will run into the same second segfault as above; it also feels like
a workaround instead of a real fix :/
> Hmm, I also think that mentioned
> bdc4c4c5e372756a "backup: Remove nodes from job in .clean()" is
> at least incomplete.
>
> In block-layer, having reference counters in bs, we can't rely on
> "hmm, let's just drop this one reference, and hope, that after next
> function, the node will be completely removed".
>
> So, we either need a guarantee, that filter node is completely removed
> when we detach it (OK, detaching should just fail if it doesn't lead
> to complete removal), or we should be prepared to existing detached
> filters, and handle them correctly.
Thanks, this is helpful. I originally wrote off this approach due to the second
segfault but I'll revisit it. I think I'll submit a new series instead
of a v3 of this
patch unless you'd prefer a v3.
Thanks for your help!
~Wesley