** Merge proposal linked: https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500070
** Merge proposal linked: https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500071
** Merge proposal linked: https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500072

--
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/2126951

Title:
  `block-stream` segfault with concurrent `query-named-block-nodes`

Status in QEMU: Fix Released
Status in qemu package in Ubuntu: In Progress
Status in qemu source package in Jammy: In Progress
Status in qemu source package in Noble: In Progress
Status in qemu source package in Plucky: Won't Fix
Status in qemu source package in Questing: In Progress
Status in qemu source package in Resolute: In Progress

Bug description:

[ Impact ]

When running `block-stream` and `query-named-block-nodes` concurrently, a null-pointer dereference causes QEMU to segfault.

The original reporter of this issue experienced the bug while performing concurrent libvirt `virDomainBlockPull` calls on the same VM against different disks. The race condition occurs at the end of the `block-stream` QMP command: libvirt's handler for a completed `block-stream` (`qemuBlockJobProcessEventCompletedPull` [1]) calls `query-named-block-nodes` (see the "libvirt trace" section below for a full trace).

This occurs in every version of QEMU shipped with Ubuntu, 22.04 through 25.10.

[1] qemuBlockJobProcessEventCompletedPull

[ Test Plan ]

```
sudo apt install libvirt-daemon-system virtinst
```

In `query-named-block-nodes.sh`:

```sh
#!/bin/bash
while true; do
    virsh qemu-monitor-command "$1" query-named-block-nodes > /dev/null
done
```

In `blockrebase-crash.sh`:

```sh
#!/bin/bash
set -ex

domain="$1"
if [ -z "${domain}" ]; then
    echo "Missing domain name"
    exit 1
fi

./query-named-block-nodes.sh "${domain}" &
query_pid=$!
while [ -n "$(virsh list --uuid)" ]; do
    snap="snap0-$(uuidgen)"
    virsh snapshot-create-as "${domain}" \
        --name "${snap}" \
        --disk-only file= \
        --diskspec vda,snapshot=no \
        --diskspec "vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_${snap}.qcow2" \
        --atomic \
        --no-metadata
    virsh blockpull "${domain}" vdb
    while bjr=$(virsh blockjob "$domain" vdb); do
        if [[ "$bjr" == *"No current block job for"* ]]; then
            break
        fi
    done
done
kill "${query_pid}"
```

`provision.sh` (`Ctrl + ]` to detach after boot):

```sh
#!/bin/bash
set -ex

wget https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img
sudo cp noble-server-cloudimg-amd64.img /var/lib/libvirt/images/n0-root.qcow2
sudo qemu-img create -f qcow2 /var/lib/libvirt/images/n0-blk0.qcow2 10G

touch network-config
touch meta-data
touch user-data

virt-install \
    -n n0 \
    --description "Test noble minimal" \
    --os-variant=ubuntu24.04 \
    --ram=1024 --vcpus=2 \
    --import \
    --disk path=/var/lib/libvirt/images/n0-root.qcow2,bus=virtio,cache=writethrough,size=10 \
    --disk path=/var/lib/libvirt/images/n0-blk0.qcow2,bus=virtio,cache=writethrough,size=10 \
    --graphics none \
    --network network=default \
    --cloud-init user-data="user-data,meta-data=meta-data,network-config=network-config"
```

And run the scripts to cause the crash (you may need to manually kill `query-named-block-nodes.sh` afterwards):

```sh
chmod 755 provision.sh blockrebase-crash.sh query-named-block-nodes.sh
./provision.sh
./blockrebase-crash.sh n0
```

Expected behavior: `blockrebase-crash.sh` runs until "No space left on device".

Actual behavior: QEMU crashes after a few iterations:

```
Block Pull: [81.05 %]+ bjr=
+ [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
++ virsh blockjob n0 vdb
Block Pull: [97.87 %]+ bjr=
+ [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
++ virsh blockjob n0 vdb
error: Unable to read from monitor: Connection reset by peer
error: Unable to read from monitor: Connection reset by peer
+ bjr=
++ virsh list --uuid
+ '[' -n 4eed8ba4-300b-4488-a520-510e5b544f57 ']'
++ uuidgen
+ snap=snap0-88be23e5-696c-445d-870a-abe5f7df56c0
+ virsh snapshot-create-as n0 --name snap0-88be23e5-696c-445d-870a-abe5f7df56c0 --disk-only file= --diskspec vda,snapshot=no --diskspec vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_snap0-88be23e5-696c-445d-870a-abe5f7df56c0.qcow2 --atomic --no-metadata
error: Requested operation is not valid: domain is not running
Domain snapshot snap0-88be23e5-696c-445d-870a-abe5f7df56c0 created
+ virsh blockpull n0 vdb
error: Requested operation is not valid: domain is not running
error: Requested operation is not valid: domain is not running
wesley@nv0:~$ error: Requested operation is not valid: domain is not running
```

[ Where problems could occur ]

The only codepaths affected by this change are `block-stream` and `blockdev-backup` [1][2]. If the code is somehow broken, we would expect to see failures when executing these QMP commands (or the libvirt APIs that use them, `virDomainBlockPull` and `virDomainBackupBegin` [3][4]). As noted in the upstream commit message, the change does cause an additional flush to occur during `blockdev-backup` QMP commands.

The patch that was ultimately merged upstream was a revert of most of [5]. _That_ patch was a workaround for a blockdev permissions issue that was later resolved in [6] (see the end of [7] and replies for the upstream discussion). Both [5] and [6] are present in QEMU 6.2.0, so the assumptions that led to the upstream solution hold for Jammy.
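For a quick regression sanity check, the affected commands can also be issued directly over QMP rather than through libvirt. A minimal sketch of how the two wire-level commands could be built (the node and job names here are made up for illustration; only `query-named-block-nodes`, its `flat` argument, and `block-stream` with `device`/`job-id` come from the QMP reference):

```python
import json

def qmp_cmd(name, **arguments):
    """Serialize a QMP command; QMP expects {"execute": ..., "arguments": {...}}."""
    cmd = {"execute": name}
    if arguments:
        # QMP argument names use dashes; map job_id -> job-id for convenience.
        cmd["arguments"] = {k.replace("_", "-"): v for k, v in arguments.items()}
    return json.dumps(cmd)

# The two commands whose interaction triggered the original crash:
print(qmp_cmd("query-named-block-nodes", flat=True))
print(qmp_cmd("block-stream", device="vdb-node", job_id="pull-vdb"))
```

Feeding these to the monitor (e.g. via `virsh qemu-monitor-command` as in the test plan, or a QMP socket) exercises exactly the codepaths touched by the patch.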
[1] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.block-stream
[2] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.blockdev-backup
[3] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockPull
[4] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
[5] https://gitlab.com/qemu-project/qemu/-/commit/3108a15cf09
[6] https://gitlab.com/qemu-project/qemu/-/commit/3860c0201924d
[7] https://lists.gnu.org/archive/html/qemu-devel/2025-10/msg06800.html

[ Other info ]

Backtrace from the coredump (source at [1]):

```
#0  bdrv_refresh_filename (bs=0x5efed72f8350) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
#1  0x00005efea73cf9dc in bdrv_block_device_info (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8) at block/qapi.c:62
#2  0x00005efea7391ed3 in bdrv_named_nodes_list (flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
#3  0x00005efea7471993 in qmp_query_named_block_nodes (has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
#4  qmp_marshal_query_named_block_nodes (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8) at qapi/qapi-commands-block-core.c:553
#5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0) at qapi/qmp-dispatch.c:128
#6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430) at util/async.c:219
#7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430) at util/aio-posix.c:436
#8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:361
#9  0x00007f2b77809bfb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#10 0x00007f2b77809e70 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
#12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
#13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
#14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
#15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0) at system/main.c:50
#16 0x00005efea6e76319 in main (argc=<optimized out>, argv=<optimized out>) at system/main.c:93
```

The libvirt logs suggest that the crash occurs right at the end of the blockjob, since the job reaches the "concluded" state before the crash. I initially assumed the cause was one of:

- `stream_clean` frees/modifies the `cor_filter_bs` without holding a lock that it needs [2][3]
- `bdrv_refresh_filename` needs to handle the possibility that the QLIST of children of a filter bs could be NULL [1]

Ultimately the fix was neither of these [4]; `bdrv_refresh_filename` should never be able to observe a NULL list of children.

`query-named-block-nodes` iterates the global list of block nodes, `graph_bdrv_states` [5]. The offending block node (the `cor_filter_bs` added during a `block-stream`) was removed from the list of block nodes _for the disk_ when the operation finished, but not removed from the global list of block nodes until later; that interval is the window for the race. The patch keeps the block node in the disk's list until it is dropped at the end of the blockjob.
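The shape of the race can be illustrated with a toy model (plain Python, not QEMU code; the names are borrowed from the report purely for illustration, and a `TypeError` stands in for the segfault):

```python
class Node:
    """Stand-in for a BlockDriverState."""
    def __init__(self, name):
        self.name = name
        self.children = ["base"]   # stands in for the children QLIST

graph_bdrv_states = []             # global list, iterated by the query
disk_chain = []                    # per-disk list of nodes

def add_filter(node):
    disk_chain.append(node)
    graph_bdrv_states.append(node)

def finish_stream_buggy(node):
    # Buggy ordering: the node leaves the disk's chain (and its state is
    # torn down) first; it only leaves the global list "later".
    disk_chain.remove(node)
    node.children = None           # torn-down state
    # ... race window: graph_bdrv_states still contains `node` ...

def query_named_block_nodes():
    # Blows up if it observes a node whose state was already torn down.
    return [(n.name, len(n.children)) for n in graph_bdrv_states]

cor = Node("cor_filter_bs")
add_filter(cor)
finish_stream_buggy(cor)
try:
    query_named_block_nodes()      # a concurrent query lands in the window
    crashed = False
except TypeError:
    crashed = True
print("crashed:", crashed)
```

The upstream fix corresponds to keeping the node in `disk_chain` (with its state intact) until the moment it is also dropped from the global list, so no iteration of `graph_bdrv_states` can observe the half-removed node.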
[1] https://git.launchpad.net/ubuntu/+source/qemu/tree/block.c?h=ubuntu/questing-devel#n8071
[2] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n131
[3] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n340
[4] https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
[5] https://gitlab.com/qemu-project/qemu/-/blob/v10.1.0/block.c?ref_type=tags#L72

[ libvirt trace ]

`qemuBlockJobProcessEventCompletedPull` [1]
`qemuBlockJobProcessEventCompletedPullBitmaps` [2]
`qemuBlockGetNamedNodeData` [3]
`qemuMonitorBlockGetNamedNodeData` [4]
`qemuMonitorJSONBlockGetNamedNodeData` [5]
`qemuMonitorJSONQueryNamedBlockNodes` [6]

[1] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
[2] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n807
[3] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_block.c?h=applied/ubuntu/questing-devel#n2925
[4] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor.c?h=applied/ubuntu/questing-devel#n2039
[5] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2816
[6] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2159

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2126951/+subscriptions
