** Merge proposal linked:
   
https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500070

** Merge proposal linked:
   
https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500071

** Merge proposal linked:
   
https://code.launchpad.net/~whershberger/ubuntu/+source/qemu/+git/qemu/+merge/500072

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/2126951

Title:
  `block-stream` segfault with concurrent `query-named-block-nodes`

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  In Progress
Status in qemu source package in Jammy:
  In Progress
Status in qemu source package in Noble:
  In Progress
Status in qemu source package in Plucky:
  Won't Fix
Status in qemu source package in Questing:
  In Progress
Status in qemu source package in Resolute:
  In Progress

Bug description:
  [ Impact ]

  When running `block-stream` and `query-named-block-nodes`
  concurrently, a null-pointer dereference causes QEMU to segfault.

  The original reporter of this issue hit the bug while performing
  concurrent libvirt `virDomainBlockPull` calls on the same VM against
  different disks. The race occurs at the end of the `block-stream` QMP
  command: libvirt's handler for a completed `block-stream`
  (`qemuBlockJobProcessEventCompletedPull` [1]) calls
  `query-named-block-nodes` (see "libvirt trace" below for the full call
  chain).
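
  The failure mode can be pictured with a deliberately simplified C
  sketch. The types and names here (`BlockNode`, `refresh_filename`) are
  hypothetical illustrations, not QEMU source: a filter node's child
  link is cleared at job completion while a concurrent query can still
  reach the node.

  ```c
  #include <stdio.h>

  /* Hypothetical miniature of the failure mode, not QEMU source: a
   * filter node whose child link is torn down at job completion while a
   * concurrent query may still visit the node. */
  typedef struct BlockNode {
      const char *filename;
      struct BlockNode *child;  /* filtered child; NULL mid-teardown */
  } BlockNode;

  /* Analogue of deriving a filter's filename from its filtered child.
   * Without the NULL guard, this is the kind of dereference that
   * segfaults. */
  static const char *refresh_filename(const BlockNode *bs)
  {
      if (!bs->child) {
          return NULL;
      }
      return bs->child->filename;
  }

  int main(void)
  {
      BlockNode base = { "base.qcow2", NULL };
      BlockNode cor = { NULL, &base };

      printf("during stream: %s\n", refresh_filename(&cor));

      /* block-stream completion tears down the child link; a query
       * arriving in this window sees child == NULL. */
      cor.child = NULL;
      const char *name = refresh_filename(&cor);
      printf("after completion: %s\n", name ? name : "(null child)");
      return 0;
  }
  ```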

  This occurs in every version of QEMU shipped with Ubuntu, from 22.04
  through 25.10.

  [1] qemuBlockJobProcessEventCompletedPull

  [ Test Plan ]

  Install the test dependencies on the host:
  ```sh
  sudo apt install libvirt-daemon-system virtinst
  ```

  In `query-named-block-nodes.sh`:
  ```sh
  #!/bin/bash

  while true; do
      virsh qemu-monitor-command "$1" query-named-block-nodes > /dev/null
  done
  ```

  In `blockrebase-crash.sh`:
  ```sh
  #!/bin/bash

  set -ex

  domain="$1"

  if [ -z "${domain}" ]; then
      echo "Missing domain name"
      exit 1
  fi

  ./query-named-block-nodes.sh "${domain}" &
  query_pid=$!

  while [ -n "$(virsh list --uuid)" ]; do
      snap="snap0-$(uuidgen)"

      virsh snapshot-create-as "${domain}" \
          --name "${snap}" \
          --disk-only file= \
          --diskspec vda,snapshot=no \
          --diskspec "vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_${snap}.qcow2" \
          --atomic \
          --no-metadata

      virsh blockpull "${domain}" vdb

      while bjr=$(virsh blockjob "$domain" vdb); do
          if [[ "$bjr" == *"No current block job for"* ]] ; then
              break;
          fi;
      done;
  done

  kill "${query_pid}"
  ```

  `provision.sh` (`Ctrl + ]` after boot):
  ```sh
  #!/bin/bash

  set -ex

  wget https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img

  sudo cp noble-server-cloudimg-amd64.img /var/lib/libvirt/images/n0-root.qcow2
  sudo qemu-img create -f qcow2 /var/lib/libvirt/images/n0-blk0.qcow2 10G

  touch network-config
  touch meta-data
  touch user-data

  virt-install \
    -n n0 \
    --description "Test noble minimal" \
    --os-variant=ubuntu24.04 \
    --ram=1024 --vcpus=2 \
    --import \
    --disk path=/var/lib/libvirt/images/n0-root.qcow2,bus=virtio,cache=writethrough,size=10 \
    --disk path=/var/lib/libvirt/images/n0-blk0.qcow2,bus=virtio,cache=writethrough,size=10 \
    --graphics none \
    --network network=default \
    --cloud-init user-data="user-data,meta-data=meta-data,network-config=network-config"
  ```

  Then run the scripts to cause the crash (you may need to kill
  `query-named-block-nodes.sh` manually):
  ```sh
  chmod 755 provision.sh blockrebase-crash.sh query-named-block-nodes.sh
  ./provision.sh
  ./blockrebase-crash.sh n0
  ```

  Expected behavior: `blockrebase-crash.sh` runs until "No space left on
  device"

  Actual behavior: QEMU crashes after a few iterations:
  ```
  Block Pull: [81.05 %]+ bjr=
  + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
  ++ virsh blockjob n0 vdb
  Block Pull: [97.87 %]+ bjr=
  + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
  ++ virsh blockjob n0 vdb
  error: Unable to read from monitor: Connection reset by peer
  error: Unable to read from monitor: Connection reset by peer
  + bjr=
  ++ virsh list --uuid
  + '[' -n 4eed8ba4-300b-4488-a520-510e5b544f57 ']'
  ++ uuidgen
  + snap=snap0-88be23e5-696c-445d-870a-abe5f7df56c0
  + virsh snapshot-create-as n0 --name snap0-88be23e5-696c-445d-870a-abe5f7df56c0 --disk-only file= --diskspec vda,snapshot=no --diskspec vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_snap0-88be23e5-696c-445d-870a-abe5f7df56c0.qcow2 --atomic --no-metadata
  error: Requested operation is not valid: domain is not running
  Domain snapshot snap0-88be23e5-696c-445d-870a-abe5f7df56c0 created
  + virsh blockpull n0 vdb
  error: Requested operation is not valid: domain is not running
  error: Requested operation is not valid: domain is not running

  wesley@nv0:~$ error: Requested operation is not valid: domain is not running
  ```

  [ Where problems could occur ]

  The only codepaths affected by this change are `block-stream` and
  `blockdev-backup` [1][2]. If the code is somehow broken, we would
  expect to see failures when executing these QMP commands (or the
  libvirt APIs that use them, `virDomainBlockPull` and
  `virDomainBackupBegin` [3][4]).

  As noted in the upstream commit message, the change does cause an
  additional flush during `blockdev-backup` commands.

  The patch that was ultimately merged upstream was a revert of most of
  [5]. _That_ patch was a workaround for a blockdev permissions issue
  that was later resolved in [6] (see the end of [7] and replies for
  upstream discussion). Both [5] and [6] are present in QEMU 6.2.0, so
  the assumptions that led us to the upstream solution hold for Jammy.

  [1] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.block-stream
  [2] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.blockdev-backup
  [3] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockPull
  [4] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
  [5] https://gitlab.com/qemu-project/qemu/-/commit/3108a15cf09
  [6] https://gitlab.com/qemu-project/qemu/-/commit/3860c0201924d
  [7] https://lists.gnu.org/archive/html/qemu-devel/2025-10/msg06800.html

  [ Other info ]

  Backtrace from the coredump (source at [1]):
  ```
  #0  bdrv_refresh_filename (bs=0x5efed72f8350) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
  #1  0x00005efea73cf9dc in bdrv_block_device_info (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8) at block/qapi.c:62
  #2  0x00005efea7391ed3 in bdrv_named_nodes_list (flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
  #3  0x00005efea7471993 in qmp_query_named_block_nodes (has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
  #4  qmp_marshal_query_named_block_nodes (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8) at qapi/qapi-commands-block-core.c:553
  #5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0) at qapi/qmp-dispatch.c:128
  #6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430) at util/async.c:219
  #7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430) at util/aio-posix.c:436
  #8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:361
  #9  0x00007f2b77809bfb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #10 0x00007f2b77809e70 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
  #12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
  #13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
  #14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
  #15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0) at system/main.c:50
  #16 0x00005efea6e76319 in main (argc=<optimized out>, argv=<optimized out>) at system/main.c:93
  ```

  The libvirt logs suggest that the crash occurs right at the end of the
  blockjob, since it reaches the "concluded" state before crashing. I
  initially assumed the cause was one of:
  - `stream_clean` freeing or modifying the `cor_filter_bs` without
    holding a lock that it needs [2][3]
  - `bdrv_refresh_filename` failing to handle a filter bs whose QLIST of
    children is NULL [1]

  Ultimately the fix was neither of these [4]; `bdrv_refresh_filename`
  should not be able to observe a NULL list of children.

  `query-named-block-nodes` iterates the global list of block nodes
  `graph_bdrv_states` [5]. The offending block node (the
  `cor_filter_bs`, added during a `block-stream`) was removed from the
  list of block nodes _for the disk_ when the operation finished, but
  not removed from the global list of block nodes until later (this is
  the window for the race). The patch keeps the block node in the disk's
  list until it is dropped at the end of the blockjob.
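
  The two-list window described above can be sketched in miniature C.
  The structs are illustrative only (the global `graph_bdrv_states` list
  and the per-disk chain are modeled as a linked list plus a flag);
  nothing here is QEMU code.

  ```c
  #include <stdio.h>

  /* Illustrative model only, not QEMU code: each node sits on a global
   * list (the graph_bdrv_states analogue) and may or may not still be
   * attached to its disk's own chain. */
  typedef struct Node {
      const char *name;
      struct Node *global_next; /* link in the global node list */
      int attached;             /* still wired into the disk's chain? */
  } Node;

  /* A query that walks the global list can still observe nodes the
   * disk has already dropped -- this is the race window. */
  static int count_dangling(const Node *head)
  {
      int n = 0;
      for (const Node *p = head; p; p = p->global_next) {
          if (!p->attached) {
              printf("query observed dangling node: %s\n", p->name);
              n++;
          }
      }
      return n;
  }

  int main(void)
  {
      Node cor = { "cor-filter", NULL, 1 };
      Node top = { "top", &cor, 1 };

      /* block-stream completion: the filter leaves the disk's chain
       * first, but stays on the global list until later... */
      cor.attached = 0;

      /* ...so a query-named-block-nodes in this window still visits it.
       * The upstream fix keeps the node in the disk's list until it is
       * dropped from both places at the end of the blockjob. */
      return count_dangling(&top) == 1 ? 0 : 1;
  }
  ```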

  [1] https://git.launchpad.net/ubuntu/+source/qemu/tree/block.c?h=ubuntu/questing-devel#n8071
  [2] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n131
  [3] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n340
  [4] https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
  [5] https://gitlab.com/qemu-project/qemu/-/blob/v10.1.0/block.c?ref_type=tags#L72

  [ libvirt trace ]
  `qemuBlockJobProcessEventCompletedPull` [1]
  `qemuBlockJobProcessEventCompletedPullBitmaps` [2]
  `qemuBlockGetNamedNodeData` [3]
  `qemuMonitorBlockGetNamedNodeData` [4]
  `qemuMonitorJSONBlockGetNamedNodeData` [5]
  `qemuMonitorJSONQueryNamedBlockNodes` [6]

  [1] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
  [2] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n807
  [3] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_block.c?h=applied/ubuntu/questing-devel#n2925
  [4] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor.c?h=applied/ubuntu/questing-devel#n2039
  [5] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2816
  [6] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2159

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2126951/+subscriptions

