On Fri, Jul 16, 2021 at 11:42:02AM +0000, Thanos Makatos wrote:
> > -----Original Message-----
> > From: Peter Xu <pet...@redhat.com>
> > Sent: 15 July 2021 19:35
> > To: Thanos Makatos <thanos.maka...@nutanix.com>
> > Cc: Paolo Bonzini <pbonz...@redhat.com>; Markus Armbruster
> > <arm...@redhat.com>; QEMU Devel Mailing List <qemu-de...@nongnu.org>;
> > John Levon <john.le...@nutanix.com>; John G Johnson
> > <john.g.john...@oracle.com>
> > Subject: Re: Question on memory commit during MR finalize()
> >
> > On Thu, Jul 15, 2021 at 02:27:48PM +0000, Thanos Makatos wrote:
> > > Hi Peter,
> >
> > Hi, Thanos,
> >
> > > We're hitting this issue using a QEMU branch where JJ is using vfio-user
> > > as the transport for multiprocess-qemu
> > > (https://github.com/oracle/qemu/issues/9). We can reproduce it fairly
> > > reliably by migrating a virtual SPDK NVMe controller (the NVMf/vfio-user
> > > target with experimental migration support,
> > > https://review.spdk.io/gerrit/c/spdk/spdk/+/7617/14). I can provide
> > > detailed repro instructions, but first I want to make sure we're not
> > > missing any patches.
> >
> > I don't think you missed any bug fix patches, as the issue I mentioned
> > could only be triggered with my own branch at that time, and it was fixed
> > when my patchset got merged.
> >
> > However, if you encountered the same issue, it's possible that there's an
> > incorrect use of the QEMU memory/CPU API somewhere there too, so a similar
> > issue is triggered.
> > For example, in my case it was run_on_cpu() called incorrectly in the
> > middle of a memory layout change, so the BQL was released without anyone
> > noticing.
> >
> > I've got a series that tries to expose these hard-to-debug issues:
> >
> > https://lore.kernel.org/qemu-devel/20200421162108.594796-1-peterx@redhat.com/
> >
> > Obviously the series didn't attract enough interest, so it didn't get
> > merged. However, it may also be useful for what you're debugging: you can
> > apply those patches onto your branch and see the stack when it reproduces
> > again. With these sanity patches it could logically fail earlier than what
> > you've hit right now (which I believe should be within the RCU thread; btw
> > it would be interesting to see your stack too when it's hit), and it could
> > provide more useful information.
> >
> > I saw that the old series won't apply onto master any more, so I rebased
> > it and pushed it here (with one patch dropped, since someone wrote a
> > similar patch that got merged, so there are only 7 patches in the new
> > tree):
> >
> > https://github.com/xzpeter/qemu/tree/memory-sanity
> >
> > No guarantee it'll help, but IMHO it's worth trying.
>
> The memory-sanity branch fails to build:
>
> ./configure --prefix=/opt/qemu-xzpeter --target-list=x86_64-linux-user --enable-debug
> make -j 8
> ...
> [697/973] Linking target qemu-x86_64
> FAILED: qemu-x86_64
> c++ -o qemu-x86_64 libcommon.fa.p/cpus-common.c.o
> libcommon.fa.p/page-vary-common.c.o libcommon.fa.p/disas_i386.c.o
> libcommon.fa.p/disas_capstone.c.o libcommon.fa.p/hw_core_cpu-common.c.o
> libcommon.fa.p/ebpf_ebpf_rss-stub.c.o libcommon.fa.p/accel_accel-user.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_user_excp_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_user_seg_helper.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_x86_64_signal.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_x86_64_cpu_loop.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_cpu.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_gdbstub.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_xsave_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_cpu-dump.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_sev-stub.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_kvm_kvm-stub.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_bpt_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_cc_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_excp_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_fpu_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_int_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_mem_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_misc_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_mpx_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_seg_helper.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_tcg-cpu.c.o
> libqemu-x86_64-linux-user.fa.p/target_i386_tcg_translate.c.o
> libqemu-x86_64-linux-user.fa.p/trace_control-target.c.o
> libqemu-x86_64-linux-user.fa.p/cpu.c.o
> libqemu-x86_64-linux-user.fa.p/disas.c.o
> libqemu-x86_64-linux-user.fa.p/gdbstub.c.o
> libqemu-x86_64-linux-user.fa.p/page-vary.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_optimize.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_region.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_tcg.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_tcg-common.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_tcg-op.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_tcg-op-gvec.c.o
> libqemu-x86_64-linux-user.fa.p/tcg_tcg-op-vec.c.o
> libqemu-x86_64-linux-user.fa.p/fpu_softfloat.c.o
> libqemu-x86_64-linux-user.fa.p/accel_accel-common.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_tcg-all.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_cpu-exec-common.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_cpu-exec.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_tcg-runtime-gvec.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_tcg-runtime.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_translate-all.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_translator.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_user-exec.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_user-exec-stub.c.o
> libqemu-x86_64-linux-user.fa.p/accel_tcg_plugin-gen.c.o
> libqemu-x86_64-linux-user.fa.p/accel_stubs_hax-stub.c.o
> libqemu-x86_64-linux-user.fa.p/accel_stubs_xen-stub.c.o
> libqemu-x86_64-linux-user.fa.p/accel_stubs_kvm-stub.c.o
> libqemu-x86_64-linux-user.fa.p/plugins_loader.c.o
> libqemu-x86_64-linux-user.fa.p/plugins_core.c.o
> libqemu-x86_64-linux-user.fa.p/plugins_api.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_elfload.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_exit.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_fd-trans.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_linuxload.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_main.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_mmap.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_safe-syscall.S.o
> libqemu-x86_64-linux-user.fa.p/linux-user_signal.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_strace.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_syscall.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_uaccess.c.o
> libqemu-x86_64-linux-user.fa.p/linux-user_uname.c.o
> libqemu-x86_64-linux-user.fa.p/thunk.c.o
> libqemu-x86_64-linux-user.fa.p/meson-generated_.._x86_64-linux-user-gdbstub-xml.c.o
> libqemu-x86_64-linux-user.fa.p/meson-generated_.._trace_generated-helpers.c.o
> -Wl,--as-needed -Wl,--no-undefined -pie -Wl,--whole-archive libhwcore.fa
> libqom.fa -Wl,--no-whole-archive -Wl,--warn-common -Wl,-z,relro -Wl,-z,now
> -m64 -fstack-protector-strong -Wl,--start-group libcapstone.a libqemuutil.a
> libhwcore.fa libqom.fa -ldl
> -Wl,--dynamic-list=/root/src/qemu/build/qemu-plugins-ld.symbols -lrt -lutil
> -lm -pthread -Wl,--export-dynamic -lgmodule-2.0 -lglib-2.0 -lstdc++
> -Wl,--end-group
> /usr/bin/ld: libcommon.fa.p/cpus-common.c.o: in function `do_run_on_cpu':
> /root/src/qemu/build/../cpus-common.c:153: undefined reference to
> `qemu_cond_wait_iothread'
> collect2: error: ld returned 1 exit status
> [698/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_ui64_r_minMag.c.o
> [699/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_i32_r_minMag.c.o
> [700/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_f16.c.o
> [701/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_f64.c.o
> [702/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_i64_r_minMag.c.o
> [703/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_extF80M.c.o
> [704/973] Compiling C object tests/fp/libsoftfloat.a.p/berkeley-softfloat-3_source_f32_to_extF80.c.o
> ninja: build stopped: subcommand failed.
> make[1]: *** [Makefile:154: run-ninja] Error 1
> make[1]: Leaving directory '/root/src/qemu/build'
> make: *** [GNUmakefile:11: all] Error 2
So it fails for linux-user... I can fix the compilation, but it should pass
for x86_64-softmmu.

More importantly: are you using linux-user binaries? My branch will only
help debug BQL-related issues, and linux-user shouldn't be using the BQL at
all, so if that's the case please ignore the branch; it won't help.

>
> Regarding the stack trace, I can very easily reproduce it on our branch; I
> know exactly where to set the breakpoint:
>
> (gdb) r
> Starting pr[...]u host -enable-kvm -smp 4 -nographic -m 2G -object
> memory-backend-file,id=mem0,size=2G,mem-path=/dev/hugepages,share=on,prealloc=yes,
> -numa node,memdev=mem0 [...]
> Thread 8 "qemu-system-x86" received signal SIGUSR1, User defined signal 1.
> #0 0x00007ffff758f7bb in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007ffff757a535 in __GI_abort () at abort.c:79
> #2 0x0000555555c9301e in kvm_set_phys_mem (kml=0x5555568ee830, section=0x7ffff58c05e0, add=true) at ../accel/kvm/kvm-all.c:1194
> #3 0x0000555555c930cd in kvm_region_add (listener=0x5555568ee830, section=0x7ffff58c05e0) at ../accel/kvm/kvm-all.c:1211
> #4 0x0000555555bd6c9e in address_space_update_topology_pass (as=0x555556648420 <address_space_memory>, old_view=0x555556f21730, new_view=0x7ffff0001cb0, adding=true) at ../softmmu/memory.c:971
> #5 0x0000555555bd6f98 in address_space_set_flatview (as=0x555556648420 <address_space_memory>) at ../softmmu/memory.c:1047
> #6 0x0000555555bd713f in memory_region_transaction_commit () at ../softmmu/memory.c:1099
> #7 0x0000555555bd89a5 in memory_region_finalize (obj=0x555556e21800) at ../softmmu/memory.c:1751
> #8 0x0000555555cca132 in object_deinit (obj=0x555556e21800, type=0x5555566a8f80) at ../qom/object.c:673
> #9 0x0000555555cca1a4 in object_finalize (data=0x555556e21800) at ../qom/object.c:687
> #10 0x0000555555ccb196 in object_unref (objptr=0x555556e21800) at ../qom/object.c:1186
> #11 0x0000555555bb11f0 in phys_section_destroy (mr=0x555556e21800) at ../softmmu/physmem.c:1171
> #12 0x0000555555bb124a in phys_sections_free (map=0x5555572cf9a0) at ../softmmu/physmem.c:1180
> #13 0x0000555555bb4632 in address_space_dispatch_free (d=0x5555572cf990) at ../softmmu/physmem.c:2562
> #14 0x0000555555bd4485 in flatview_destroy (view=0x5555572cf950) at ../softmmu/memory.c:291
> #15 0x0000555555e367e8 in call_rcu_thread (opaque=0x0) at ../util/rcu.c:281
> #16 0x0000555555e68e57 in qemu_thread_start (args=0x555556665e30) at ../util/qemu-thread-posix.c:521
> #17 0x00007ffff7720fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> [...]lot=10, start=0xfebd1000, size=0x1000: File exists
> #18 0x00007ffff76514cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Yes, it indeed looks alike.

-- 
Peter Xu
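For readers less familiar with the failure mode described above (the BQL being released in the middle of a memory layout change), here is a rough, hypothetical C sketch of the kind of invariant such sanity checks can enforce. It is not code from the memory-sanity branch, and none of the names are QEMU APIs: a plain pthread mutex stands in for the BQL, a thread-local flag records whether the calling thread holds it, and a counter tracks open memory transactions. `wait_dropping_lock()` is a made-up helper that drops and re-takes the lock while waiting, roughly like a condition-variable wait on the big lock, which is the shape of the misplaced run_on_cpu() call Peter mentions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical sketch only: a pthread mutex plays the role of the BQL,
 * and none of these names exist in QEMU.
 */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local bool big_lock_held;  /* does *this* thread hold it? */
static unsigned txn_depth;                /* open layout transactions */
static unsigned violations;               /* failed sanity checks so far */

/* Report a broken invariant; a debug build might abort() here instead. */
static void sanity_check(bool ok, const char *what)
{
    if (!ok) {
        violations++;
        fprintf(stderr, "sanity check failed: %s\n", what);
    }
}

static void big_lock_acquire(void)
{
    pthread_mutex_lock(&big_lock);
    big_lock_held = true;
}

static void big_lock_release(void)
{
    /* Dropping the lock while a transaction is open lets other threads
     * observe, or modify, a half-updated memory layout. */
    sanity_check(txn_depth == 0, "lock released inside a memory transaction");
    big_lock_held = false;
    pthread_mutex_unlock(&big_lock);
}

static void txn_begin(void)
{
    sanity_check(big_lock_held, "transaction begun without the lock");
    txn_depth++;
}

static void txn_commit(void)
{
    sanity_check(big_lock_held, "transaction committed without the lock");
    assert(txn_depth > 0);  /* every commit needs a matching begin */
    txn_depth--;
}

/* Hypothetical helper: releases and re-takes the lock while "waiting".
 * Calling it between txn_begin() and txn_commit() is the bug pattern. */
static void wait_dropping_lock(void)
{
    big_lock_release();
    /* ...another thread could take the lock and change the layout here... */
    big_lock_acquire();
}
```

In use, a correctly locked begin/commit pair reports nothing, while calling `wait_dropping_lock()` between `txn_begin()` and `txn_commit()` is reported once, at the moment the lock is dropped rather than later when something else trips over the inconsistent state. Catching the problem at that earlier point is the general motivation for checks of this kind.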