Verification successful with the kernel in xenial-proposed. Updating verification tags.
Details ======= Guest: --- root@mfo-sf194614:~# apt-cache madison linux-image-4.4.0-139-generic linux-image-4.4.0-139-generic | 4.4.0-139.165 | http://archive.ubuntu.com/ubuntu/ xenial-proposed/main amd64 Packages root@mfo-sf194614:~# apt-get -y install linux-image-4.4.0-139-generic root@mfo-sf194614:~# reboot root@mfo-sf194614:~# uname -a Linux mfo-sf194614 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux root@mfo-sf194614:~# ls -ld /sys/block/sda lrwxrwxrwx 1 root root 0 Oct 25 13:51 /sys/block/sda -> ../devices/pci0000:00/0000:00:07.0/virtio4/host2/target2:0:0/2:0:0:0/block/sda root@mfo-sf194614:~# lspci -s 7.0 00:07.0 SCSI storage controller: Red Hat, Inc Virtio SCSI root@mfo-sf194614:~# echo 9 > /proc/sys/kernel/printk root@mfo-sf194614:~# scsi_logging_level --set --mlqueue=4 --mlcomplete=2 root@mfo-sf194614:~# dd if=/dev/sda of=/dev/null count=1 [ 139.412030] sd 2:0:0:0: [sda] tag#1 Send: scmd 0xffff880036e614f0 [ 139.415541] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [ 139.418561] sd 2:0:0:0: [sda] tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK [ 139.421753] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [ 139.424561] sd 2:0:0:0: tag#1 Send: scmd 0xffff880036e614f0 [ 139.427081] sd 2:0:0:0: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00 [ 139.428997] sd 2:0:0:0: tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK [ 139.430982] sd 2:0:0:0: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00 root@mfo-sf194614:~# cat /sys/module/virtio_scsi/sections/.text 0xffffffffc0002000 Host: --- mfo@rotom:~/ddebs$ vscsi_text='0xffffffffc0002000' mfo@rotom:~/ddebs$ virsh qemu-monitor-command mfo-sf194614 --hmp 'gdbserver' mfo@rotom:~/ddebs$ wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-4.4.0-139-generic-dbgsym_4.4.0-139.165_amd64.ddeb mfo@rotom:~/ddebs$ dpkg-deb -x linux-image-4.4.0-139-generic-dbgsym_4.4.0-139.165_amd64.ddeb linux-image-4.4.0-139-generic-dbgsym_4.4.0-139.165_amd64/ mfo@rotom:~/ddebs$ cd linux-image-4.4.0-139-generic-dbgsym_4.4.0-139.165_amd64/usr/lib/debug/ gdb boot/vmlinux-4.4.0-139-generic \ -ex 'set confirm off' \ -ex "add-symbol-file lib/modules/4.4.0-139-generic/kernel/drivers/scsi/virtio_scsi.ko $vscsi_text" \ -ex 'break virtscsi_queuecommand' \ -ex 'target remote localhost:1234' \ -ex 'continue' Guest --- root@mfo-sf194614:~# dd if=/dev/sda of=/dev/null count=1 2>/dev/null [ 435.462432] sd 2:0:0:0: [sda] tag#1 Send: scmd 0xffff880036e614f0 [ 435.465679] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 <freeze> Host GDB: --- Thread 2 hit Breakpoint 1, virtscsi_queuecommand (vscsi=0xffff88013a6cf7d8, req_vq=0xffff88013a6cf9e8, sc=0xffff880036e614f0) at /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c:534 534 /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c: No such file or directory. (gdb) break virtio_ring.c:196 Breakpoint 2 at 0xffffffff814e383f: virtio_ring.c:196. (3 locations) (gdb) info break Num Type Disp Enb Address What 1 breakpoint keep y 0xffffffffc0003330 in virtscsi_queuecommand at /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c:534 breakpoint already hit 1 time 2 breakpoint keep y <MULTIPLE> 2.1 y 0xffffffff814e383f in virtqueue_add_sgs at /build/linux-PrHwV2/linux-4.4.0/drivers/virtio/virtio_ring.c:196 2.2 y 0xffffffff814e3dec in virtqueue_add_inbuf at /build/linux-PrHwV2/linux-4.4.0/drivers/virtio/virtio_ring.c:196 2.3 y 0xffffffff814e409c in virtqueue_add_outbuf at /build/linux-PrHwV2/linux-4.4.0/drivers/virtio/virtio_ring.c:196 (gdb) disable 1 (gdb) disable 2.2 (gdb) disable 2.3 (gdb) continue Continuing. Thread 2 hit Breakpoint 2, virtqueue_add (gfp=<optimized out>, data=<optimized out>, in_sgs=<optimized out>, out_sgs=<optimized out>, total_sg=<optimized out>, sgs=<optimized out>, _vq=<optimized out>) at /build/linux-PrHwV2/linux-4.4.0/drivers/virtio/virtio_ring.c:197 197 /build/linux-PrHwV2/linux-4.4.0/drivers/virtio/virtio_ring.c: No such file or directory. (gdb) disass Dump of assembler code for function virtqueue_add_sgs: ... => 0xffffffff814e383f <+175>: cmp %ecx,%ebx 0xffffffff814e3841 <+177>: ja 0xffffffff814e3a15 <virtqueue_add_sgs+645> ... (gdb) info reg $ecx $ebx ecx 0x80 128 ebx 0x1 1 (gdb) set $ecx = 0 (gdb) info reg $ecx $ebx ecx 0x0 0 ebx 0x1 1 (gdb) disable 2.1 (gdb) break virtscsi_target_destroy Breakpoint 3 at 0xffffffffc0002020: file /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c, line 791. (gdb) continue Continuing. Guest --- root@mfo-sf194614:~# dd if=/dev/sda of=/dev/null count=1 2>/dev/null [ 435.462432] sd 2:0:0:0: [sda] tag#1 Send: scmd 0xffff880036e614f0 [ 435.465679] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 <freeze> <continued after GDB above> [ 435.474443] sd 2:0:0:0: [sda] tag#1 queuecommand : request rejected [ 435.504597] scsi host2: unblocking host at zero depth [ 435.507603] sd 2:0:0:0: [sda] tag#1 Send: scmd 0xffff880036e614f0 [ 435.510896] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [ 435.515713] sd 2:0:0:0: [sda] tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK [ 435.520838] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00 [ 435.525044] sd 2:0:0:0: tag#1 Send: scmd 0xffff880036e614f0 [ 435.527795] sd 2:0:0:0: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00 [ 435.553066] sd 2:0:0:0: tag#1 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK [ 435.555822] sd 2:0:0:0: tag#1 CDB: Test Unit Ready 00 00 00 00 00 00 Host --- mfo@rotom:~$ virsh qemu-monitor-command mfo-sf194614 --hmp 'device_del scsi1-0-0-0' Guest --- [ 549.088876] sd 2:0:0:0: [sda] Synchronizing SCSI cache <...> [ 549.152828] sd 2:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Host GDB --- Thread 1 hit Breakpoint 3, virtscsi_target_destroy (starget=0xffff880036c62800) at /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c:791 791 /build/linux-PrHwV2/linux-4.4.0/drivers/scsi/virtio_scsi.c: No such file or directory. (gdb) disass Dump of assembler code for function virtscsi_target_destroy: => 0xffffffffc0002020 <+0>: nopl 0x0(%rax,%rax,1) 0xffffffffc0002025 <+5>: push %rbp 0xffffffffc0002026 <+6>: mov 0x328(%rdi),%rdi 0xffffffffc000202d <+13>: mov %rsp,%rbp 0xffffffffc0002030 <+16>: mov 0x4(%rdi),%eax 0xffffffffc0002033 <+19>: test %eax,%eax 0xffffffffc0002035 <+21>: je 0xffffffffc0002040 <virtscsi_target_destroy+32> 0xffffffffc0002037 <+23>: pause 0xffffffffc0002039 <+25>: mov 0x4(%rdi),%eax 0xffffffffc000203c <+28>: test %eax,%eax 0xffffffffc000203e <+30>: jne 0xffffffffc0002037 <virtscsi_target_destroy+23> 0xffffffffc0002040 <+32>: callq 0xffffffff811f64a0 <kfree> 0xffffffffc0002045 <+37>: pop %rbp 0xffffffffc0002046 <+38>: retq End of assembler dump. (gdb) si (gdb) (gdb) (gdb) (gdb) (gdb) disass Dump of assembler code for function virtscsi_target_destroy: 0xffffffffc0002020 <+0>: nopl 0x0(%rax,%rax,1) 0xffffffffc0002025 <+5>: push %rbp 0xffffffffc0002026 <+6>: mov 0x328(%rdi),%rdi 0xffffffffc000202d <+13>: mov %rsp,%rbp 0xffffffffc0002030 <+16>: mov 0x4(%rdi),%eax => 0xffffffffc0002033 <+19>: test %eax,%eax 0xffffffffc0002035 <+21>: je 0xffffffffc0002040 <virtscsi_target_destroy+32> 0xffffffffc0002037 <+23>: pause 0xffffffffc0002039 <+25>: mov 0x4(%rdi),%eax 0xffffffffc000203c <+28>: test %eax,%eax 0xffffffffc000203e <+30>: jne 0xffffffffc0002037 <virtscsi_target_destroy+23> 0xffffffffc0002040 <+32>: callq 0xffffffff811f64a0 <kfree> 0xffffffffc0002045 <+37>: pop %rbp 0xffffffffc0002046 <+38>: retq End of assembler dump. (gdb) info reg $eax eax 0x0 0 That looks good! :-) (gdb) continue Host --- And indeed, the CPU usage for the QEMU process remains low. Problem resolved. mfo@rotom:~$ pid=$(ps ax | grep qemu | grep mfo-sf194614 | cut -d' ' -f1) mfo@rotom:~$ top -b -d 1 -n 10 -p $pid | grep qemu 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.98 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 1.0 0.6 1:05.99 qemu-system-x86 10234 libvirt+ 20 0 9016056 768008 23260 S 0.0 0.6 1:05.99 qemu-system-x86 ** Tags removed: verification-needed-xenial ** Tags added: verification-done-xenial -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1798110 Title: xenial: virtio-scsi: CPU soft lockup due to loop in virtscsi_target_destroy() Status in linux package in Ubuntu: Triaged Status in linux source package in Xenial: Fix Committed Bug description: [Impact] * Detaching virtio-scsi disk in Xenial guest can cause CPU soft lockup in guest (and take 100% CPU in host). * It may prevent further progress on other tasks that depend on resources locked earlier in the SCSI target removal stack, and/or impact other SCSI functionality. * The fix resolves a corner case in the requests counter in the virtio SCSI target, which impacts a downstream (SAUCE) patch in the virtio-scsi target removal handler that depends on the requests counter value to be zero. [Test Case] * See LP #1798110 (this bug)'s comment #3 (too long for this section -- synthetic case with GDB+QEMU) and comment #4 (organic test case in cloud instance). [Regression Potential] * It seem low -- this only affects the SCSI command requeue path with regards to the reference counter, which is only used with real chance of problems in our downstream patch (which is now passing this testcase). * The other less serious issue would be decrementing it to a negative / < 0 value, which is not possible with this driver logic (see commit message), because the reqs counter is always incremented before calling virtscsi_queuecommand(), where this decrement operation is inserted. [Original Description] A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial when detaching a virtio-scsi drive, and provided a crashdump that shows 2 things: 1) The soft locked up CPU is waiting for another CPU to finish something, and that does not happen because the other CPU is infinitely looping in virtscsi_target_destroy(). 2) The loop happens because the 'tgt->reqs' counter is non-zero, and that probably happened due to a missing decrement in SCSI command requeue path, exercised when the virtio ring is full. The reported problem itself happens because of a downstream/SAUCE patch, coupled with the problem of the missing decrement for the reqs counter. Introducing a decrement in the SCSI command requeue path resolves the problem, verified synthetically with QEMU+GDB and with test-case/loop provided by the customer as problem reproducer. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798110/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp