Out of the IRC discussions documenting potentially related issues:
- this bug: KVM: Host-Kernel: Xenial-GA, Qemu: Xenial-Ocata, Guest: Bionic
- bug 1722311: KVM: Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure
- bug 1713751: AWS: triggered by a Xenial kernel update, supposedly fixed but showing up again and again
- bug 1655842: Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure

These might after all just run into the same soft lockup symptom, but I thought it was worth mentioning for those not reading the IRC log.
These cases seem to somewhat agree on:
- a recent guest kernel
- a Xenial host kernel
- some memory pressure

To get further, I thought some sort of local reproducer would be easier for the kernel team to work on than needing a full cloud. But so far I have failed at setting such a local case up (http://paste.ubuntu.com/25916781/).

Thanks Laney for the openstack-based repro description. @Laney, I found it interesting that you essentially only needed to start+reboot. I assume on the host you had other workloads going on in the background (since it is lcy01)? If you have any sort of non-busy but otherwise comparable system, could you check it to confirm our assumption so far that all is fine there? If yes, the memory pressure theory gets more likely; if not, we can focus on simpler reproducers - so we can only win by that check.

Crossing fingers for jsalisbury's hope that 4.14 might already have a fix.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1730717

Title:
  Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Status in linux package in Ubuntu:
  Triaged
Status in qemu-kvm package in Ubuntu:
  New
Status in linux source package in Artful:
  Triaged
Status in qemu-kvm source package in Artful:
  New
Status in linux source package in Bionic:
  Triaged
Status in qemu-kvm source package in Bionic:
  New

Bug description:
  This is impacting us for ubuntu autopkgtests. Eventually the whole
  region ends up dying because each worker is hit by this bug in turn
  and backs off until the next reset (6 hourly).

  17.10 (and bionic) guests are sometimes failing to reboot. When this
  happens, you see the following in the console:

  [  OK  ] Reached target Shutdown.
  [  191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [  219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
  [systemd:1]
  [  226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
  [  226.704958]   (detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
  [  226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [  226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

  One host that exhibits this behaviour was:
  Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  guest running:
  Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 (Ubuntu 4.13.0-16.19-generic 4.13.4)

  The affected cloud region is running the xenial/Ocata cloud archive,
  so the version of qemu-kvm in there may also be relevant.

  Here's how I reproduced it in lcy01:

  $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
  $ <ssh to each instance> sudo reboot
    # wait a minute or so for the instances to all reboot
  $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done

  On bad instances you'll see the "soft lockup" message - on good ones
  it'll reboot as normal. We've seen good and bad instances on multiple
  compute hosts - it doesn't feel to me like a host problem but rather
  a race condition somewhere that's somehow either triggered or
  triggered much more often by what lcy01 is running.

  I always saw this on the first reboot - never on first boot, and
  never on n>1th boot. (But if it's a race then that might not mean
  much.)

  I'll attach a bad and a good console-log for reference. If you're at
  Canonical then see internal rt #107135 for some other details.
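  The good/bad check in the last loop above can be made mechanical by
  grepping each console log for the soft lockup marker. A minimal
  sketch (the `classify_log` helper name is mine, not from the report;
  it assumes the marker text shown in the console output above):

  ```shell
  # Sketch only: classify a console log read from stdin.
  # Prints BAD if the log contains the soft lockup marker seen on
  # broken instances, OK otherwise.
  classify_log() {
    if grep -q "soft lockup" ; then
      echo "BAD"
    else
      echo "OK"
    fi
  }

  # Possible usage against the 30 test instances (needs nova credentials):
  # for n in $(seq 1 30); do
  #   printf '=== %s: %s ===\n' "$n" "$(nova console-log laney-test${n} | classify_log)"
  # done
  ```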
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp