OK, here's my analysis of the latest dump. Three kernel migration threads are waiting, and that is the cause of the softlockup. Specifically, pid 101 on cpu 13 is where the softlockup happens (followed by a panic, because panic on softlockup is enabled); the other two migration threads (pids 79 and 151) are also waiting. All three are waiting for multi_cpu_stop to finish.

The way multi_cpu_stop works is: the caller sets up one or more cpus to coordinate stopping. Inside multi_cpu_stop, a state machine moves from MULTI_STOP_PREPARE, through disabling irqs, to run (calling the provided function), to exit when done. Only the cpus specified in the cpumask run the function, but every participating cpu steps through the states, and the state machine doesn't proceed to the next state until all of them have processed the current one.
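To make the hand-off concrete, here is a small user-space sketch of that state machine. It is not the kernel code: the state names follow the kernel's, but everything else (stop_sim, sim_init, sim_step, sim_run, the "tick" notion of time, and the busy_cpu knob) is made up for illustration. It assumes the last cpu to acknowledge a state is the one that advances it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* State names as used by the kernel's stop_machine code. */
enum multi_stop_state {
    MULTI_STOP_NONE,
    MULTI_STOP_PREPARE,
    MULTI_STOP_DISABLE_IRQ,
    MULTI_STOP_RUN,
    MULTI_STOP_EXIT,
};

#define SIM_MAX_CPUS 32

/* Hypothetical simulation state; cpus are indexed 0..ncpus-1. */
struct stop_sim {
    int ncpus;                                /* participating cpus */
    enum multi_stop_state state;              /* shared current state */
    int pending_acks;                         /* cpus yet to process `state` */
    unsigned int active_mask;                 /* cpus that run the function */
    enum multi_stop_state seen[SIM_MAX_CPUS]; /* last state each cpu handled */
    int fn_calls[SIM_MAX_CPUS];               /* times "fn" ran on each cpu */
    long spin_ticks[SIM_MAX_CPUS];            /* ticks spent just waiting */
};

static void sim_init(struct stop_sim *s, int ncpus, unsigned int active_mask)
{
    assert(ncpus > 0 && ncpus <= SIM_MAX_CPUS);
    memset(s, 0, sizeof(*s));                 /* seen[] starts at MULTI_STOP_NONE */
    s->ncpus = ncpus;
    s->active_mask = active_mask;
    s->state = MULTI_STOP_PREPARE;
    s->pending_acks = ncpus;
}

/* One pass of the polling loop on `cpu`; returns true once it has exited. */
static bool sim_step(struct stop_sim *s, int cpu)
{
    if (s->seen[cpu] == MULTI_STOP_EXIT)
        return true;
    if (s->state == s->seen[cpu]) {
        s->spin_ticks[cpu]++;                 /* nothing new yet: keep spinning */
        return false;
    }
    s->seen[cpu] = s->state;
    if (s->state == MULTI_STOP_RUN && (s->active_mask & (1u << cpu)))
        s->fn_calls[cpu]++;                   /* only cpus in the mask run fn */
    /* The last cpu to acknowledge moves everyone to the next state. */
    if (--s->pending_acks == 0 && s->state != MULTI_STOP_EXIT) {
        s->state = (enum multi_stop_state)(s->state + 1);
        s->pending_acks = s->ncpus;
    }
    return s->seen[cpu] == MULTI_STOP_EXIT;
}

/*
 * Step all cpus until every one has exited. `busy_cpu` cannot run its
 * stopper for the first `busy_ticks` ticks (pass -1 for no busy cpu).
 * Returns the number of ticks that elapsed.
 */
static long sim_run(struct stop_sim *s, int busy_cpu, long busy_ticks)
{
    long tick = 0;
    int done;

    do {
        done = 0;
        for (int cpu = 0; cpu < s->ncpus; cpu++) {
            if (cpu == busy_cpu && tick < busy_ticks)
                continue;                     /* stuck doing something else */
            if (sim_step(s, cpu))
                done++;
        }
        tick++;
    } while (done < s->ncpus);
    return tick;
}
```

With two cpus and only index 0 in the mask, sim_init(&s, 2, 0x1) followed by sim_run(&s, -1, 0) finishes in a handful of ticks. With sim_run(&s, 1, 1000) instead, index 0 spins for about 1000 ticks before the first state change, because index 1 never gets to acknowledge MULTI_STOP_PREPARE.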
This is where the problem comes in. In this case, tasks are being migrated from one numa node to another by numa rebalancing, and three rebalancing events are in flight: cpu 3 and cpu 10, cpu 3 and cpu 13, and cpu 3 and cpu 20. The migration threads on cpus 10, 13, and 20 are running multi_cpu_stop, but they are stuck waiting because cpu 3 still has the work sitting in its queue.

cpu 3 is writing bytes to the serial port, and is currently waiting for confirmation that the serial port write completed. The wait checks the serial port register for CTS; if it's not set, it delays for 1us and tries again. However, this all happens with a spinlock held and irqs disabled, so while this serial port r/w is being done, nothing else will run on that cpu. The code does limit the wait to 1 second, so presumably it shouldn't lock up the cpu for much longer than that (I haven't dug too far into this, so the function may be called multiple times with the lock held). For whatever reason, the serial port r/w seems to be taking a long time. The migration threads on the other cpus are waiting for it to finish, so that the migration thread on cpu 3 can run and move the multi_cpu_stop state machine along, but that doesn't happen in time to avoid the softlockup detector.

The multi_cpu_stop function could arguably use a call to touch_nmi_watchdog(), since it intentionally spins on the cpu with interrupts disabled. Doing so would keep the softlockup detector from firing, but would not change the system behavior. However, it's not really multi_cpu_stop's fault, since the real cause is that the cpu it's waiting for is locked up.

Back on cpu 3 (the one the others are waiting on), the delay is implemented using the TSC. Unfortunately, the TSC is a generally unreliable clock source, so it's possible there is a problem in the delay function.
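As a rough illustration of why a 1 second cap per wait may not bound the total time spent under the lock, here is a user-space sketch with a simulated clock. It is not the serial driver's code: fake_uart, fake_udelay, cts_asserted, wait_for_cts and console_write_sim are all hypothetical names, and the sketch assumes the wait is invoked once per character written, which I have not confirmed.

```c
#include <assert.h>
#include <stdbool.h>

/* 1,000,000 polls x 1us each = roughly a 1 second cap per call. */
#define CTS_POLL_LIMIT 1000000L

/* Hypothetical stand-in for the serial port plus a simulated clock. */
struct fake_uart {
    long cts_ready_at_us;  /* simulated time CTS becomes set; -1 = never */
    long now_us;           /* simulated clock, advanced by fake_udelay() */
};

/* Stand-in for udelay(): just advances the simulated clock. */
static void fake_udelay(struct fake_uart *u, long usecs)
{
    u->now_us += usecs;
}

static bool cts_asserted(const struct fake_uart *u)
{
    return u->cts_ready_at_us >= 0 && u->now_us >= u->cts_ready_at_us;
}

/* Poll for CTS 1us at a time; give up after CTS_POLL_LIMIT polls. */
static bool wait_for_cts(struct fake_uart *u)
{
    for (long tries = 0; tries < CTS_POLL_LIMIT; tries++) {
        if (cts_asserted(u))
            return true;
        fake_udelay(u, 1);
    }
    return false;          /* timed out after about 1 second */
}

/*
 * Write `len` characters as if under a held lock with irqs off.
 * Assumption: one bounded wait per character. Returns the simulated
 * microseconds that passed while the "lock" was held.
 */
static long console_write_sim(struct fake_uart *u, int len)
{
    long start = u->now_us;

    for (int i = 0; i < len; i++)
        (void)wait_for_cts(u);  /* the character would be sent here */
    return u->now_us - start;
}
```

If CTS never comes up, a single character costs the full 1,000,000us, and a 30 character message costs 30 times that, all without the cpu running anything else. If CTS is set after 5us, the same loop costs 5us in total. Every poll also relies on the 1us delay being accurate, so a delay that runs long would stretch the whole wait accordingly.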
To determine that, can you please boot with the "notsc" parameter, which will change the udelay function to use a simple loop instead of the TSC, and reproduce the softlockup?

** Changed in: linux (Ubuntu)
     Assignee: Rafael David Tinoco (inaddy) => Dan Streetman (ddstreet)

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1505564

Title:
  Soft lockup with "block nbdX: Attempted send on closed socket" spam

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Some of our nova compute hosts regularly freeze, sometimes for a few
  hours, with kern.log getting spammed with:

  block nbdX: Attempted send on closed socket

  and a few "CPU soft lockup" messages (see attached log). This clears
  up when the queue gets cleared, eg :

  block nbdX: queue cleared

  trusty hosts with kernel version 3.19.0-30-generic.
  ---
  AlsaDevices:
   total 0
   crw-rw---- 1 root audio 116, 1 Nov 24 12:23 seq
   crw-rw---- 1 root audio 116, 33 Nov 24 12:23 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3.19
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 14.04
  IwConfig: Error: [Errno 2] No such file or directory
  MachineType: HP ProLiant DL385 G7
  Package: linux (not installed)
  PciMultimedia:

  ProcEnviron:
   TERM=screen-256color
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.19.0-36-generic root=UUID=13289ac9-8dc9-4feb-b6bd-ca7db66b21d6 ro console=tty0 console=ttyS1,38400 nosplash crashkernel=384M-:512M nox2apic intremap=off
  ProcVersionSignature: Ubuntu 3.19.0-36.41~14.04.1hf00090138v20151122b1-generic 3.19.8-ckt9
  RelatedPackageVersions:
   linux-restricted-modules-3.19.0-36-generic N/A
   linux-backports-modules-3.19.0-36-generic N/A
   linux-firmware 1.127.18
  RfKill: Error: [Errno 2] No such file or directory
  Tags: trusty uec-images
  Uname: Linux 3.19.0-36-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:

  _MarkForUpload: True
  dmi.bios.date: 02/02/2014
  dmi.bios.vendor: HP
  dmi.bios.version: A18
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: dmi:bvnHP:bvrA18:bd02/02/2014:svnHP:pnProLiantDL385G7:pvr:cvnHP:ct23:cvr:
  dmi.product.name: ProLiant DL385 G7
  dmi.sys.vendor: HP

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/+subscriptions