Thanks, Mike, for the details. Just to make sure: did you collect the
info from the same instance that locked up (either before or after a
reboot)? That would ensure that the host information really belongs to
the host where the problem happened.

As for more details: not right now, as we first need to understand more
about the problem. But to get a general feeling:
- Looking at the same instance type, does it happen on all of them
  sooner or later, or are there exceptions?
- Did the same kind of workload run without issues on a previous Ubuntu
  release, or were those new projects starting with Oneiric/Precise?
- Probably more for Matt: are there issues with other Linux distros
  running comparable kernel versions?
- It might be worth trying a kernel from mainline
  (http://kernel.ubuntu.com/~kernel-ppa/mainline/).
  Right now I would probably go for a generic 64-bit 3.5.2 and maybe a
  3.6-rc2 kernel. I am not sure whether update-grub in Precise already
  picks up generic kernels, so one might need to fiddle with
  /boot/grub/menu.lst manually after installing the packages.

As Matt wrote above, looking at the traces in a bit more detail, some
CPUs are stuck entering the hypervisor call to wait for a spinlock,
while others seem to have come out of that and are trying to wake up
some waiters.
@Matt, how do you produce those CPU stack traces? Are they from a dump,
or do you somehow tap into the still-running instance?
Right now it is hard to say whether this is a real deadlock (the types
of locks can probably be obtained by checking the backtrace for every
CPU, but that could be hard when it comes to locks of individual
structures/devices) or a problem with the delivery of the spinlock
event (either the wrong CPU was notified, or for some reason the event
never happened). That is also not easy to get hold of.

Our best chance would be to re-create this on an isolated test system.
For that I would need some relatively simple-to-follow steps that let
me create the workload that is causing the issue. Still, I only have an
8-core machine, which I would have to overcommit to 16, and it is
unclear whether that would give the same results...

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1011792

Title:
  Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1011792/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
