Hi Jonathan,
On 08/08/11 16:16, Jonathan Nieder wrote:
> I assume this is fairly reproducible even after a reboot? Is the
Correct, we can reproduce the lock ups after a reboot following 5-60
minutes of high I/O load (900MB/s plus).
> stacktrace from the first sign of trouble in dmesg always the same?
I'm no expert at reading these but I believe it is the same. Here's the
trace after the next reboot/lock up cycle:
[ 3705.959849] kernel BUG at
/build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_none/mm/slub.c:2969!
[ 3706.077621] invalid opcode: 0000 [#1] SMP
[ 3706.113947] last sysfs file:
/sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-3/target1:0:1/1:0:1:0/block/sdj/stat
[ 3706.235513] CPU 0
[ 3706.251928] Modules linked in: btrfs zlib_deflate crc32c libcrc32c
ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs
ext4 jbd2 crc16 ext2 dm_round_robin dm_multipath scsi_dh loop sd_mod
crc_t10dif snd_pcm joydev snd_timer snd soundcore snd_page_alloc usbhid
hid evdev pcspkr hpilo hpwdt psmouse power_meter container processor
button serio_raw ext3 jbd mbcache dm_mod hpsa cciss uhci_hcd ehci_hcd
qla2xxx usbcore scsi_transport_fc nls_base scsi_tgt scsi_mod be2net
thermal thermal_sys [last unloaded: scsi_wait_scan]
[ 3706.781628] Pid: 1845, comm: ext4-dio-unwrit Not tainted
2.6.32-5-amd64 #1 ProLiant BL460c G7
[ 3706.882853] RIP: 0010:[<ffffffff810e730b>] [<ffffffff810e730b>]
kfree+0x55/0xcb
[ 3706.956205] RSP: 0018:ffff8805851c7e00 EFLAGS: 00010246
[ 3707.017700] RAX: 0200000000000000 RBX: ffff88058553eed0 RCX:
0000000000000042
[ 3707.091197] RDX: ffff88058553eea0 RSI: 0000000000000041 RDI:
ffffea001352a590
[ 3707.167835] RBP: ffff88058553eea0 R08: ffff880585fdc0d0 R09:
0000000000080000
[ 3707.245578] R10: 0000000000000014 R11: ffff880584a6b8b8 R12:
ffffffffa023ddcf
[ 3707.319659] R13: ffff88058553eed8 R14: ffff880584a6b880 R15:
ffff880584a6b880
[ 3707.393985] FS: 0000000000000000(0000) GS:ffff880015200000(0000)
knlGS:0000000000000000
[ 3707.476061] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 3707.538541] CR2: 00007f4ffff0377c CR3: 000000026295b000 CR4:
00000000000006f0
[ 3707.627218] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 3707.707945] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 3707.788003] Process ext4-dio-unwrit (pid: 1845, threadinfo
ffff8805851c6000, task ffff880584a6b880)
[ 3707.885120] Stack:
[ 3707.914054] ffff88058553eed0 ffff88058553eea0 ffff8805844b0928
ffffffffa023ddcf
[ 3707.992872] <0> ffff8805851c7ef8 ffffe8ffffa08680 ffff88058553eed0
ffffffff810618e7
[ 3708.072050] <0> 000000000000f9e0 ffff880584a6bc38 ffff880584a6b880
ffff8805851c7fd8
[ 3708.169803] Call Trace:
[ 3708.211806] [<ffffffffa023ddcf>] ? ext4_end_aio_dio_work+0x4e/0x5a
[ext4]
[ 3708.285689] [<ffffffff810618e7>] ? worker_thread+0x188/0x21d
[ 3708.340716] [<ffffffffa023dd81>] ? ext4_end_aio_dio_work+0x0/0x5a [ext4]
[ 3708.415673] [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
[ 3708.495456] [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
[ 3708.554553] [<ffffffff81064c4d>] ? kthread+0x79/0x81
[ 3708.616011] [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 3708.675317] [<ffffffff81064bd4>] ? kthread+0x0/0x81
[ 3708.730683] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[ 3708.784232] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00
00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0
75 04 <0f> 0b eb fe 5b 5d 41 5c e9 98 56 fd ff 48 8b 4c 24 18 4c 8b 4f
[ 3708.990151] RIP [<ffffffff810e730b>] kfree+0x55/0xcb
[ 3709.047553] RSP <ffff8805851c7e00>
[ 3709.095349] ---[ end trace fec09b541df2db86 ]---
[ 3709.158246] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
I now have serial console logging enabled on these servers, so I can
provide a fuller copy of the trace if required, although I'm guessing
the only useful output is that pasted above.
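Since I'm judging "the same" by eye, here is a minimal sketch of how the
traces from successive lock-ups could be compared mechanically. It is only
an illustration: the function names (trace_signature, same_trace) are made
up for this example, and it assumes the oops text has the same layout as
the one pasted above (a "Call Trace:" line followed by frames, then a
"Code:" line). It ignores timestamps and addresses and compares only the
ordered symbol names.

```python
import re

# Matches frames such as "[<ffffffff810618e7>] ? worker_thread+0x188/0x21d"
# and captures the symbol name. The "?" marker is optional.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_signature(oops_text):
    """Return the ordered symbol names from the Call Trace section.

    Assumes the layout of the oops pasted above: frames sit between
    "Call Trace:" and the "Code:" line.
    """
    # Keep only the text after "Call Trace:" so the leading RIP line is skipped.
    _, _, tail = oops_text.partition("Call Trace:")
    # Drop everything from "Code:" onwards (including the trailing RIP line).
    tail = tail.partition("Code:")[0]
    return FRAME_RE.findall(tail)


def same_trace(oops_a, oops_b):
    """True if both oopses list the same functions in the same order."""
    return trace_signature(oops_a) == trace_signature(oops_b)
```

Feeding it the serial console capture from two lock-ups and calling
same_trace(first, second) would show whether the call chains match, even
though the timestamps and stack addresses differ between boots.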
> Did this machine work well with other kernels before (and if so,
> which ones)?
The machine is new, so we haven't tried older kernels. We have tried
the current bpo kernel and also experienced lock ups there, although we
didn't have remote/serial logging enabled at the time. I can retest and
capture the logs if that would be useful.
> If you get a chance to run memtest86+, that would also be useful, of
> course.
We have 5 of these blades, all identical. I memtest86+'d them on arrival
a couple of weeks ago and everything was clean. I'll retest tonight
though, just to be on the safe side. I'll also repeat earlier tests on
one of the other blades to capture a trace (we've seen lock ups on the
other blades too but, again, didn't have remote/serial logging enabled
at the time).
Thanks, Paul.
--
Paul Elliott, UNIX Systems Administrator
York Neuroimaging Centre, University of York