Control: tag -1 moreinfo

On Thu, 2014-04-03 at 01:33 +0200, Raphaël Droz wrote:
> Package: src:linux
> Version: 3.13.5-1
> Severity: normal
>
> Dear Maintainer,
>
> my system regularly hangs (like once a day or once every two days).
> Using netconsole I was able to grab a log of the failure.
> I interpret this as a mce error which triggers kernel panic in chain.
> I don't attach the full log (which lasted until I manually stop the machine)
> since later backtraces seem redundant with the first ones:
>
> Apr 2 13:25:17 192.168.0.4 [ 2984.381126] Suspending console(s) (use no_console_suspend to debug)
> [ had resumed from a 12 hours long hibernation, just a couple of minutes ago ]
> [ and here happens the crash, while they were no specific/intensive activity: ]
>
> Apr 3 00:09:28 192.168.0.4 [ 3038.509437] Disabling lock debugging due to kernel taint
> Apr 3 00:09:28 192.168.0.4 [ 3038.509482] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 1: b200000000000125
> Apr 3 00:09:28 192.168.0.4 [ 3038.509716] mce: [Hardware Error]: TSC 6ecec0fdf
> Apr 3 00:09:28 192.168.0.4 [ 3038.509849] mce: [Hardware Error]: PROCESSOR 0:6d8 TIME 1396476570 SOCKET 0 APIC 0 microcode 20
> Apr 3 00:09:28 192.168.0.4 [ 3038.510064] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> Apr 3 00:09:28 192.168.0.4 [ 3038.510236] mce: [Hardware Error]: Machine check: Invalid
> Apr 3 00:09:28 192.168.0.4 [ 3038.510374] Kernel panic - not syncing: Fatal machine check on current CPU
[...]
> - As far as I understood this post:
> http://forum.mepiscommunity.org/viewtopic.php?p=317898
> the kernel should not crash and, on 32bits, could safely ignore this
> hardware error.

No, that is nonsense.  MCEs cannot be ignored and are not specific to
x86_64.  In some cases the kernel may be able to recover from them, but
apparently not in this case.

Usually an MCE is due to faulty hardware.

Does this often happen shortly after resuming from hibernation?  Or was
that just when it happened on this occasion?

Please install the mcelog package, run 'mcelog --ascii' (as root) and
paste these log lines in the terminal:

[ 3038.509482] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 1: b200000000000125
[ 3038.509716] mce: [Hardware Error]: TSC 6ecec0fdf
[ 3038.509849] mce: [Hardware Error]: PROCESSOR 0:6d8 TIME 1396476570 SOCKET 0 APIC 0 microcode 20

This may be able to identify a memory module that is at fault.

Ben.

-- 
Ben Hutchings
The generation of random numbers is too important to be left to chance.
                                                            - Robert Coveyou
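
For anyone reading along who wants a rough idea of what the status word
b200000000000125 contains before mcelog is installed, here is a minimal
Python sketch.  It is not a substitute for 'mcelog --ascii': it only
splits out the architectural flag bits and the two 16-bit code fields,
assuming the IA32_MCi_STATUS layout described in the Intel SDM.  The
function name decode_mci_status is made up for this illustration.

```python
# Architectural flag bits of IA32_MCi_STATUS.
# Assumption: bit positions as documented in the Intel SDM
# (Machine-Check Architecture chapter).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds an address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status word into flags and code fields.

    Hypothetical helper for illustration only; it does not interpret the
    error codes the way mcelog does.
    """
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status word taken from the "Bank 1" line in the log above.
    decoded = decode_mci_status(0xB200000000000125)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

With this status word the UC and PCC flags come out set, which would be
consistent with the kernel treating the error as fatal rather than
recovering from it.  What the error code itself means is best left to
mcelog.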