On Tue, Nov 18, 2014 at 10:30 AM, Luck, Tony <[email protected]> wrote:
>>> The lost cpu is *really* lost.  Warm reset doesn't fix the machine, I
>>> usually have to do a full power cycle.
>
>> How is it even possible that I did that with a few lines of asm?
>
> Probably not directly your fault - some cascade of errors may have
> occurred.

I went and read the manual.  Here's a hypothesis:

Your test case is presumably doing something that involves setting
undocumented registers* to program the CPU or memory controller to
generate a machine check on access to some address.  Presumably this
is done by broadcasting an SMI and programming the registers in SMM.

Now SMM is rather strange.  The docs list a large set of interrupt
sources that are disabled on SMM entry, and this list does not include
#MC.  So presumably #MC is actually left enabled on entry to SMM.
That means that, unless SMRAM has an interrupt table that has a
working machine check handler (which seems highly unlikely), then
there is at least some window in which a #MC delivered in SMM will
cause some kind of failure.

This could really happen: a broadcast #MC could easily race a
broadcast SMI and do this.

If you crash your SMM code, then I wouldn't be at all surprised if the
CPU wedges hard enough that even your remote management thing can't
reset it.

* These are probably the registers that are supposed to be documented
in volume 2 section 4.4.9 of the Xeon E5 1600/2600 datasheet,
reference 326509-003, but the docs are extremely incomplete.

--Andy

>
>> Could this be a hardware bug?  Is there some condition that causes #MC
>> delivery to wedge hard enough that even INIT/RESET stops working?  Or
>> possibly some CPU got stuck in SMM -- I have no idea what warm reset
>> does these days.
>
> I'm not even sure what kind of reset the remote management i/f I used
> actually applied.
>
>> Here's the patch to improve the timeout messages, but given the degree
>> of wedgedness, I can guess what it'll say:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0
>
> Heh - I'd already put in some hacky printk()s to do similar.  Mine aren't
> upstream quality, but do print the value of mce_callin/mce_executing
> as appropriate.
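
For illustration only, here is a rough userspace model of the kind of
call-in accounting being discussed.  It is a sketch, not the kernel
code: `callin`, `cpu_call_in()` and `wait_for_callin()` are made-up
names standing in for mce_callin and the rendezvous wait loop, and the
spin budget stands in for whatever timeout the real handler uses.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's mce_callin counter. */
static atomic_int callin;

/* Each CPU entering the (modelled) handler calls this once. */
static void cpu_call_in(void)
{
    atomic_fetch_add(&callin, 1);
}

/*
 * Poll until `expected` CPUs have called in or the spin budget runs out.
 * Returns 0 on success and -1 on timeout; `msg` describes the outcome,
 * including how many CPUs actually showed up.
 */
static int wait_for_callin(int expected, long max_spins, char *msg, size_t len)
{
    long spins = 0;

    while (atomic_load(&callin) < expected) {
        if (spins++ >= max_spins) {
            snprintf(msg, len, "timeout: only %d of %d CPUs called in",
                     atomic_load(&callin), expected);
            return -1;
        }
    }
    snprintf(msg, len, "all %d CPUs called in", expected);
    return 0;
}
```

If 142 callers run cpu_call_in() and the waiter expects 144, the
message names the shortfall rather than just saying "timeout", which
is the useful part when a core has gone away entirely.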
> But I got some confusing results - reporter complained that
> only 142 of 144 had shown up. So two threads missing,
> maybe means one core went into h/w shutdown. Need to dig further to see if
> the missing duo really are from the same core.
>
> -Tony

--
Andy Lutomirski
AMA Capital Management, LLC

