>It's a pretty unusual hang. I bet that the reason that you don't get
>a kernel crash dump is that the kernel doesn't run long enough after
>the problem happens to create one.

Thanks, Greg. I suspected that. I am actually curious: when exactly can kdump be useful? If a crash is hardware-precipitated, the second kernel never gets a chance to do what it's supposed to. If it is software-related, and the first kernel actually has time to detect the inconsistency, then it might as well "deal" with the offending process.

>Probably your fastest solution is to swap parts until it works. Tedious,
>but...

That's exactly what I'm doing so far! :-) The problem is: which ones? CPU / MB / power supplies / RAM? I've even received suggestions as exotic as re-flashing the BIOS, ESM firmware upgrades, and processor reseating. Any bets on the likelihood based on symptoms, intuition, and past experience?

We've swapped CPUs and processors on all the offending nodes. It seems to have worked so far (i.e. none of the swapped machines have re-crashed), but I'm hesitant to conclude "problem solved" since all this is only over the last 2 weeks. I'm dreading the day when one of the swapped machines re-crashes! Let's see...

-- 
Rahul

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf