>It's a pretty unusual hang. I bet that the reason that you don't get
>a kernel crash dump is that the kernel doesn't run long enough after
>the problem happens to create one.

Thanks Greg. I suspected that... I am actually curious: when exactly
can kdump be useful? If a crash is hardware precipitated, the second
kernel never gets a chance to do what it's supposed to. If it is
software related, and the first kernel actually has time to detect the
inconsistency, then it might as well "deal" with the offending process.

>Probably your fastest solution is to swap parts until it works. Tedious,
>but...

That's exactly what I'm doing so far! :-) Problem is, which ones? CPU /
MB / power supplies / RAM? I've even received solutions as exotic as
re-flashing the BIOS / ESM firmware upgrades / processor reseating.
Any bets on the likelihood based on symptoms, intuition, and past
experience?

We've swapped CPUs and processors on all the offending nodes. Seems
to have worked so far (i.e. none of the swapped machines have
re-crashed). But I'm hesitant to conclude "problem solved" since all
this is only over the last 2 weeks.

I'm dreading the day when one of the swapped machines re-crashes!
Let's see...

-- 
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
