----- "Francesco Pietra" <[email protected]> wrote:

> Therefore, is it any software way to check if the CPUs are fully in
> order, including the memory controller? lshw and other software
> provided only partial help in my hands.

Make sure that you have ECC turned to MAX in your BIOS.
On our SuperMicro mainboards that enables scrubbing of RAM
and CPU caches as well as reporting of ECC memory errors.

For some reason the SuperMicro BIOSes we've had recently
have defaulted to turning ECC off, which isn't particularly
useful, especially on motherboards that can only take ECC
memory! We found that out the hard way recently. You can
check the current setting from the output of dmidecode like this:

dmidecode | grep -A7 "Physical Memory Array" | grep "Error Correction" | grep ECC
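
If you want to script that check across a lot of nodes, something
like the following might do as a starting point. It is only a sketch:
the function names ecc_types and ecc_enabled are made up for this
example, and it assumes dmidecode prints each "Physical Memory Array"
record as a block ending in a blank line, with an
"Error Correction Type:" field inside it.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dmidecode itself.

# Read dmidecode output on stdin and print the "Error Correction Type"
# value of every Physical Memory Array record found.
ecc_types() {
    awk '
        /^Physical Memory Array/ { in_array = 1; next }
        /^$/                     { in_array = 0 }
        in_array && /Error Correction Type:/ {
            sub(/^[ \t]*Error Correction Type:[ \t]*/, "")
            print
        }
    '
}

# Read dmidecode output on stdin; succeed (exit 0) only if at least one
# memory array was found and every one of them reports some form of ECC.
ecc_enabled() {
    types=$(ecc_types)
    [ -n "$types" ] || return 1
    # Any value without "ECC" in it (e.g. "None") counts as ECC being off.
    if printf '%s\n' "$types" | grep -qv 'ECC'; then
        return 1
    fi
    return 0
}

# Usage (as root, since dmidecode needs access to the DMI tables):
#   dmidecode | ecc_enabled || echo "ECC is off on $(hostname)"
```

A node where the BIOS has ECC disabled would typically show "None" as
the error correction type, which makes ecc_enabled fail.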

Make sure you're also running mcelog to pull any MCE
or ECC hardware reports that the kernel has recorded
from the CPUs out to a logfile.

We find that running it with the --k8 and --dmi options
is important to decode more information about these events.
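
As a rough illustration, a small wrapper you could call from cron
might look like this. Treat it as a sketch under assumptions: the
log path /var/log/mcelog and the function name collect_mce are
invented for the example, and it assumes that mcelog run without
daemon mode prints any pending events to stdout and then exits.

```shell
#!/bin/sh
# Hypothetical cron wrapper; MCELOG and LOGFILE defaults are assumptions.
MCELOG=${MCELOG:-mcelog}
LOGFILE=${LOGFILE:-/var/log/mcelog}

# Run mcelog with the --k8 and --dmi options mentioned above and append
# whatever it reports, prefixed with a timestamp, to the logfile.
collect_mce() {
    out=$("$MCELOG" --k8 --dmi 2>&1) || return 1
    # Nothing recorded since the last run: leave the logfile untouched.
    [ -n "$out" ] || return 0
    { date; printf '%s\n' "$out"; } >> "$LOGFILE"
}

# Usage (as root, e.g. from an hourly cron job):
#   collect_mce
```

An empty logfile then means the kernel has not recorded any events,
rather than that nobody was looking.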

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
