> On Wed, Aug 06, 2008 at 02:56:51PM -0500, Jason Clinton wrote:
>
>> We have a tool on our website called "breakin" that is Linux 2.6.25.9
>> patched with K8 and K10f Opteron EDAC reporting facilities. It can
>> usually find and identify failed RAM in fifteen minutes (two hours at
>> most). The EDAC patches to the kernel aren't that great about naming
>> the correct memory rank, though.
>>
>> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your
>> BIOS.
>>
>> http://www.advancedclustering.com/software/breakin.html
>
> I just gave this a try, and it seems to be a very nicely packaged
> utility. Thanks for making it available. I've used some similar stuff
> before, but this is really easy.
>
> -- greg
>

After more than a week of testing I can assert :-) that the cause was
poor power: the UPS was operating outside its envelope. Since I
redistributed the load, moving some nodes to other UPSes, the errors
have gone away.

Thanks for all the suggestions,
paulo

--
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         | 294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: [EMAIL PROTECTED]
2829-516 Caparica, PORTUGAL
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf