> We have a tool on our website called "breakin" that is Linux 2.6.25.9
> patched with K8 and K10f Opteron EDAC reporting facilities. It can
> usually find and identify failed RAM in fifteen minutes (two hours at
> most). The EDAC patches to the kernel aren't that great about naming
> the correct memory rank, though.
>
> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your BIOS.
>
> http://www.advancedclustering.com/software/breakin.html

I've been using breakin for the past week or two on my new cluster, and
some of the results seem inconsistent. For example, on one node I'll get
this:

Test     | Pass | Fail | Last Message
------------------------------------------
hdhealth | 315  | 0    | No disk devices found

Then in the log section:

00h 57m 40s: Disabling burnin test 'hdhealth'

If I reboot and restart the testing, it will see a hard disk. Why is
breakin not always seeing the disk?

I've tried to dump logs to a USB drive, but breakin refuses to mount the
correct partition on my USB drive (/dev/sdb vs. /dev/sdb1, or vice versa).

I sent e-mail to Advanced Clustering regarding these issues but didn't
get any response, so I'm hoping I have better luck here.

--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf