David Mathog wrote:
[...]
Any of you running clusters without ECC? Has the lack of error
correction been a problem?
Hello, David.
Yes, I'm running openMosix on [EMAIL PROTECTED]/2600+ 1p compute nodes. I
posted this on the openMosix Wiki about it:
http://howto.krisbuytaert.be/openMosixWiki/index.php/Additions_to_the_FAQ
'Q.' How reliable is openMosix?
'A.' An openMosix cluster is only as reliable as its least reliable
node. In particular, memory corruption can be propagated throughout a
cluster if processes are migrated to and from an unreliable COTS
(Commodity Off The Shelf) PC without ECC (Error Correction Code) memory.
If the memory corruption is sufficient to make a migrated process crash,
the load on the unreliable node then decreases and more processes are
"attracted" to the node from the rest of the cluster by the openMosix
load balancing algorithm. Migrated processes that do not crash on the
node may also be corrupted if they make use of unreliable memory. When
these processes are migrated away from the unreliable node, the memory
corruption is propagated back to the rest of the openMosix cluster. For
this reason, it is essential to test the memory of COTS PCs thoroughly
BEFORE allowing them to join an openMosix cluster. This can be done
using a stand-alone utility, e.g. "memtest86" (http://www.memtest86.com/),
or under Linux with a user-mode utility, e.g. "memtester"
(http://pyropus.ca/software/memtester/).
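
To illustrate the "attraction" effect described above, here is a toy
simulation in Python. It is only a sketch under stated assumptions, not
the openMosix load balancing algorithm: the balancer simply places each
new job on the least-loaded node, jobs on healthy nodes run for a fixed
number of ticks, and every job that lands on the bad node crashes within
one tick. The function name simulate() and all of its parameters are
hypothetical. The model does not cover corruption being carried back to
healthy nodes by processes that survive.

```python
def simulate(num_nodes=4, bad_node=3, jobs=40, run_time=5):
    """Toy model of a least-loaded balancer with one crashing node.

    One new job arrives per tick and goes to the node with the fewest
    running jobs (ties go to the lowest node index). Jobs on healthy
    nodes finish after `run_time` ticks. Jobs on `bad_node` crash after
    one tick, so that node's load keeps dropping back to zero.

    Returns (placed, crashed): jobs placed on each node, and the total
    number of jobs that crashed.
    """
    # remaining[n] holds the ticks left for each job running on node n.
    remaining = [[] for _ in range(num_nodes)]
    placed = [0] * num_nodes
    crashed = 0

    for _ in range(jobs):
        # Assumed balancing rule: choose the least-loaded node.
        target = min(range(num_nodes), key=lambda n: len(remaining[n]))
        remaining[target].append(1 if target == bad_node else run_time)
        placed[target] += 1

        # Advance the clock by one tick on every node.
        for node in range(num_nodes):
            remaining[node] = [t - 1 for t in remaining[node]]
            done = sum(1 for t in remaining[node] if t == 0)
            if node == bad_node:
                crashed += done  # jobs ending on the bad node are crashes
            remaining[node] = [t for t in remaining[node] if t > 0]

    return placed, crashed


if __name__ == "__main__":
    placed, crashed = simulate()
    print("jobs placed per node:", placed)
    print("jobs crashed on bad node:", crashed)
```

Because the bad node sheds its load every tick, it always looks idle to
the balancer and receives more jobs than any healthy node, every one of
which is lost. Changing run_time or num_nodes changes the ratio but not
the direction of the effect in this model.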
Best wishes,
Tony.
--
Dr. A.J.Travis, | mailto:[EMAIL PROTECTED]
Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf