David Mathog wrote:
[...]
Any of you running clusters without ECC?  Has the lack of error
correction been a problem?

Hello, David.

Yes, I'm running openMosix on [EMAIL PROTECTED]/2600+ 1p compute nodes. I posted this about it on the openMosix Wiki:

http://howto.krisbuytaert.be/openMosixWiki/index.php/Additions_to_the_FAQ

'Q.' How reliable is openMosix?

'A.' An openMosix cluster is only as reliable as its least reliable node. In particular, memory corruption can propagate throughout a cluster if processes are migrated to and from an unreliable COTS (Commodity Off The Shelf) PC without ECC (Error Correction Code) memory. If the corruption is severe enough to make a migrated process crash, the load on the unreliable node decreases, and the openMosix load-balancing algorithm then "attracts" more processes to that node from the rest of the cluster. Migrated processes that do not crash on the node may still be corrupted if they use the unreliable memory. When these processes migrate away from the unreliable node, the corruption is carried back to the rest of the openMosix cluster. For this reason, it is essential to test the memory of COTS PCs thoroughly BEFORE allowing them to join an openMosix cluster. This can be done with a stand-alone utility, e.g. "memtest86" (http://www.memtest86.com/), or under Linux with a user-mode utility, e.g. "memtester" (http://pyropus.ca/software/memtester/).
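
To illustrate the kind of check such utilities perform, here is a minimal sketch in Python of a fill-and-verify pattern test. It is only an illustration, not how memtest86 or memtester are implemented: the function names (find_mismatches, pattern_test), the buffer size and the pattern values are my own assumptions, and a user-space script like this only touches whatever memory the OS happens to hand it, so it is no substitute for running the real tools on each node before it joins the cluster.

```python
def find_mismatches(buf, pattern):
    """Return the offsets in buf whose byte differs from the expected pattern byte."""
    return [offset for offset, value in enumerate(buf) if value != pattern]


def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern byte, read it back, and collect mismatches.

    Returns a list of (pattern, offset) tuples; an empty list means no
    mismatch was observed in this (hypothetical, illustrative) test.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern to every byte of the buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read every byte back and record any that changed.
        errors.extend((pattern, offset) for offset in find_mismatches(buf, pattern))
    return errors


if __name__ == "__main__":
    bad = pattern_test()
    print("mismatches found:", len(bad))
```

A node that reports any mismatch here is certainly suspect, but a clean result proves very little, which is why the stand-alone testers remain the right tool.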

Best wishes,

        Tony.
--
Dr. A.J.Travis,                     |  mailto:[EMAIL PROTECTED]
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
