On 20/05/11 05:35, Joe Landman wrote:
> Hi folks
>
> Does anyone run a large-ish cluster without ECC ram? Or with ECC
> turned off at the motherboard level? I am curious if there are numbers
> of these, and what issues people encounter. I have some of my own data
> from smaller collections of systems, I am wondering about this for
> larger systems.
Hi, Joe.

I ran a small cluster of ~100 32-bit nodes with non-ECC memory and it was
a nightmare, as Guy described in his email, until I pre-emptively tested
the memory in user space with Charles Cazabon's "memtester":

  http://pyropus.ca/software/memtester

Prior to this, *all* the RAM had passed Memtest86+. I had a strict policy
that if a system crashed, for any reason, it was re-tested with Memtest86+
and then 100 passes of "memtester" before being allowed to re-join the
Beowulf cluster. This made the Beowulf much more stable running openMosix.
However, I've scrapped all our non-ECC nodes now, because the real worry
is not knowing whether an error has occurred...

Apparently this is still a big issue for computers in space, which use
non-ECC RAM as solid-state storage for imaging on grounds of cost. They,
apparently, use background SoftECC RAM 'scrubbers' like this:

  http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

Bye, Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and
Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.tra...@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk
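[Editorial note: a minimal sketch of the re-test step described above (100
passes of memtester before a node may re-join the cluster), assuming
memtester's usual command line of an amount of memory to lock plus an
optional loop count, with a non-zero exit status on any failure. The memory
size, the node-admission logic and the wrapper itself are illustrative
assumptions, not the script actually used on that cluster.]

#!/usr/bin/env python3
# Sketch: run N passes of Charles Cazabon's "memtester" on this node and
# report whether the RAM is considered clean enough to re-join the cluster.

import subprocess
import sys

MEM_TO_TEST = "1024M"   # assumption: test less than physical RAM so the OS keeps headroom
PASSES = 100            # the policy quoted in the mail: 100 passes of memtester

def ram_is_clean(mem: str = MEM_TO_TEST, passes: int = PASSES) -> bool:
    # memtester is invoked as "memtester <mem>[B|K|M|G] [loops]" and exits
    # non-zero if any test pattern fails during any pass.
    result = subprocess.run(["memtester", mem, str(passes)])
    return result.returncode == 0

if __name__ == "__main__":
    if ram_is_clean():
        print("RAM passed; node may re-join the cluster")
        sys.exit(0)
    print("RAM failed; keep the node out of the cluster")
    sys.exit(1)

[On a 32-bit node the amount tested would normally be chosen below the full
physical RAM, so the operating system and the locked test region can coexist.]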