On 20/05/11 06:45, Greg Lindahl wrote:
> On Fri, May 20, 2011 at 12:35:25AM -0400, Joe Landman wrote:
>
>> Does anyone run a large-ish cluster without ECC ram? Or with ECC
>> turned off at the motherboard level? I am curious if there are numbers
>> of these, and what issues people encounter. I have some of my own data
>> from smaller collections of systems, I am wondering about this for
>> larger systems.
We did, circa 2003. Never again.

When we were lucky, the uncorrected errors happened in memory in use by the kernel or application code, and we got hard machine crashes or code seg-faulting. Those were easy to spot.

When we were unlucky, the errors happened in page cache, resulting in data being randomly transmuted. Most of the code we were running at the time did minimal input sanity checking. It was quite instructive to see just how much genomic analysis code would quite happily compute on DNA sequences that contained things other than ATGC.

The duff runs would eventually get picked up by the various sanity-checks that happened at the end of our analysis pipelines, but it involved quite a bit of developer & sysadmin effort to track down and re-run all of the possibly affected jobs.

Cheers,

Guy

--
Dr. Guy Coates, Informatics Systems Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
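
For illustration only, here is a minimal sketch of the kind of up-front input sanity check described above, which would reject a sequence containing anything other than ATGC before any analysis runs. It is not the code we ran at the time. The function names are hypothetical, and the strict uppercase four-letter alphabet is an assumption: a real pipeline may need to accept N, IUPAC ambiguity codes, or lowercase soft-masked bases.

```python
# Assumed alphabet: strict uppercase A, C, G, T only (an illustrative choice).
VALID_BASES = frozenset("ACGT")


def find_invalid_bases(seq):
    """Return (offset, character) pairs for every character outside VALID_BASES."""
    return [(i, ch) for i, ch in enumerate(seq) if ch not in VALID_BASES]


def check_sequence(name, seq):
    """Return seq unchanged if it is clean, otherwise raise ValueError.

    `name` is a hypothetical record identifier used only in the error message.
    """
    bad = find_invalid_bases(seq)
    if bad:
        offset, ch = bad[0]
        raise ValueError(
            f"{name}: {len(bad)} invalid character(s), "
            f"first is {ch!r} at offset {offset}"
        )
    return seq
```

Failing at load time like this would turn a silently corrupted input into an immediate, attributable error. It only catches corruption that leaves the alphabet, though: a base flipped into another valid base would still need a checksum to detect.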