Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts. I'm wondering whether you have any wisdom on the following:
>
> - first, how would you go about setting a threshold for how high is an
> acceptable CE count? we by default are using the mce module, which by
> default polls at 1Hz. my thinking is that if we get overflow events
> (the multiple error bit is set), then it's too fast.
>
> - do you have or know of a good exerciser for testing ECC's? yes, I
> know about memtest86, but I'm more curious about a load that could be
> run under linux. my thinking is that ecc's are triggered by bad reads,
> so something which allocates all memory and then continually reads it
> would be best.
>
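As a rough illustration of the exerciser idea in the quoted message (allocate memory, then read it continually), here is a minimal Python sketch. The function names, the 0xA5 fill pattern, and the default sizes are all hypothetical choices for illustration, not part of any existing tool. Note that a corrected error is fixed by the hardware before the data reaches the program, so this loop can only generate the reads; the CE counts themselves still have to come from the kernel's machine-check reporting. The mismatch count here would only catch corruption that ECC failed to correct.

```python
import sys


def read_pass(buf: bytearray, pattern: int) -> int:
    """Read every byte of buf once; return how many bytes differ from pattern."""
    # bytearray.count() scans the whole buffer, so each call reads all of it.
    return len(buf) - buf.count(pattern)


def exercise(size_bytes: int, passes: int, pattern: int = 0xA5) -> int:
    """Fill a buffer with pattern, then re-read it `passes` times.

    Returns the total number of mismatched bytes seen across all passes.
    """
    # Building the buffer writes every byte, so every page is actually backed
    # by physical memory rather than left as an untouched mapping.
    buf = bytearray([pattern]) * size_bytes
    mismatches = 0
    for _ in range(passes):
        mismatches += read_pass(buf, pattern)
    return mismatches


if __name__ == "__main__":
    # Hypothetical defaults: 16 MiB and 4 passes, just to show the loop runs.
    # A real run would need a buffer close to the node's free RAM and far
    # larger than the CPU caches, otherwise the reads never reach the DIMMs.
    size = int(sys.argv[1]) if len(sys.argv) > 1 else 16 * 1024 * 1024
    passes = int(sys.argv[2]) if len(sys.argv) > 2 else 4
    print("mismatched bytes:", exercise(size, passes))
```

To load all cores, one copy per core could be started, each with its share of memory; how well that compares with an HPL run is an open question.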
Mark,

I find that just running a large HPL job across the cluster will find errors. It may take a couple of days, but it will. I've run breakin for days on end and not found any memory errors, but when I run a full-blown HPL job, I find memory errors right away (if "right away" means a couple of days).

Breakin runs xhpl on every core, but I'm not sure whether it's MPI-based or whether every core is running an independent job. Maybe the breakin developer(s) can chime in on how it stresses the RAM.

Hope that helps.

--
Prentice
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
