Re: [Beowulf] ECC exerciser/exorciser?

Prentice Bisbal Mon, 26 Jan 2009 08:09:55 -0800

Mark Hahn wrote:
> Hi all,
> we're having some trouble with nodes showing high ECC corrected error (CE)
> counts.  I'm wondering whether you have any wisdom on the following:
> 
> - first, how would you go about setting a threshold for how high is an
> acceptable CE count?  we by default are using the mce module, which by
> default polls at 1Hz.  my thinking is that if we get overflow events
> (the multiple error bit is set), then it's too fast.
> 
> - do you have or know of a good exerciser for testing ECC's?  yes, I
> know about memtest86, but I'm more curious about a load that could be
> run under linux.  my thinking is that ecc's are triggered by bad reads,
> so something which allocates all memory and then continually reads it
> would be best.
>
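
For illustration only, here is a minimal Python sketch of the kind of
"allocate, then keep re-reading" load described in the quoted message. It is
not a tool mentioned in this thread; the function names, the 0xA5 fill
pattern, and the default pass count are all assumptions made for the
example. Corrected errors are fixed in hardware before software sees the
data, so the mismatch count here should stay at zero on a healthy or merely
CE-prone node; the point is only to generate read traffic while the kernel's
CE counters are watched separately. The buffer has to be much larger than
the CPU caches, or the reads never reach DRAM.

```python
def make_buffer(size_bytes, pattern=0xA5):
    """Allocate size_bytes of memory and fill every byte with `pattern`.

    The fill writes to every page, so the memory is actually backed by RAM
    rather than left as untouched virtual address space.
    """
    return bytearray([pattern]) * size_bytes


def count_mismatches(buf, pattern=0xA5):
    """Read the whole buffer once; return how many bytes differ from `pattern`.

    bytearray.count() scans every byte, which forces a full read pass.
    """
    return len(buf) - buf.count(pattern)


def exercise(size_bytes, passes=10, pattern=0xA5):
    """Fill a buffer once, then re-read it `passes` times.

    Returns the total number of mismatched bytes seen across all passes.
    A nonzero result would mean corruption that ECC did not correct.
    """
    buf = make_buffer(size_bytes, pattern)
    total = 0
    for _ in range(passes):
        total += count_mismatches(buf, pattern)
    return total
```

Something like exercise(8 * 1024**3, passes=1000) would hold 8 GiB and
re-read it 1000 times; the size would need to be chosen per node, and one
copy per core or per NUMA node is probably closer to what is wanted.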


Mark,

I find that just running a large HPL job across the cluster will find errors.
It may take a couple of days, but it will. I've run breakin for days on
end and not found any memory errors, but when I run a full-blown HPL
job, I find memory errors right away (if right away = a couple of days).

Breakin runs xhpl on every core, but I'm not sure if it's MPI-based or
if every core is running an independent job. Maybe the breakin
developer(s) can chime in on how it stresses the RAM.

Hope that helps.

-- 
Prentice
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
