* Dual AMD Opteron DP270 (2.0 GHz)

which rev?

* Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
  (12 nodes have 8*2GB)

this dimm is 2-rank, I believe; corsair's datasheet is pretty vague on
that. assuming the 8 dimms are split evenly across the two sockets, each
cpu's memory controller sees 4 dimms x 2 ranks = 8 ranks. that's pushing
the limit: I'm sure it can be done in some cases, but it's not supported
by some revs of the opteron, and will always be pretty bleeding-edge.
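
to make the rank arithmetic explicit, here is a minimal sketch. the even split across sockets and the 2-ranks-per-dimm figure are assumptions (the latter is my reading of the part, not a datasheet value), and the function name is made up for illustration.

```python
# Rank-count sketch. Inputs are assumptions, not datasheet values.
def ranks_per_controller(total_dimms, sockets, ranks_per_dimm):
    """Ranks seen by each CPU's memory controller, assuming an even split."""
    if total_dimms % sockets:
        raise ValueError("DIMMs do not divide evenly across sockets")
    return (total_dimms // sockets) * ranks_per_dimm

# 8 DIMMs in a dual-socket node, dual-rank vs. single-rank parts
dual_rank = ranks_per_controller(total_dimms=8, sockets=2, ranks_per_dimm=2)
single_rank = ranks_per_controller(total_dimms=8, sockets=2, ranks_per_dimm=1)
print(dual_rank, single_rank)
```

if the dimms turned out to be single-rank, the same node would load each controller with half as many ranks.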

When a node crashes, we typically see a MCE + kernel panic. We get about

try running mcelog periodically; I bet you'll see lots of corrected ECC errors.
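
one rough way to do the periodic polling, as a sketch only: a small helper that runs a command a fixed number of times and keeps whatever it printed. the helper name, run count and interval are hypothetical; it assumes mcelog is installed and that the script runs with enough privilege to read the machine-check log.

```python
import subprocess
import time

def poll(cmd, runs, interval_s):
    """Run cmd `runs` times, sleeping interval_s between runs.

    Returns the captured stdout of each run, in order.
    """
    outputs = []
    for i in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        outputs.append(result.stdout)
        if i < runs - 1:
            time.sleep(interval_s)  # no sleep after the final run
    return outputs
```

something like poll(["mcelog"], runs=6, interval_s=600) would then sample a node for an hour; any run with non-empty output logged at least one machine-check event in that window.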

once and ran stable afterwards. Crashes seem to occur mostly when the system
is under heavy CPU (memory?) load.

yep.

Far too many correctable ECC errors are reported (on a subset of about 10-20
nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
the memory modules within one node. There seems to be a weak correlation
between the instabilities and the tendency to exhibit ECC errors.

IMO, the config is the problem, not the boards, cpus, dimms, etc.

It seems that the last BIOS upgrade has reduced the ECC error rate
somewhat.

the bios update probably made the timing a little looser. does the bios
let you tweak the memory timings? it would be interesting to know whether
derating the clock (-> PC2700) helps this situation more or less than
derating the latency.
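
to compare the two options on one scale, a small sketch that converts CAS latency from cycles to nanoseconds. the CL values used are illustrative assumptions, not the settings on these nodes; the only fact relied on is that DDR does two transfers per clock, so DDR400 runs a 200 MHz clock.

```python
def cas_ns(cl_cycles, data_rate_mts):
    """CAS latency in nanoseconds for a DDR part.

    data_rate_mts is the transfer rate (400 for DDR400/PC3200,
    333 for DDR333/PC2700); the clock is half of that.
    """
    clock_mhz = data_rate_mts / 2
    return cl_cycles / clock_mhz * 1000

# assumed settings, for illustration only
print(cas_ns(2.5, 400))  # baseline: DDR400 at CL2.5
print(cas_ns(3.0, 400))  # latency derated, clock unchanged
print(cas_ns(2.5, 333))  # clock derated, cycle count unchanged
```

under these assumed numbers both deratings land at roughly the same CAS latency in ns, but the clock derate also relaxes every other cycle-counted timing and the bus signalling, which is why comparing the two experimentally would be informative.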

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
