* Dual AMD Opteron DP270 (2.0 GHz)
which rev?
* Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB (12 nodes have 8*2GB)
this dimm is 2-rank, I believe; corsair's datasheet is pretty lame. with 4 dimms per cpu, that means each bank of memory is 4x2=8 ranks. that's definitely
pushing the limit; I'm sure it can be done in some cases, but it's not supported by some revs of the opteron, and will always be pretty bleeding-edge.
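the rank arithmetic can be sketched as below. this is a minimal sketch, assuming the 8 dimms are split evenly across the two sockets (one memory controller per opteron) and that each dimm really is 2-rank, which the datasheet does not confirm. the function and constant names are made up for illustration.

```python
# Rank count per memory controller, under the assumptions stated above.
DIMMS_PER_NODE = 8      # from the node spec: 8 x 1GB (or 8 x 2GB)
SOCKETS_PER_NODE = 2    # dual Opteron, one memory controller per socket
RANKS_PER_DIMM = 2      # assumption: the datasheet does not state this clearly


def ranks_per_controller(dimms_per_node, sockets, ranks_per_dimm):
    """Ranks each controller drives, assuming DIMMs are split evenly across sockets."""
    dimms_per_socket, leftover = divmod(dimms_per_node, sockets)
    if leftover:
        raise ValueError("DIMMs do not divide evenly across sockets")
    return dimms_per_socket * ranks_per_dimm


# 2-rank DIMMs (the suspected case) versus 1-rank DIMMs for comparison
print(ranks_per_controller(DIMMS_PER_NODE, SOCKETS_PER_NODE, RANKS_PER_DIMM))
print(ranks_per_controller(DIMMS_PER_NODE, SOCKETS_PER_NODE, 1))
```

if the dimms turn out to be 1-rank, the load per controller halves, so it is worth confirming the rank count before blaming anything else.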
When a node crashes, we typically see an MCE + kernel panic. We get about
try running mcelog periodically; I bet you see lots of corrected ECC errors.
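one way to do the periodic check is a small script run from cron on each node. this is only a sketch: it assumes that running `mcelog` with no arguments (as root) prints the decoded records to stdout, and that each record begins with a line like `MCE 0`. the output format differs between mcelog versions, so check what yours prints and adjust the pattern. `count_mce_records` and `poll_once` are hypothetical helper names.

```python
import re
import subprocess

# Assumption: each decoded record starts with a header line such as "MCE 0".
# Verify this against the output of the mcelog version on your nodes.
RECORD_HEADER = re.compile(r"^MCE \d+", re.MULTILINE)


def count_mce_records(text):
    """Count machine-check records in a chunk of decoded mcelog output."""
    return len(RECORD_HEADER.findall(text))


def poll_once(cmd=("mcelog",)):
    """Run mcelog once and return (raw_output, record_count).

    Needs root and an installed mcelog; raises FileNotFoundError otherwise.
    """
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    return result.stdout, count_mce_records(result.stdout)
```

calling `poll_once()` from cron and appending the count with the hostname and a timestamp to a shared log gives a per-node error history, which makes it easier to see whether the nodes that log the most records are also the ones that crash.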
once and ran stable afterwards. Crashes seem to occur mostly when the system is under heavy CPU (memory?) load.
yep.
Far too many correctable ECC errors are reported (on a subset of about 10-20 nodes). Sometimes the ECC errors disappeared after I cyclically interchanged the memory modules within one node. There seems to be a weak correlation between the instabilities and the tendency to exhibit ECC errors.
IMO, the config is the problem, not the boards, cpus, dimms, etc.
It seems that the last BIOS upgrade has reduced the ECC error rate somewhat.
probably made the timing a little looser. does the bios let you tweak? it would be interesting to know whether derating the clock (->pc2700) helps this situation more or less than derating the latency.
regards, mark hahn.
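to put numbers on the clock-versus-latency question: the sketch below compares CAS latency in nanoseconds and peak per-channel bandwidth for the two kinds of derating. the CAS values (2.5 and 3) are assumptions picked for illustration, not settings read from these boards. the clock rates are the standard ones for pc3200 (200 MHz) and pc2700 (166.67 MHz).

```python
def cycle_ns(clock_mhz):
    """Length of one memory clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def cas_ns(cas_cycles, clock_mhz):
    """CAS latency in nanoseconds for a given CL setting and clock."""
    return cas_cycles * cycle_ns(clock_mhz)


def peak_mb_per_s(clock_mhz, bus_bytes=8):
    """Peak bandwidth of one DDR channel: two transfers per clock, 64-bit bus."""
    return 2 * clock_mhz * bus_bytes


# (label, memory clock in MHz, CAS latency in cycles); CL values are hypothetical
CONFIGS = [
    ("pc3200 CL2.5 (assumed baseline)", 200.0, 2.5),
    ("pc3200 CL3 (latency derated)", 200.0, 3.0),
    ("pc2700 CL2.5 (clock derated)", 500.0 / 3.0, 2.5),
]

for label, mhz, cl in CONFIGS:
    print(f"{label}: cycle {cycle_ns(mhz):.1f} ns, "
          f"CAS {cas_ns(cl, mhz):.1f} ns, peak {peak_mb_per_s(mhz):.0f} MB/s")
```

under these assumptions both options land on the same 15 ns CAS time, but dropping the clock also stretches every other timing and costs roughly a sixth of the peak bandwidth. this only covers the arithmetic; which option actually lowers the ECC error rate on a heavily loaded bus still has to be measured.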