Gebhardt Thomas <[EMAIL PROTECTED]> writes:

> Hi,
>
>> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
>> found the system to be quite unstable. After BIOS updates and kernel
>> changes we still get random kernel panics when under load.
>
> Me too :-(
>
> We've got an 85-node Dual Opteron cluster. I've documented most of the
> crashes on
> http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
>
> Our equipment:
>
> * Dual AMD Opteron DP270 (2.0 GHz)
> * MB: TYAN S2882G3-DNR
> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
>   (12 nodes have 8*2GB)
> * PS: EMACS P1 6400P
> * HD: 250 GB SATA from Western Digital
>
> Dist: Debian/Sarge amd64
> Kernel: various, currently 2.6.15.3 from kernel.org
> BIOS: (most recent, as far as I know)
>
> When a node crashes, we typically see an MCE + kernel panic. We get about
> 2 crashes per week on our 85-node cluster. Some nodes seem to be more unstable
> than others, but we also see instabilities on nodes that had been stable so
> far. The instabilities are very hard to reproduce: we have nodes that crashed
> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.
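
Decoding the MCE by hand mostly comes down to checking a few flag bits in the logged status word. Here is a rough sketch, not a definitive decoder: bits 63..57 follow the architectural x86 MCi_STATUS layout, while the CECC/UECC positions (46/45) are my assumption from the AMD K8 layout and should be checked against the BKDG. The function name and the sample value are made up for illustration and do not come from a real log.

```python
# Flag bits in a 64-bit x86 MCi_STATUS word.
# Bits 63..57 are the architectural machine-check flags.
# CECC/UECC (46/45) are ASSUMED from the AMD K8 layout; verify in the BKDG.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    46: "CECC",   # correctable ECC error (assumed K8 position)
    45: "UECC",   # uncorrectable ECC error (assumed K8 position)
}


def decode_mci_status(status):
    """Return the names of the flag bits set in a status word, high bit first."""
    return [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


# Hypothetical status value built for illustration only.
sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 45)
print(decode_mci_status(sample))
```

If UC and UECC both show up in the status from one of the panics, that points at an uncorrectable memory error rather than a CPU or chipset fault.
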

I bet if you decode the MCE it will say uncorrectable ECC memory error.

> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors. memtest86
> runs fine on the memory modules.

memtest86 doesn't see correctable memory errors.

> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.
>
> We definitely have no temperature problem. As far as I can see (libsensor)
> the voltages are ok, too.

It sounds like you have a pile of correctable (soft?) memory errors
that occasionally become uncorrectable.

Good Luck,
Eric
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf