Hi,

> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.

Me too :-( We've got an 85-node dual Opteron cluster. I've documented most
of the crashes at
http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .

Our equipment:

* Dual AMD Opteron DP270 (2.0 GHz)
* MB: TYAN S2882G3-DNR
* Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung
  CM72SD1024RLP-3200/SB (12 nodes have 8*2GB)
* PS: EMACS P1 6400P
* HD: 250 GB SATA from Western Digital

Dist: Debian/Sarge amd64
Kernel: various, currently 2.6.15.3 from kernel.org
BIOS: (most recent, as far as I know)

When a node crashes, we typically see an MCE followed by a kernel panic.
We get about 2 crashes per week across the 85 nodes. Some nodes seem to be
more unstable than others, but we also see crashes on nodes that had been
stable until then. The instabilities are very hard to reproduce: we have
nodes that crashed once and have run stably ever since. Crashes seem to
occur mostly when the system is under heavy CPU (memory?) load.

Far too many correctable ECC errors are reported on a subset of about
10-20 nodes. Sometimes the ECC errors disappeared after I cyclically
swapped the memory modules within a node. There seems to be a weak
correlation between the instabilities and the tendency to show ECC errors.
memtest86 runs fine on the memory modules. The last BIOS upgrade seems to
have reduced the ECC error rate somewhat.

We definitely have no temperature problem. As far as I can see
(libsensor), the voltages are ok, too.

Cheers,
Thomas

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
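
For a sense of scale, the cluster-wide crash rate quoted above (about 2 crashes per week on 85 nodes) can be turned into per-node numbers. The following is only a rough sketch in Python. It assumes crashes are independent and spread evenly over all nodes, which the report only partly supports, since some nodes seem worse than others. The function names are made up for illustration.

```python
import math

NODES = 85               # cluster size from the report
CRASHES_PER_WEEK = 2.0   # approximate cluster-wide crash rate from the report


def per_node_rate(nodes, crashes_per_week):
    """Average crashes per node per week, assuming all nodes behave alike."""
    return crashes_per_week / nodes


def weeks_between_crashes(nodes, crashes_per_week):
    """Mean number of weeks between two crashes on a single node."""
    return nodes / crashes_per_week


def prob_crash_within(weeks, nodes, crashes_per_week):
    """Probability that one node crashes at least once within `weeks` weeks,
    assuming crashes follow a Poisson process (an assumption, not a finding)."""
    rate = per_node_rate(nodes, crashes_per_week)
    return 1.0 - math.exp(-rate * weeks)


if __name__ == "__main__":
    rate = per_node_rate(NODES, CRASHES_PER_WEEK)
    gap = weeks_between_crashes(NODES, CRASHES_PER_WEEK)
    print(f"per-node rate: {rate:.4f} crashes/week")
    print(f"mean time between crashes on one node: {gap:.1f} weeks")
    for weeks in (1, 4, 12):
        p = prob_crash_within(weeks, NODES, CRASHES_PER_WEEK)
        print(f"P(crash within {weeks:2d} weeks) = {p:.3f}")
```

Under these assumptions a single node averages 42.5 weeks between crashes. A node that crashes once and then runs for months without trouble is therefore roughly what one would expect, and it helps explain why the crashes are so hard to reproduce on any one machine.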