On Wednesday 27 September 2006 11:20, Gebhardt Thomas wrote:
> Hi,
>
> > We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> > found the system to be quite unstable. After BIOS updates and kernel
> > changes we still get random kernel panics when under load.
>
> Me too :-(
>
> We've got a 85 Node Dual Opteron Cluster. I've documented most of the
> crashes on
> http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .

Gosh, good that we didn't buy our cluster from your vendor; they made us an offer, too. We bought from Transtec. There were also some memory-related problems during the first few weeks, but all of those nodes were replaced without trouble, and ever since everything has been running almost perfectly. (A small exception in the past was the SIL3114 SATA controller: at least the driver in 2.6.11 caused some problems under heavy load, but this seems to be fixed in newer kernel versions.) It's only a 16-node cluster (Tyan S2881 boards with 4GB and 8GB memory), but given your failure numbers, we should also have seen many crashes during the last 2 years.

In the past our main fileserver was also a Tyan S2882 system. It would sometimes lock up completely at random, even without any load and without any log messages (monitored over a serial cable). Sometimes it ran stable for months, sometimes it crashed once a week - we had to replace the entire system, since it was not suitable as a high-availability node. We additionally monitor the memory using bluesmoke - it has never logged any problems.

-- 
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf