My dual Opteron dual-core machine is extremely stable, except when I run
one type of software, namely software that multiplies non-stop. I do that
under Ubuntu. That really seems to be a worst-case path in the dual-core
Opteron chips. After it has been multiplying non-stop for a number of days,
I get a complete crash of the system. With any other software, under
Windows (x64) or Ubuntu Linux, it is extremely stable for months.

Is it possible that some of the crashes you had were caused by non-stop
multiplication?

Highly optimized software will of course keep the instruction overhead to
a minimum when doing matrix calculations or the like, and will basically
be busy multiplying. In my case it was big-number multiplication using
only integer multiplies.
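
A minimal sketch of this kind of load, in Python. This is not the original
program, which is not shown in this thread; the function name
multiply_stress, the 4096-bit operand size and the modulus are all
assumptions made for illustration. It multiplies random big integers in a
loop and counts products that fail a residue check, so a miscomputation
would show up as a nonzero count rather than only as a crash.

```python
import random


def multiply_stress(iterations, bits=4096, seed=0, modulus=(1 << 61) - 1):
    """Multiply random big integers and count products that fail a residue check.

    Hypothetical helper: the check compares (a * b) mod m against
    ((a mod m) * (b mod m)) mod m, which must agree on healthy hardware.
    """
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(iterations):
        # Force the low bit so neither operand is zero.
        a = rng.getrandbits(bits) | 1
        b = rng.getrandbits(bits) | 1
        product = a * b
        expected_residue = ((a % modulus) * (b % modulus)) % modulus
        if product % modulus != expected_residue:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # A short run; a real soak test would loop for days.
    print("mismatches:", multiply_stress(100_000))
```

The interpreter overhead means this loads the integer multiplier far less
densely than hand-tuned code would, and both sides of the check run on the
same CPU, so it is only a cheap consistency test, not proof of correctness.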
Vincent
----- Original Message -----
From: "Gebhardt Thomas" <[EMAIL PROTECTED]>
To: <beowulf@beowulf.org>
Sent: Wednesday, September 27, 2006 10:20 AM
Subject: Re: [Beowulf] Tyan S2882
Hi,
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.
Me too :-(
We've got an 85-node dual Opteron cluster. I've documented most of the
crashes at
http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
Our equipment:
* Dual AMD Opteron DP270 (2.0 GHz)
* MB: TYAN S2882G3-DNR
* Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
  (12 nodes have 8*2GB)
* PS: EMACS P1 6400P
* HD: 250 GB SATA from Western Digital
Dist: Debian/Sarge amd64
Kernel: various, currently 2.6.15.3 from kernel.org
BIOS: (most recent, as far as I know)
When a node crashes, we typically see an MCE plus a kernel panic. We get
about 2 crashes per week on our 85-node cluster. Some nodes seem to be
more unstable than others, but we also see instabilities on nodes that had
been stable so far. The instabilities are very hard to reproduce: we have
nodes that crashed once and ran stably afterwards. Crashes seem to occur
mostly when the system is under heavy CPU (memory?) load.
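
For scale, the figures above can be turned into a per-node rate. The
following is a small sketch in Python, assuming crashes are independent
and spread evenly across nodes. That is a simplification, since some nodes
appear more unstable than others. The helper prob_no_crash is hypothetical
and only illustrates the arithmetic.

```python
import math

NODES = 85               # cluster size, from the description above
CRASHES_PER_WEEK = 2.0   # observed cluster-wide crash rate

# Average number of weeks a single node runs between crashes.
mean_weeks_between_crashes = NODES / CRASHES_PER_WEEK

# Crashes per node per week.
rate_per_node = CRASHES_PER_WEEK / NODES


def prob_no_crash(weeks, rate=rate_per_node):
    """Probability that one node sees no crash in `weeks`, under a Poisson model."""
    return math.exp(-rate * weeks)


if __name__ == "__main__":
    print("mean weeks between crashes per node:", mean_weeks_between_crashes)
    print("P(no crash in 26 weeks):", round(prob_no_crash(26), 2))
```

Under these assumptions a single node crashes about once every 42.5 weeks
on average, and has roughly a 0.54 chance of running half a year without a
crash. A node that crashed once and then ran stably is therefore what such
a low rate would produce anyway, which fits how hard the crashes are to
reproduce.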
Far too many correctable ECC errors are reported (on a subset of about
10-20 nodes). Sometimes the ECC errors disappeared after I cyclically
interchanged the memory modules within one node. There seems to be a weak
correlation between the instabilities and the tendency to exhibit ECC
errors. memtest86 runs fine on the memory modules.
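
One way to track correctable ECC errors per node is sketched below in
Python. It assumes a Linux kernel with the EDAC driver loaded, which
exposes per-memory-controller counters under /sys/devices/system/edac/mc.
The message above does not say how the errors were collected, and the
kernel it mentions may predate that interface, so treat the path and the
helper name read_ecc_counts as assumptions.

```python
from pathlib import Path


def read_ecc_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"correctable": n, "uncorrectable": n}} from EDAC sysfs.

    Returns an empty dict if the EDAC directory is missing (driver not loaded).
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"correctable": ce, "uncorrectable": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_ecc_counts().items():
        print(name, "CE:", c["correctable"], "UE:", c["uncorrectable"])
```

Logging these counters at intervals on every node, together with crash
times, would give data to test the weak correlation mentioned above.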
It seems that the last BIOS upgrade has reduced the ECC error rate
somewhat. We definitely have no temperature problem. As far as I can see
(libsensor) the voltages are OK, too.
Cheers, Thomas
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf