In message from Joe Landman <[EMAIL PROTECTED]> (Sat, 28 Jun 2008 14:48:02 -0400):
> This is possible, depending upon the compiler used. Though I have to
> admit that I find it odd that it would be the case within the Opteron
> family and not between Opteron and Xeon.
> Intel compilers used to (haven't checked 10.1) switch between fast
> (SSE*) and slow (x87 FP) paths as a function of a processor version
> string. If this code was built with an old Intel compiler, it is
> possible that the code paths are different, though as noted, I
> would find that surprising if this were the case within the Opteron
> family.
Well, I thought about the possible absence of SSE in the binary
Gaussian 03 Rev. C02 version I used. But even if x87 code really was
generated by pgf77, why would this x87-based code perform so well on
the Opteron 246 compared with an Opteron 2350 core? I ran the same
Gaussian binaries on both CPUs!
> Modern PGI compilers (suggested default for Gaussian-03 last I
> checked) have the ability to do this as well, though I don't know how
> they implement it (capability testing hopefully?)
> Out of curiosity, how does stream run on both systems?
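The difference between dispatching on a processor identification string and testing capabilities, as raised in the quoted text above, can be sketched roughly as follows. This is only an illustration: the function names, the "sse"/"x87" labels and the sample text are hypothetical, the vendor check is a simplified stand-in for a version-string test, and the code says nothing about how the Intel or PGI runtimes actually decide.

```python
# Sketch only: all names and the sample text below are hypothetical.

def parse_cpuinfo(text):
    """Return the fields of the first processor entry as a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pick_path_by_vendor(fields):
    """Fragile: trusts only the identification string."""
    return "sse" if fields.get("vendor_id") == "GenuineIntel" else "x87"


def pick_path_by_capability(fields):
    """More robust: checks the feature flags the CPU advertises."""
    flags = fields.get("flags", "").split()
    return "sse" if "sse2" in flags else "x87"


# Shortened, made-up excerpt in the Linux /proc/cpuinfo layout.
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
flags\t\t: fpu vme de pse sse sse2 ht syscall

processor\t: 1
vendor_id\t: AuthenticAMD
"""

fields = parse_cpuinfo(SAMPLE)
print(pick_path_by_vendor(fields), pick_path_by_capability(fields))
```

On a Linux machine the same functions could be fed the contents of /proc/cpuinfo instead of SAMPLE. The point is only that a vendor-based test would send an SSE2-capable Opteron down the slow path, while a capability test would not.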
I ran stream on Opteron 242 and 244 a few years ago. The scalability
and the throughput itself were OK. Recently I ran stream on my Opteron
2350-based dual-socket server. Thanks to the faster DDR2-667 memory I
obtained higher throughput. In particular, I reproduced the 8-core
result presented in McCalpin's table (submitted by AMD), and some data
presented earlier on our Beowulf mailing list.
(BTW, there is one bad thing about stream on this server, for which
the corresponding data are absent from McCalpin's table: the
throughput scales well from 1 to 2 OpenMP threads and is good for 8
threads, but the throughput for 4 threads is about the same as for 2
threads. The reason is, IMHO, that for 8 threads the kernel allocates
RAM on both NUMA nodes, whereas for 4 threads the allocated RAM is
placed on one node, and the 4 threads compete badly for memory
access.)
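The explanation above can be turned into a toy model: aggregate bandwidth is limited either by what the threads ask for or by what the memory controllers of the nodes holding the data can deliver. The per-thread and per-node figures below are invented for illustration, not measurements from this server, and the function name is hypothetical.

```python
# Toy model only: per_thread and per_node are hypothetical figures in GB/s.

def stream_bandwidth(threads, nodes_with_data, per_thread=3.0, per_node=6.0):
    """Estimate aggregate bandwidth as min(thread demand, node supply)."""
    demand = threads * per_thread          # what the threads could consume
    supply = nodes_with_data * per_node    # what the memory controllers deliver
    return min(demand, supply)


# Placement assumed in the text: data on one node up to 4 threads,
# on both nodes for 8 threads.
for threads, nodes in [(1, 1), (2, 1), (4, 1), (8, 2)]:
    print(threads, "threads,", nodes, "node(s):", stream_bandwidth(threads, nodes))
```

With these assumed numbers the model shows the same shape as the observation: scaling from 1 to 2 threads, a plateau at 4 threads while the data sit on one node, and a jump at 8 threads once both nodes serve memory. In the same model, 4 threads with data spread over both nodes would no longer be capped by a single node.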
Note that Gaussian-03 was slow on the Opteron 2350 core in a
sequential run, where the memory of the Opteron 2350 system can only
be an advantage over the Opteron 246. I didn't run stream on the
Opteron 246, but that much is clear to me.
> Also, it is
> possible, with a larger cache, that you might be running into some
> odd cache effects (tlb/page thrashing). But DFTs are usually "small"
> and thus "sensitive" to cache size.
> You might be able to instrument the run within a papi wrapper, and
> see if you observe a large number of cache/tlb flushes for some
> reason.
> On a related note: are you using a stepping before B3 of 2350? That
> could impact performance, if you have the patch in place or have the
> tlb/cache turned off in bios (some MB makers created a patch to do
> this).
Gaussian-03 fails in link302 on Barcelona B2 because of this erratum.
I use stepping B3.
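For anyone who wants to check which stepping a node has, a minimal sketch that reads the family, model and stepping fields from text in the Linux /proc/cpuinfo layout is given here. The mapping from those numbers to the names B2 and B3 is an assumption that should be verified against AMD's revision guide; the function name and the sample text are hypothetical.

```python
import re

# Assumed mapping of (cpu family, model, stepping) to revision names.
# Verify against AMD's revision guide before relying on it.
ASSUMED_REVISIONS = {(16, 2, 2): "B2", (16, 2, 3): "B3"}


def cpu_revision(cpuinfo_text):
    """Return the assumed revision name of the first CPU, or 'unknown'."""
    def field(name):
        # Match a line such as "stepping<TAB>: 3"; "model name" is not matched.
        pattern = rf"^{re.escape(name)}\s*:\s*(\d+)\s*$"
        match = re.search(pattern, cpuinfo_text, re.MULTILINE)
        return int(match.group(1)) if match else None

    key = (field("cpu family"), field("model"), field("stepping"))
    return ASSUMED_REVISIONS.get(key, "unknown")


# Made-up excerpt for illustration, not copied from a real machine.
SAMPLE = """processor\t: 0
vendor_id\t: AuthenticAMD
cpu family\t: 16
model\t\t: 2
model name\t: Quad-Core AMD Opteron(tm) Processor 2350
stepping\t: 3
"""

print(cpu_revision(SAMPLE))
```

On a real node the text would come from /proc/cpuinfo rather than SAMPLE.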
Yours
Mikhail
> Joe
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: [EMAIL PROTECTED]
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615