In message from Joe Landman <[EMAIL PROTECTED]> (Sat, 28 Jun 2008 14:48:02 -0400):
This is possible, depending upon the compiler used. Though I have to admit that I find it odd that it would be the case within the Opteron family and not between Opteron and Xeon.

Intel compilers used to (I haven't checked 10.1) switch between fast (SSE*) and slow (x87 FP) paths as a function of a processor version string. If this code was built with an old Intel compiler, it is possible that the code paths differ, though as noted, I would find that surprising within the Opteron family.

Well, I thought about the (absence of) use of SSE in the binary Gaussian 03 Rev. C02 version I used, but even if x87 code really was generated by pgf77, why does this x87-based code give such "high" performance on the Opteron 246 in comparison with an Opteron 2350 core? I ran the same Gaussian binaries on both CPUs!

Modern PGI compilers (suggested default for Gaussian-03 last I checked) have the ability to do this as well, though I don't know how they implement it (capability testing hopefully?)
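To make the difference between the two dispatch styles concrete, here is a minimal sketch in Python. It is not how the Intel or PGI runtimes actually implement dispatch; it only contrasts a decision keyed on the vendor string with one keyed on the feature flags that Linux reports in /proc/cpuinfo. The function names and the sample text are hypothetical.

```python
def parse_flags(cpuinfo_text):
    """Return (vendor_id, set of feature flags) for the first CPU listed."""
    vendor, flags = None, set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


def pick_by_vendor(vendor):
    # Vendor-string dispatch: anything that is not "GenuineIntel" gets the
    # slow path, even if the CPU supports SSE2.
    return "sse2" if vendor == "GenuineIntel" else "x87"


def pick_by_capability(flags):
    # Capability dispatch: look only at what the CPU says it can do.
    return "sse2" if "sse2" in flags else "x87"


# Hypothetical, heavily shortened /proc/cpuinfo excerpt for an AMD CPU.
SAMPLE = """\
processor\t: 0
vendor_id\t: AuthenticAMD
flags\t\t: fpu vme sse sse2 syscall nx lm
"""

vendor, flags = parse_flags(SAMPLE)
print(pick_by_vendor(vendor), pick_by_capability(flags))  # prints: x87 sse2
```

On a real machine the text would come from reading /proc/cpuinfo. Note that either style would pick the same path on every Opteron, which is why a dispatch effect inside the Opteron family would be surprising.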

Out of curiosity, how does STREAM run on both systems?

I ran STREAM on Opteron 242 and 244 a few years ago. The scalability and the throughput itself were OK. I have now run STREAM on my dual-socket Opteron 2350-based server. As expected with the faster DDR2-667, I obtained higher throughput. In particular, I reproduced the 8-core result presented in McCalpin's table (sent from AMD), and some data presented earlier on our Beowulf mailing list. (BTW, there is one bad thing for STREAM on this server, and the corresponding data are absent from McCalpin's table: the throughput scales well from 1 to 2 OpenMP threads and gives a good result for 8 threads, but the throughput for 4 threads is about the same as for 2 threads. The reason is, IMHO, that for 8 threads the kernel allocates RAM on both nodes, but for 4 threads the allocated RAM is placed on one node, and the 4 threads compete badly for memory access.)

As for Gaussian-03 being slow on an Opteron 2350 core: in a sequential run, the Opteron 2350's RAM gives it only advantages in comparison with the Opteron 246. I didn't run STREAM on the Opteron 246, but that much is clear to me.
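If the 4-thread result really is a placement problem, one way to test the idea is to spread the threads over both NUMA nodes by hand. The following Python sketch builds such a CPU list. It assumes the per-node CPU lists are available in the Linux sysfs format (as in /sys/devices/system/node/nodeN/cpulist); the helper names and the 0-3 / 4-7 numbering are hypothetical and will differ between machines.

```python
def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def spread_cpus(node_cpus, nthreads):
    """Pick nthreads CPUs, taking one CPU from each node in turn."""
    queues = [list(cpus) for _, cpus in sorted(node_cpus.items())]
    picked = []
    while len(picked) < nthreads and any(queues):
        for queue in queues:
            if queue and len(picked) < nthreads:
                picked.append(queue.pop(0))
    return picked


# Hypothetical layout: two quad-core sockets, one NUMA node each.
nodes = {0: parse_cpulist("0-3"), 1: parse_cpulist("4-7")}
print(",".join(str(c) for c in spread_cpus(nodes, 4)))  # prints: 0,4,1,5
```

The resulting list could be passed to something like `taskset -c 0,4,1,5 ./stream`. Since Linux by default places a page on the node of the thread that first touches it, a run pinned this way should draw on both memory controllers; if the 4-thread throughput then roughly doubles, the one-node explanation is probably right.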

Also, it is possible, with a larger cache, that you might be running into some odd cache effects (tlb/page thrashing). But DFTs are usually "small" and thus "sensitive" to cache size.

You might be able to instrument the run within a papi wrapper, and see if you observe a large number of cache/tlb flushes for some reason.

On a related note: are you using a stepping before B3 of 2350? That could impact performance, if you have the patch in place or have the tlb/cache turned off in bios (some MB makers created a patch to do this).
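For reference, the stepping can be read from /proc/cpuinfo on Linux. The sketch below is in Python; the sample text is hypothetical, and the mapping from the numeric stepping to the B2/B3 names (family 16, model 2, stepping 2 taken as B2 and stepping 3 as B3) is an assumption that should be checked against AMD's revision guide.

```python
def cpu_revision(cpuinfo_text):
    """Return (family, model, stepping) of the first CPU listed, as ints."""
    wanted = {"cpu family": None, "model": None, "stepping": None}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # "model name" is a different key and is skipped by the lookup.
        if sep and key in wanted and wanted[key] is None:
            wanted[key] = int(value.strip())
    return wanted["cpu family"], wanted["model"], wanted["stepping"]


def is_pre_b3(family, model, stepping):
    # ASSUMPTION: family 16 (10h), model 2, stepping 2 is B2 and stepping 3 is B3.
    return family == 16 and model == 2 and stepping < 3


# Hypothetical, shortened /proc/cpuinfo excerpt.
SAMPLE = """\
processor\t: 0
cpu family\t: 16
model\t\t: 2
model name\t: Quad-Core AMD Opteron(tm) Processor 2350
stepping\t: 3
"""

print(cpu_revision(SAMPLE), is_pre_b3(*cpu_revision(SAMPLE)))
```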

Gaussian-03 fails in link302 on Barcelona B2 because of this error. I use stepping B3.
Yours
Mikhail


Joe


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

