With 8 processes, AMD-GNU is better than the others. Parallel 8-core job results:

                AMD-GNU      AMD-Pathscale   AMD-Intel10   Intel-Intel10
  8-core run    26.880 sec   33.746 sec      27.979 sec    30.371 sec

Thank you,
Sangamesh
Consultant, HPC

On Thu, Sep 18, 2008 at 2:08 PM, Bill Broadley <[EMAIL PROTECTED]> wrote:
> Sangamesh B wrote:
>> Hi Bill,
>>
>> I'm sorry. I composed the mail in the proper format, but it's not showing
>> up the way I wrote it.
>>
>> See, I've tested with three compilers only for AMD. For Intel, only Intel
>> ifort.
>
> Ah, so with 8 threads what was the Intel time? The AMD-GNU, AMD-Pathscale,
> and AMD-Intel times?
>
>> Also, the two results (OUTPUT file and time command) are there only for
>> some runs, not for all; I missed taking the time-command results for the
>> rest.
>>
>> I hope this helps,
>>
>> Thanks,
>> Sangamesh
>>
>> On Thu, Sep 18, 2008 at 11:59 AM, Bill Broadley <[EMAIL PROTECTED]> wrote:
>>
>>> I'm trying to understand your post, but failed. Can you post a link,
>>> publish a Google spreadsheet, or format it differently?
>>>
>>> You tried 3 compilers on both machines? Which times are for which
>>> CPU/compiler combos? I tried to match up the columns and rows, but
>>> sometimes there were 3 columns and sometimes 4. None of them lines up
>>> nicely under CPU or compiler headings.
>>>
>>> Mine (and many other folks') mail readers show email in ASCII/text, so a
>>> table should look like:
>>>
>>> Serial run:
>>>                  Compiler A   Compiler B   Compiler C
>>> =====================================================
>>> Intel 2.3 GHz        30           29           31
>>> AMD 2.3 GHz          28           32           32
>>>
>>> Note that I used spaces and not tabs so it appears clear to everyone
>>> regardless of their mail client, ascii/text, html, tab settings, etc.
>>>
>>> I've been testing these machines quite a bit lately and have been quite
>>> impressed with the Barcelona memory systems, for instance:
>>>
>>> http://cse.ucdavis.edu/bill/fat-node-numa3.png
>>>
>>> Sangamesh B wrote:
>>>
>>>> The scientific application used is DL-Poly 2.17.
>>>>
>>>> Tested with the Pathscale and Intel compilers on the AMD Opteron quad
>>>> core. The time figures mentioned were taken from the DL-Poly OUTPUT
>>>> file. I also used the time command. Here are the results:
>>>>
>>>>                     -------- AMD 2.3 GHz (32 GB RAM) --------    Intel 2.33 GHz (32 GB RAM)
>>>>                     GNU gfortran   Pathscale     Intel 10 ifort  Intel 10 ifort
>>>> ==============================================================================
>>>> 1. Serial
>>>>    OUTPUT file      147.719 sec    158.158 sec   135.729 sec     73.952 sec
>>>>    time command     2m27.791s      2m38.268s     -               1m13.972s
>>>>
>>>> 2. Parallel, 4 cores
>>>>    OUTPUT file       39.798 sec     44.717 sec    36.962 sec     32.317 sec
>>>>    time command     0m41.527s      0m46.571s     -               0m36.218s
>>>>
>>>> 3. Parallel, 8 cores
>>>>    OUTPUT file       26.880 sec     33.746 sec    27.979 sec     30.371 sec
>>>>    time command     (only one timing taken: 0m30.171s)
>>>>
>>>> The optimization flags used:
>>>>
>>>> Intel ifort 10:  -O3 -axW -funroll-loops  (I don't remember the exact
>>>>                  flag; something similar to loop unrolling)
>>>> Pathscale:       -O3 -OPT:Ofast -ffast-math -fno-math-errno
>>>> GNU gfortran:    -O3 -ffast-math -funroll-all-loops -ftree-vectorize
>>>>
>>>> I'll try to use this for further runs: http://directory.fsf.org/project/time/
>>>>
>>>> Thanks,
>>>> Sangamesh
>>>>
>>>> On Thu, Sep 18, 2008 at 6:07 AM, Vincent Diepeveen <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> How does all this change when you use a PGO-optimized executable on
>>>>> both sides?
>>>>>
>>>>> Vincent
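For context on the PGO question: profile-guided optimization is a two-pass
build, train then recompile. The lines below are only a sketch of what that
would look like with the GNU and Intel compilers used in this thread; the
source file name "dlpoly.f90" and the bare training run are placeholders, not
the actual DL-Poly build, and the PathScale equivalents are omitted.

  # GCC / gfortran: instrument, run a representative case, rebuild with the profile
  gfortran -O3 -fprofile-generate dlpoly.f90 -o dlpoly
  ./dlpoly                      # training run writes *.gcda profile files
  gfortran -O3 -fprofile-use dlpoly.f90 -o dlpoly

  # Intel ifort 10: same two-pass idea with -prof-gen / -prof-use
  ifort -O3 -prof-gen dlpoly.f90 -o dlpoly
  ./dlpoly                      # training run writes *.dyn profile files
  ifort -O3 -prof-use dlpoly.f90 -o dlpoly

The training input should resemble the benchmark case, otherwise the profile
steers the optimizer the wrong way.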
>>>>> On Sep 18, 2008, at 2:34 AM, Eric Thibodeau wrote:
>>>>>
>>>>>> Vincent Diepeveen wrote:
>>>>>>
>>>>>>> Nah,
>>>>>>>
>>>>>>> I guess he's referring to the fact that it sometimes uses
>>>>>>> single-precision floating point to get something done instead of
>>>>>>> double precision, and that it sometimes tends to keep stuff in
>>>>>>> registers.
>>>>>>>
>>>>>>> That isn't necessarily a problem, but if I remember well, the
>>>>>>> floating-point state could get wiped out when switching to SSE2.
>>>>>>> Sometimes you lose your FPU register set in that case.
>>>>>>>
>>>>>>> The main problem is that so many dangerous optimizations are possible
>>>>>>> to speed up test sets, because floating point is, from the hardware's
>>>>>>> point of view, really slow to do in itself.
>>>>>>>
>>>>>>> Yet in general, in the last generations of Intel compilers that has
>>>>>>> improved a lot.
>>>>>>
>>>>>> Well, running the same code, here is the result discrepancy I got:
>>>>>>
>>>>>> FLOPS my code has to do: 7,975,847,125,000 (~8 Tflop)... it takes 15
>>>>>> minutes on an 8 x 2-core Opteron with 32 GB of RAM (thank you, OpenMP ;)
>>>>>>
>>>>>> The running times (I ran it a _few_ times, but not the statistical
>>>>>> minimum of 30):
>>>>>>
>>>>>> ICC -> runtime == 689.249  ; summed error == 1651.78
>>>>>> GCC -> runtime == 1134.404 ; summed error == 0.883501
>>>>>>
>>>>>> Compiler flags:
>>>>>>
>>>>>> icc -xW -openmp -O3 vqOpenMP.c -o vqOpenMP
>>>>>> gcc -lm -fopenmp -O3 -march=native vqOpenMP.c -o vqOpenMP_GCC
>>>>>>
>>>>>> No trickery, no smoke and mirrors ;) Just a _huge_ kick-ass k-means
>>>>>> parallelized with OpenMP (thank gawd, otherwise it takes hours to run)
>>>>>> and a rather big database of 1.4 GB.
>>>>>>
>>>>>> ... So this is what I meant by floating-point errors. Yes, the runtime
>>>>>> was almost halved by ICC (and this is on an *Opteron*-based system, a
>>>>>> Tyan VX50). The running time wasn't actually what I was looking for,
>>>>>> though; it was the precision skew, and that's where I fell off my chair.
>>>>>>
>>>>>> For those itching for a few more specs:
>>>>>>
>>>>>> [EMAIL PROTECTED] ~ $ icc -V
>>>>>> Intel(R) C Compiler for applications running on Intel(R) 64, Version 10.1
>>>>>> Build 20080602
>>>>>> Copyright (C) 1985-2008 Intel Corporation.  All rights reserved.
>>>>>> FOR NON-COMMERCIAL USE ONLY
>>>>>>
>>>>>> [EMAIL PROTECTED] ~ $ gcc -v
>>>>>> Using built-in specs.
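Eric's "summed error" gap between the icc and gcc binaries is the classic
symptom of a floating-point reduction being reordered or kept in lower
precision. The toy program below is not taken from vqOpenMP.c; it just
accumulates an arbitrary constant N times to show how a naive
single-precision sum drifts while a Kahan-compensated sum stays close to the
reference, and why value-unsafe modes (icc's default fp model, gcc's
-ffast-math) can move the result.

  /*
   * Toy illustration only -- NOT Eric's vqOpenMP.c.  Accumulating many small
   * terms in single precision loses low-order bits that a Kahan (compensated)
   * sum keeps.  A compiler allowed to reassociate the reduction changes the
   * rounding in exactly this way, and may even delete the Kahan correction
   * below, since it is algebraically zero.
   */
  #include <stdio.h>

  #define N 10000000L             /* number of terms; arbitrary for the demo */

  int main(void)
  {
      const float term = 0.1f;    /* 0.1 is not exactly representable in binary */
      double exact = (double)N * (double)term;   /* reference sum in double */

      /* 1. Naive single-precision accumulation: error grows with the sum. */
      float naive = 0.0f;
      for (long i = 0; i < N; i++)
          naive += term;

      /* 2. Kahan compensated summation: carries the bits each add just lost. */
      float kahan = 0.0f, c = 0.0f;
      for (long i = 0; i < N; i++) {
          float y = term - c;
          float t = kahan + y;
          c = (t - kahan) - y;    /* the part of y that did not make it into t */
          kahan = t;
      }

      printf("reference : %.3f\n", exact);
      printf("naive sum : %.3f  (error %.3f)\n", naive, naive - exact);
      printf("kahan sum : %.3f  (error %.3f)\n", kahan, kahan - exact);
      return 0;
  }

Built with plain -O2 both loops behave as written; with something like
gcc -O3 -ffast-math the compensated loop may legally collapse back to the
naive one, which is the kind of value-unsafe transformation being discussed
here.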
>>>>>> Target: x86_64-pc-linux-gnu
>>>>>> Configured with:
>>>>>> /dev/shm/portage/sys-devel/gcc-4.3.1-r1/work/gcc-4.3.1/configure
>>>>>> --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.3.1
>>>>>> --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include
>>>>>> --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1
>>>>>> --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/man
>>>>>> --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.3.1/info
>>>>>> --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.1/include/g++-v4
>>>>>> --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
>>>>>> --enable-nls --without-included-gettext --with-system-zlib
>>>>>> --disable-checking --disable-werror --enable-secureplt --enable-multilib
>>>>>> --enable-libmudflap --disable-libssp --enable-cld --disable-libgcj
>>>>>> --enable-languages=c,c++,treelang,fortran --enable-shared
>>>>>> --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
>>>>>> --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.3.1-r1 p1.1'
>>>>>> Thread model: posix
>>>>>> gcc version 4.3.1 (Gentoo 4.3.1-r1 p1.1)
>>>>>
>>>>>>> Vincent
>>>>>>>
>>>>>>> On Sep 17, 2008, at 10:25 PM, Greg Lindahl wrote:
>>>>>>>
>>>>>>>> On Wed, Sep 17, 2008 at 03:43:36PM -0400, Eric Thibodeau wrote:
>>>>>>>>
>>>>>>>>> Also, note that I've had issues with icc generating really fast but
>>>>>>>>> inaccurate code (the fp model is not IEEE *by default*; I am sure
>>>>>>>>> _everyone_ knows this and I am stating the obvious here).
>>>>>>>>
>>>>>>>> All modern, high-performance compilers default that way. It's
>>>>>>>> certainly the case that it sometimes goes more horribly wrong than
>>>>>>>> necessary, but I wouldn't ding icc for this default. Compare results
>>>>>>>> with IEEE mode.
>>>>>>>>
>>>>>>>> -- greg
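Following Greg's suggestion, an IEEE-mode comparison means rebuilding with
value-safe floating point. As a rough sketch only: newer icc/ifort releases
accept -fp-model precise (older ones use -mp), gcc/gfortran are value-safe as
long as -ffast-math is left out, and PathScale's -OPT:Ofast similarly relaxes
roundoff and would need to be dropped for a strict comparison. For Eric's
compile lines that would look roughly like:

  icc -xW -openmp -O3 -fp-model precise vqOpenMP.c -o vqOpenMP_ieee
  gcc -fopenmp -O3 -march=native vqOpenMP.c -o vqOpenMP_GCC -lm   # already value-safe without -ffast-math

If the summed errors then agree, the remaining runtime gap is what the unsafe
optimizations were actually buying.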
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
