Hi Joe,
Can that 9% difference be due to the Intel capability to overclock one core and turn the others off?
Or does this Intel feature require a manual switch somewhere?
Thank you,
Kevin


Joe Landman wrote:
Hi folks:

Thought you might like to see this. I rewrote the interior loop of our Riemann Zeta Function (rzf) example for SSE2, and ran it on a Nehalem and on a Shanghai. This code is compute intensive. The inner loop, which had been written like this (with some small hand optimization, loop unrolling, etc.):

    l[0]=(double)(inf-1 - 0);
    l[1]=(double)(inf-1 - 1);
    l[2]=(double)(inf-1 - 2);
    l[3]=(double)(inf-1 - 3);
    p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
    for(k=start_index;k>end_index;k-=unroll)
       {
          d_pow[0] = l[0];
          d_pow[1] = l[1];
          d_pow[2] = l[2];
          d_pow[3] = l[3];

          for (m=n;m>1;m--)
           {
             d_pow[0] *=  l[0];
             d_pow[1] *=  l[1];
             d_pow[2] *=  l[2];
             d_pow[3] *=  l[3];
           }
          p_sum[0] += one/d_pow[0];
          p_sum[1] += one/d_pow[1];
          p_sum[2] += one/d_pow[2];
          p_sum[3] += one/d_pow[3];

          l[0]-=four;
          l[1]-=four;
          l[2]-=four;
          l[3]-=four;
       }
    sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

has been rewritten as

    __m128d __P_SUM = _mm_set_pd1(0.0);           // __P_SUM[0 .. VLEN-1] = 0
    __m128d __ONE   = _mm_set_pd1(1.);            // __ONE[0 .. VLEN-1]   = 1
    __m128d __DEC   = _mm_set_pd1((double)VLEN);  // __DEC[0 .. VLEN-1]   = VLEN
    __m128d __L     = _mm_load_pd(l);             // load l[0], l[1]

    for(k=start_index;k>end_index;k-=unroll)
       {
          __D_POW       = __L;

          for (m=n;m>1;m--)
           {
             __D_POW    = _mm_mul_pd(__D_POW, __L);
           }

          __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

          __L           = _mm_sub_pd(__L, __DEC);
       }

    _mm_store_pd(p_sum,__P_SUM);

    for(k=0;k<VLEN;k++)
     {
       sum += p_sum[k];
     }
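
(For reference, the fragment above assumes surrounding declarations along these lines -- a sketch, since I haven't shown the rest of the code. VLEN is 2 for double precision SSE2, and _mm_load_pd/_mm_store_pd want 16-byte-aligned operands; the attribute below is GCC syntax.)

    #include <emmintrin.h>            /* SSE2 intrinsics                      */

    #define VLEN 2                    /* doubles per 128-bit SSE2 register    */

    /* _mm_load_pd / _mm_store_pd require 16-byte-aligned operands */
    double  l[VLEN]     __attribute__((aligned(16)));
    double  p_sum[VLEN] __attribute__((aligned(16)));
    __m128d __D_POW;                  /* per-lane running product, i.e. l^n   */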

These codes were run on a Nehalem 3.2 GHz (desktop) processor, and a Shanghai 2.3 GHz desktop processor. Here are the results:

    Code        CPU        Freq (GHz)    Wall clock (s)
    ---------   --------   ----------    --------------
    base        Nehalem    3.2           20.5
    optimized   Nehalem    3.2           6.72
    SSE-ized    Nehalem    3.2           3.37
    base        Shanghai   2.3           30.3
    optimized   Shanghai   2.3           7.36
    SSE-ized    Shanghai   2.3           3.68

These are single-thread, single-core runs. The code scales very well (it is one of our example codes for the HPC/programming/parallelization classes we do).

I found it interesting that the baseline code performance started out tracking the ratio of clock speeds. The Nehalem has a 39% faster clock and showed 48% faster baseline performance, which is about 9% more than clock speed alone would account for. The SSE code performance also differs by about 9%.
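
Spelling out the arithmetic behind those figures:

    clock ratio:     3.2  / 2.3   ~= 1.39   ->  39% faster clock
    baseline ratio:  30.3 / 20.5  ~= 1.48   ->  48% faster run
    SSE ratio:       3.68 / 3.37  ~= 1.09   ->   9% faster run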

I am sure lots of interesting points can be made from this (though being only one test, and not the most typical test/use case either, such points may be of dubious value).

I am working on a CUDA version as well, and will try to compare it to the threaded versions of the above. I am curious what we can achieve.
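
In the meantime, here is a minimal sketch of one way to thread the scalar loop with OpenMP (an illustration only -- the actual threaded version may differ, and the n and inf values below are just placeholders):

    /* compile: gcc -O2 -fopenmp rzf_omp.c */
    #include <stdio.h>

    int main(void)
    {
        const int  n   = 10;         /* exponent: placeholder value          */
        const long inf = 10000000;   /* number of terms: placeholder value   */
        double sum = 0.0;

        /* each thread accumulates a private partial sum of 1/k^n, and the
           reduction clause combines them, much like p_sum[] above */
        #pragma omp parallel for reduction(+:sum)
        for (long k = 1; k < inf; k++)
          {
            double d_pow = (double)k;
            for (int m = n; m > 1; m--)
                d_pow *= (double)k;       /* d_pow = k^n */
            sum += 1.0 / d_pow;
          }

        printf("zeta(%d) ~= %.15f\n", n, sum);
        return 0;
    }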

Joe


--
Kevin C. Abbey
System Administrator
Rutgers University - BioMaPS Institute

Email: kab...@biomaps.rutgers.edu


Hill Center - Room 259
110 Frelinghuysen Road
Piscataway, NJ  08854
Phone and Voice mail: 732-445-3288
Wright-Rieman Laboratories Room 201
610 Taylor Rd.
Piscataway, NJ  08854-8087
Phone: 732-445-2069
Fax: 732-445-5958

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
