On Thu, Sep 8, 2011 at 3:09 PM, Steve White <stevan.wh...@googlemail.com> wrote:
> Hi Richard!
>
> On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther
> <richard.guent...@gmail.com> wrote:
>> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
>> <stevan.wh...@googlemail.com> wrote:
>>> Hi,
>>>
>>> I run some tests of simple number-crunching loops whenever new
>>> architectures and compilers arise.
>>>
>>> These tests on recent Intel architectures show similar performance
>>> between gcc and icc compilers, at full optimization.
>>>
>>> However a recent test on x86_64 showed the open64 compiler
>>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>>> flags; nothing helped.
>>
>> Like -funroll-loops?
>>
>
> ** Let's turn it around:  What is a good set of flags, then, for
> improving speed in simple loops such as these on x86_64?
>
> In fact, I did try -funroll-loops and several others, but I somehow
> fooled myself (maybe partly because, as I wrote, I was under the
> impression that -O3 turned this on by default).
>
> With -funroll-loops, the performance is improved a lot.
>
> $ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
> $ ./a.out
> double array mults by const             320 ms [  1.013193]
>
> That puts it only a factor of 2 slower than open64 at -O3.
>
> Furthermore, -march=native improves it yet more.
>
> $ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic
> mults_by_const.c
> $ ./a.out
> double array mults by const             300 ms [  1.013193]
>
> Now it's only 70% slower than the open64 results.
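>
> For reference, a minimal sketch of the kind of kernel being timed is
> below; the array size, iteration count, constant, and timing harness
> here are assumptions, not the actual mults_by_const.c:
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE       100000
> #define ITERATIONS 10000
>
> static double dvec[SIZE];
>
> int main( void )
> {
>         const double dval = 1.0000001;
>
>         for( int i = 0; i < SIZE; i++ )
>                 dvec[i] = 1.0;
>
>         clock_t t0 = clock();
>         for( int j = 0; j < ITERATIONS; j++ )
>                 for( int i = 0; i < SIZE; i++ )
>                         dvec[i] *= dval;
>         clock_t t1 = clock();
>
>         printf( "double array mults by const  %.0f ms [ %f ]\n",
>                 (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC, dvec[0] );
>         return 0;
> }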
>
> I tried these flags:
>   -floop-optimize  -fmove-loop-invariants -fprefetch-loop-arrays -fprofile-use
> but saw no further improvements.
>
> So I drop my claim of knowing what the problem is (and repent of even
> having tried before).
>
> Simple searches on the web turn up a lot of experiments, but nothing definitive.
>
> FWIW, also attached is the whole assembler file generated with the
> above settings.
>
> To my eye, the assembly gcc generates is a great deal more complicated
> and does a lot more work, besides being slower.

opencc interchanges the loops

        for( j = 0; j < ITERATIONS; j++ )
                for( i = 0; i < size; i++ )
                        dvec[i] *= dval;

to

        for( i = 0; i < size; i++ )
                for( j = 0; j < ITERATIONS; j++ )
                        dvec[i] *= dval;

and then applies store-motion to end up with

        for( i = 0; i < size; i++ )
        {
                double tem = dvec[i];
                for( j = 0; j < ITERATIONS; j++ )
                        tem *= dval;
                dvec[i] = tem;
        }

That's obviously better for the cache: each dvec[i] is loaded and
stored only once instead of ITERATIONS times.  GCC can do the same
when you enable -ftree-loop-linear, but then it confuses itself
enough that it no longer vectorizes the loop.
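
If you want to try the transformation by hand in the benchmark source,
the rewritten kernel would look something like the sketch below (the
function name and parameter list are only illustrative, not output from
either compiler); call it from the timing loop in place of the nested
loops above:

/* Interchanged loops plus store motion: each dvec[i] is loaded and
   stored exactly once, and the running product stays in a register. */
static void mults_by_const_interchanged( double *dvec, int size,
                                         int iterations, double dval )
{
        for( int i = 0; i < size; i++ )
        {
                double tem = dvec[i];
                for( int j = 0; j < iterations; j++ )
                        tem *= dval;
                dvec[i] = tem;
        }
}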

Richard.

> Thanks!
>
