On Thu, Sep 8, 2011 at 3:09 PM, Steve White <stevan.wh...@googlemail.com> wrote:
> Hi Richard!
>
> On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther
> <richard.guent...@gmail.com> wrote:
>> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
>> <stevan.wh...@googlemail.com> wrote:
>>> Hi,
>>>
>>> I run some tests of simple number-crunching loops whenever new
>>> architectures and compilers arise.
>>>
>>> These tests on recent Intel architectures show similar performance
>>> between the gcc and icc compilers at full optimization.
>>>
>>> However, a recent test on x86_64 showed the open64 compiler
>>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>>> flags; nothing helped.
>>
>> Like -funroll-loops?
>>
>
> ** Let's turn it around: what, then, is a good set of flags for
> improving speed in simple loops such as these on x86_64?
>
> In fact, I did try -funroll-loops and several others, but I somehow
> fooled myself (maybe partly because, as I wrote, I was under the
> impression that -O3 turned it on by default).
>
> With -funroll-loops, the performance improves a lot:
>
> $ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
> $ ./a.out
> double array mults by const    320 ms  [ 1.013193]
>
> That puts it only a factor of 2 slower than the open64 -O3 result.
>
> Furthermore, -march=native improves it yet more:
>
> $ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic mults_by_const.c
> $ ./a.out
> double array mults by const    300 ms  [ 1.013193]
>
> Now it's only 70% slower than the open64 result.
>
> I tried these flags:
>   -floop-optimize -fmove-loop-invariants -fprefetch-loop-arrays -fprofile-use
> but saw no further improvement.
>
> So I drop my claim of knowing what the problem is (and repent of even
> having tried before).
>
> Simple searches on the web turn up a lot of experiments, but nothing
> definitive.
>
> FWIW, also attached is the whole assembler file generated with the
> above settings.
>
> To my eye, the gcc assembler output is a great deal more complicated
> and does a lot more work, besides being slower.
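Since mults_by_const.c itself isn't quoted above, here is a minimal sketch of
the kind of kernel being timed, for reference.  Only the "dvec[i] *= dval"
loop nest (with ITERATIONS and size) is taken from the thread; the array
size, iteration count, initial values and timing harness below are made-up
placeholders, not the actual benchmark.

  /* Sketch of a "double array mults by const" benchmark.  The real
     mults_by_const.c is not attached in this message; SIZE, ITERATIONS,
     the initial values and the clock() harness are placeholders. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define ITERATIONS 100000   /* assumed repeat count */
  #define SIZE       10000    /* assumed array length */

  int main( void )
  {
      double *dvec = malloc( SIZE * sizeof *dvec );
      double  dval = 1.0000001;

      for ( int i = 0; i < SIZE; i++ )
          dvec[i] = 1.0;

      clock_t start = clock();

      /* the loop nest under discussion */
      for ( int j = 0; j < ITERATIONS; j++ )
          for ( int i = 0; i < SIZE; i++ )
              dvec[i] *= dval;

      clock_t stop = clock();

      double ms = ( stop - start ) * 1000.0 / CLOCKS_PER_SEC;
      /* print elapsed time and the first element as a checksum */
      printf( "double array mults by const  %6.0f ms  [ %f]\n", ms, dvec[0] );

      free( dvec );
      return 0;
  }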
opencc interchanged the loops

  for( j = 0; j < ITERATIONS; j++ )
    for( i = 0; i < size; i++ )
      dvec[i] *= dval;

into

  for( i = 0; i < size; i++ )
    for( j = 0; j < ITERATIONS; j++ )
      dvec[i] *= dval;

and then applied store motion to end up with

  for( i = 0; i < size; i++ )
    {
      double tem = dvec[i];
      for( j = 0; j < ITERATIONS; j++ )
        tem *= dval;
      dvec[i] = tem;
    }

That's obviously better for the cache.  GCC can do the same when you
enable -ftree-loop-linear, but then it confuses itself enough to no
longer vectorize the loop (an example invocation is sketched after the
quoted text below).

Richard.

> Thanks!
>
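For anyone who wants to try this, an example invocation combining the flags
from the thread with -ftree-loop-linear.  The -ftree-vectorizer-verbose flag
is an assumption here (it is not mentioned in the thread) and is used only
to report whether the vectorizer still fires after the interchange:

  $ gcc --std=c99 -O3 -funroll-loops -march=native -ftree-loop-linear \
        -ftree-vectorizer-verbose=2 -Wall -pedantic mults_by_const.c
  $ ./a.out

Comparing the timing and the vectorizer report with and without
-ftree-loop-linear shows whether the interchange pays off or whether
vectorization is lost, as Richard describes.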