from:"Benoît Jacob"

g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob

Hi,

I'm developing a Free C++ template library (1) in which it is very important 
that certain loops get unrolled, but at the same time I can't unroll them by 
hand, because they depend on template parameters.

My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.

I have written a standalone simple program showing this problem; I attach it 
(toto.cpp) and I also paste it below. This program does a loop if UNROLL is 
not defined, and does the same thing but with the loop unrolled by hand if 
UNROLL is defined. So one would expect that with g++ -O3, the speed would be 
the same in both cases. Alas, it's not:

g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds

So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon?

Cheers,
Benoit

(1) : Eigen, see http://eigen.tuxfamily.org


file: toto.cpp


#include <iostream>

class Matrix
{
public:
    double data[9];
    // Column-major access: element (i, j) is stored at data[i + 3 * j].
    double & operator()( int i, int j )
    {
        return data[i + 3 * j];
    }
    void loadScaling( double factor );
};

// Fills the matrix with factor on the diagonal and 0 elsewhere.
void Matrix::loadScaling( double factor)
{
#ifdef UNROLL
    (*this)( 0, 0 ) = factor;
    (*this)( 1, 0 ) = 0;
    (*this)( 2, 0 ) = 0;
    (*this)( 0, 1 ) = 0;
    (*this)( 1, 1 ) = factor;
    (*this)( 2, 1 ) = 0;
    (*this)( 0, 2 ) = 0;
    (*this)( 1, 2 ) = 0;
    (*this)( 2, 2 ) = factor;
#else
    for( int i = 0; i < 3; i++ )
        for( int j = 0; j < 3; j++ )
            (*this)(i, j) = (i == j) * factor;
#endif
}

int main( int argc, char *argv[] )
{
    Matrix m;
    for( int i = 0; i < 1; i++ )
        m.loadScaling( i );
    std::cout << "m(0,0) = " << m(0,0) << std::endl;
}



Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob

I had already tried that. That doesn't change anything.

I had also tried passing a higher --param max-unroll-times. No effect.

So, any idea? The example program toto.cpp is so simple, I can't believe g++ 
can't handle it. Surely there must be something simple that I haven't 
understood?

Benoit

On Wednesday 13 December 2006 13:12, Steven Bosscher wrote:
> On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > g++ -DUNROLL -O3 toto.cpp -o toto   ---> toto runs in 0.3 seconds
> > g++ -O3 toto.cpp -o toto            ---> toto runs in 1.9 seconds
> >
> > So what can I do? Is that a bug in g++? If yes, any hope to see it fixed
> > soon?
>
> You could try adding -funroll-loops.
>
> Gr.
> Steven


Re: g++ doesn't unroll a loop it should unroll

2006-12-13 Thread Benoît Jacob

On Wednesday 13 December 2006 23:09, Denis Vlasenko wrote:
> C++ doesn't specify that compiler shall unroll loops, so it cannot be
> classified as "real" bug.

OK, but then, even if I explicitly ask gcc to unroll loops 
with -funroll-loops, it still doesn't unroll them completely and is still as 
slow. See bug report here:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30201

> Re code: I would use memset + just a single

No: in this example the numbers are double, but in my template library the 
type is a "typename T" and I can make no assumption as to the bit 
representation of static_cast<T>(0).

> loop anyway... you C++ people tend to overtax compiler with
> optimizations. Is it really necessary to do (i == j) * factor
> when (i == j) ? factor : 0 is easier for compiler to grok?

Of course I tried it. It's even slower. Doesn't help the compiler unroll the 
loop, and now there's a branch at each iteration.

> Template lib for vector and matrix math sounds like a performance
> disaster in the making, at least for me. However, maybe you are
> truly smart guy and can do miracles.

I don't understand why you say that. At the language specification level, 
templates come with no inherent speed overhead. All of the template stuff is 
unfolded at compile time, none of it remains visible in the binary, so it 
shouldn't make the binary slower.

Benoit



Re: g++ doesn't unroll a loop it should unroll

2006-12-14 Thread Benoît Jacob

On Thursday 14 December 2006 08:58, Steven Bosscher wrote:
> On 12/14/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > I don't understand why you say that. At the language specification level,
> > templates come with no inherent speed overhead. All of the template stuff
> > is unfolded at compile time, none of it remains visible in the binary, so
> > it shouldn't make the binary slower.
>
> You're confusing theory and practice...

We're getting off-topic here. The example program I sent in my first mail 
didn't have any templates, so the gcc bug we're talking about has nothing to 
do with templates. The same bug would appear in any C or C++ program having 
nested loops and expecting them to get completely unrolled.

Benoit


Auto-vectorization: need to know what to expect

2008-03-17 Thread Benoît Jacob

Dear All,

I am currently (co-)developing a Free (GPL/LGPL) C++ library for vector/matrix 
math.

A major decision that we need to take is, what to do regarding vectorization 
instructions (SSE). Either we rely on GCC to auto-vectorize, or we control 
explicitly the vectorization using GCC's special primitives. The latter 
solution is of course more difficult, and would to some degree obfuscate our 
source code, so we wish to know whether or not it's really necessary.

GCC 4.3.0 does auto-vectorize our loops, but the resulting code has worse 
performance than a version with unrolled loops and no vectorization. By 
contrast, ICC auto-vectorizes the same loops in a way that makes them 
significantly faster than the unrolled-loops non-vectorized version.

If you want to know, the loops in question typically look like:
for(int i = 0; i < COMPILE_TIME_CONSTANT; i++)
{
// some abstract c++ code with deep recursive templates and
// deep recursive inline functions, but resulting in only a
// few assembly instructions
a().b().c().d(i) = x().y().z(i);
}

As said above, it's crucial for us to be able to get an idea of what to 
expect, because design decisions depend on that. Should we expect large 
improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5 ?

A roadmap or a GCC developer sharing his thoughts would be very helpful.

Cheers,

Benoit

P.S. I have noticed huge improvements in GCC recently and would like to thank 
all the developers for that. This is what makes me hope that GCC might soon 
handle auto-vectorization in a way that allows me to rely on it!



Re: Auto-vectorization: need to know what to expect

2008-03-17 Thread Benoît Jacob

Thanks Richard for the answer.

It sounds like it's worth betting on gcc's autovectorizer and submitting bug 
reports -- so expect to hear again from us :)

Cheers,
Benoît

On Monday 17 March 2008 15:59:21 Richard Guenther wrote:
> On Mon, Mar 17, 2008 at 3:45 PM, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > Dear All,
> >
> >  I am currently (co-)developing a Free (GPL/LGPL) C++ library for
> > vector/matrix math.
> >
> >  A major decision that we need to take is, what to do regarding
> > vectorization instructions (SSE). Either we rely on GCC to
> > auto-vectorize, or we control explicitly the vectorization using GCC's
> > special primitives. The latter solution is of course more difficult, and
> > would to some degree obfuscate our source code, so we wish to know
> > whether or not it's really necessary.
> >
> >  GCC 4.3.0 does auto-vectorize our loops, but the resulting code has
> > worse performance than a version with unrolled loops and no
> > vectorization. By contrast, ICC auto-vectorizes the same loops in a way
> > that makes them significantly faster than the unrolled-loops
> > non-vectorized version.
> >
> >  If you want to know, the loops in question typically look like:
> >  for(int i = 0; i < COMPILE_TIME_CONSTANT; i++)
> >  {
> > // some abstract c++ code with deep recursive templates and
> > // deep recursive inline functions, but resulting in only a
> > // few assembly instructions
> > a().b().c().d(i) = x().y().z(i);
> >  }
> >
> >  As said above, it's crucial for us to be able to get an idea of what to
> >  expect, because design decisions depend on that. Should we expect large
> >  improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5 ?
>
> In general GCCs autovectorization capabilities are quite good, cases
> where we miss opportunities do of course exist.  There were improvements
> regarding autovectorization capabilities in every GCC release and I expect
> that to continue for future releases (though I cannot promise anything
> as GCC is a volunteer driven project - but certainly testcases where we
> miss optimizations are welcome - often we don't know of all corner cases).
>
> If you require to get the absolute most out of your CPU I recommend to
> provide special routines tuned for the different CPU families and I
> recommend the use of the standard intrinsics headers (*mmintrin.h) for
> this.  Of course this comes at a high cost of maintenance (and initial
> work), so autovectorization might prove good enough.  Often tuning the
> source for a given compiler has a similar effect to producing vectorized
> code manually.  Looking at GCC tree dumps and knowing a bit about
> GCC internals helps you here ;)
>
> >  A roadmap or a GCC developer sharing his thoughts would be very helpful.
>
> Thanks,
> Richard.





Re: Auto-vectorization: need to know what to expect

2008-03-17 Thread Benoît Jacob

I have looked more closely at the messages generated by the gcc 4.3 vectorizer 
and it seems that they fall into two categories:

1) complaining about alignment.

For example:

Unknown alignment for access: D.33485
Unknown alignment for access: m

I don't understand, as all my data is statically allocated doubles (no dynamic 
memory allocation) and I am using -malign-double. What more can I do?

2) complaining about "possible dependence" between some data and itself

Example:

not vectorized, possible dependence between data-refs 
m.m_storage.m_data[D.43225_112] and m.m_storage.m_data[D.43225_112]


I am wondering what to do about all that? Surely there must be documentation 
about the vectorizer and its messages somewhere but I can't find it?

Cheers,
Benoit







Re: Auto-vectorization: need to know what to expect

2008-03-17 Thread Benoît Jacob

OK. It's nontrivial as this uses a 2500-line c++ template library, but I'll do 
my best to come up with something self-contained.

Cheers,
Benoit

On Monday 17 March 2008 18:51:57 Daniel Jacobowitz wrote:
> On Mon, Mar 17, 2008 at 06:33:23PM +0100, Benoît Jacob wrote:
> > I have looked more closely at the messages generated by the gcc 4.3
> > vectorizer and it seems that they fall into two categories:
>
> The absolute best thing you can do in cases like this is to make a
> small program which shows the message, and send that to Bugzilla.


Re: Auto-vectorization: need to know what to expect

2008-03-17 Thread Benoît Jacob

Thanks a lot Michael for the detailed help!

Thanks also n8tm, and sorry to have posted on the wrong list.

Well that's a lot of food for thought and it'll keep me busy for some time,
so thanks again to all, and bye!

Benoit

On Monday 17 March 2008 20:08:43 Michael Meissner wrote:
> However, SSE instructions need 128-bit alignment, not 64-bit alignment that
> -malign-double would give.  You can align the arrays yourself with the
> __attribute__((__aligned__(16))) declaration, or use a union that has an
> element with 16-byte alignment (vector element, such as __m128, __m128d,
> __m128i or long double and -m128bit-long-double).  Note, if the arrays are
> auto rather than static, you probably need to use the -mstackrealign and
> -mpreferred-stack-boundary=4 (the value is a power-of-two exponent, so 4
> means 16 bytes) as well.
>
> It might be nice to think about an option that automatically aligns large
> arrays without having to do the declaration (or even have the vectorizer
> override the alignment for statics/auto).


