g++ doesn't unroll a loop it should unroll
Hi,

I'm developing a Free C++ template library (1) in which it is very important that certain loops get unrolled. I can't unroll them by hand, because their bounds depend on template parameters.

My problem is that G++ 4.1.1 (Gentoo) doesn't unroll these loops.

I have written a simple standalone program showing the problem; I attach it (toto.cpp) and also paste it below. The program runs a loop if UNROLL is not defined, and does the same thing with the loop unrolled by hand if UNROLL is defined. So one would expect that with g++ -O3 the speed would be the same in both cases. Alas, it's not:

g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds
g++ -O3 toto.cpp -o toto ---> toto runs in 1.9 seconds

So what can I do? Is that a bug in g++? If yes, any hope to see it fixed soon?

Cheers,
Benoit

(1) : Eigen, see http://eigen.tuxfamily.org

file: toto.cpp

    #include <iostream>

    class Matrix
    {
    public:
        double data[9];

        // Column-major access: coefficient (i, j) lives at data[i + 3 * j].
        double & operator()( int i, int j )
        {
            return data[i + 3 * j];
        }

        void loadScaling( double factor );
    };

    // Sets the matrix to factor times the identity.
    void Matrix::loadScaling( double factor )
    {
    #ifdef UNROLL
        (*this)( 0, 0 ) = factor;
        (*this)( 1, 0 ) = 0;
        (*this)( 2, 0 ) = 0;
        (*this)( 0, 1 ) = 0;
        (*this)( 1, 1 ) = factor;
        (*this)( 2, 1 ) = 0;
        (*this)( 0, 2 ) = 0;
        (*this)( 1, 2 ) = 0;
        (*this)( 2, 2 ) = factor;
    #else
        for( int i = 0; i < 3; i++ )
            for( int j = 0; j < 3; j++ )
                (*this)(i, j) = (i == j) * factor;
    #endif
    }

    int main( int argc, char *argv[] )
    {
        Matrix m;
        for( int i = 0; i < 1; i++ )
            m.loadScaling( i );
        std::cout << "m(0,0) = " << m(0,0) << std::endl;
    }
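A workaround sometimes used when a compiler will not fully unroll a fixed-size loop is to write the iteration as compile-time template recursion, so that every index is a constant in its own instantiation. The following is only a minimal sketch under that assumption, not something proposed in this thread: FixedMatrix, ScalingUnroller and the free function loadScaling are hypothetical names, and the result still depends on the compiler inlining the recursive calls.

```cpp
// Hypothetical fixed-size matrix, modelled on the Matrix class in toto.cpp
// but with the size as a template parameter (an assumption for illustration).
template <int N>
struct FixedMatrix {
    double data[N * N];
    double& operator()(int i, int j) { return data[i + N * j]; }
};

// Compile-time recursion over the flat index K = i + N * j.
// Each instantiation writes one coefficient, then recurses on K + 1.
template <int N, int K, bool Done = (K == N * N)>
struct ScalingUnroller {
    static void run(FixedMatrix<N>& m, double factor) {
        const int i = K % N;  // constant within this instantiation
        const int j = K / N;  // constant within this instantiation
        m(i, j) = (i == j) ? factor : 0.0;
        ScalingUnroller<N, K + 1>::run(m, factor);
    }
};

// Terminating case: K reached N * N, nothing left to write.
template <int N, int K>
struct ScalingUnroller<N, K, true> {
    static void run(FixedMatrix<N>&, double) {}
};

// Sets m to factor times the identity without a runtime loop.
template <int N>
void loadScaling(FixedMatrix<N>& m, double factor) {
    ScalingUnroller<N, 0>::run(m, factor);
}
```

A caller would write FixedMatrix<3> m; loadScaling(m, 2.0); and the nine assignments are generated at compile time rather than iterated at run time.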
Re: g++ doesn't unroll a loop it should unroll
I had already tried that. It doesn't change anything. I had also tried passing a higher --param max-unroll-times. No effect.

So, any idea? The example program toto.cpp is so simple that I can't believe g++ can't handle it. Surely there must be something simple that I haven't understood?

Benoit

On Wednesday 13 December 2006 13:12, Steven Bosscher wrote:
> On 12/13/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > g++ -DUNROLL -O3 toto.cpp -o toto ---> toto runs in 0.3 seconds
> > g++ -O3 toto.cpp -o toto---> toto runs in 1.9 seconds
> >
> > So what can I do? Is that a bug in g++? If yes, any hope to see it fixed
> > soon?
>
> You could try adding -funroll-loops.
>
> Gr.
> Steven
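For readers on much newer compilers: besides command-line flags, GCC 8 and later accept a per-loop unrolling request, #pragma GCC unroll n. This was not available in the GCC 4.1.1 discussed here, so the sketch below is an aside under that assumption. Matrix3 and loadScalingPragma are hypothetical names, and whether the loops end up fully unrolled remains the compiler's decision.

```cpp
// Same 3x3 column-major layout as the Matrix class in toto.cpp.
struct Matrix3 {
    double data[9];
    double& operator()(int i, int j) { return data[i + 3 * j]; }
};

// Assumption: GCC 8 or later. Other compilers may ignore the pragma,
// possibly with a warning; the loop is still correct either way.
inline void loadScalingPragma(Matrix3& m, double factor) {
#pragma GCC unroll 3
    for (int i = 0; i < 3; i++) {
#pragma GCC unroll 3
        for (int j = 0; j < 3; j++) {
            m(i, j) = (i == j) * factor;  // same expression as toto.cpp
        }
    }
}
```

The pragma applies only to the loop that immediately follows it, which is why it appears once per nesting level.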
Re: g++ doesn't unroll a loop it should unroll
On Wednesday 13 December 2006 23:09, Denis Vlasenko wrote:
> C++ doesn't specify that compiler shall unroll loops, so it cannot be
> classified as "real" bug.

OK, but then, even if I explicitly ask gcc to unroll loops with -funroll-loops, it still doesn't unroll them completely and the program is still as slow. See the bug report here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30201

> Re code: I would use memset + just a single

No. In this example the numbers are doubles, but in my template library the type is a "typename T" and I can make no assumption as to the bit representation of static_cast<T>(0).

> loop anyway... you C++ people tend to overtax compiler with
> optimizations. Is it really necessary to do (i == j) * factor
> when (i == j) ? factor : 0 is easier for compiler to grok?

Of course I tried it. It's even slower. It doesn't help the compiler unroll the loop, and now there's a branch at each iteration.

> Template lib for vector and matrix math sounds like a performance
> disaster in the making, at least for me. However, maybe you are
> truly smart guy and can do miracles.

I don't understand why you say that. At the language specification level, templates come with no inherent speed overhead. All of the template code is expanded at compile time and none of it remains visible in the binary, so it shouldn't make the binary slower.

Benoit
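The memset objection can be made concrete with a small sketch. GenericMatrix and OffsetInt below are hypothetical: OffsetInt is a contrived scalar type whose zero is not stored as all-zero bytes, which is exactly the case where memset would write the wrong value. The ternary form is used here only because OffsetInt defines no multiplication; it is not meant as a statement about which form is faster.

```cpp
// Hypothetical generic version of the Matrix class from the first message.
template <typename T, int N>
struct GenericMatrix {
    T data[N * N];
    T& operator()(int i, int j) { return data[i + N * j]; }

    // Sets the matrix to factor times the identity.
    void loadScaling(const T& factor) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                // static_cast<T>(0) asks T for its own zero; memset would
                // instead assume that zero is the all-zero byte pattern.
                (*this)(i, j) = (i == j) ? factor : static_cast<T>(0);
    }
};

// Hypothetical scalar whose zero value is NOT all-zero bytes:
// it stores value + 1 internally, so zero is stored as 1.
struct OffsetInt {
    int stored;
    OffsetInt(int v = 0) : stored(v + 1) {}
    int value() const { return stored - 1; }
};
```

With GenericMatrix<OffsetInt, 2>, the off-diagonal coefficients report value() == 0 while their stored member is 1, so zeroing the bytes would have produced -1 instead.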
Re: g++ doesn't unroll a loop it should unroll
On Thursday 14 December 2006 08:58, Steven Bosscher wrote:
> On 12/14/06, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > I don't understand why you say that. At the language specification level,
> > templates come with no inherent speed overhead. All of the template stuff
> > is unfolded at compile time, none of it remains visible in the binary, so
> > it shouldn't make the binary slower.
>
> You're confusing theory and practice...

We're getting off-topic here. The example program I sent in my first mail didn't have any templates, so the gcc bug we're talking about has nothing to do with templates. The same bug would appear in any C or C++ program that has nested loops and expects them to be completely unrolled.

Benoit
Auto-vectorization: need to know what to expect
Dear All,

I am currently (co-)developing a Free (GPL/LGPL) C++ library for vector/matrix math.

A major decision that we need to take is what to do regarding vectorization instructions (SSE). Either we rely on GCC to auto-vectorize, or we explicitly control the vectorization using GCC's special primitives. The latter solution is of course more difficult, and would to some degree obfuscate our source code, so we wish to know whether or not it's really necessary.

GCC 4.3.0 does auto-vectorize our loops, but the resulting code has worse performance than a version with unrolled loops and no vectorization. By contrast, ICC auto-vectorizes the same loops in a way that makes them significantly faster than the unrolled-loops non-vectorized version.

If you want to know, the loops in question typically look like:

    for(int i = 0; i < COMPILE_TIME_CONSTANT; i++)
    {
        // some abstract c++ code with deep recursive templates and
        // deep recursive inline functions, but resulting in only a
        // few assembly instructions
        a().b().c().d(i) = x().y().z(i);
    }

As said above, it's crucial for us to be able to get an idea of what to expect, because design decisions depend on that. Should we expect large improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5?

A roadmap or a GCC developer sharing his thoughts would be very helpful.

Cheers,
Benoit

P.S. I have noticed huge improvements in GCC recently and would like to thank all the developers for that. This is what makes me hope that GCC might soon handle auto-vectorization in a way that allows me to rely on it!
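To make the loop shape above concrete, here is a minimal self-contained sketch of a chain of inline accessors that collapses to a plain array copy once inlined. Storage, Block, Vector4, SIZE and copyCoefficients are all hypothetical names chosen for illustration; the real library's expression chain is much deeper.

```cpp
const int SIZE = 4;  // stands in for COMPILE_TIME_CONSTANT

// Innermost layer: owns the coefficients.
struct Storage {
    double values[SIZE];
    double& at(int i) { return values[i]; }
};

// Middle layer: only forwards to its storage.
struct Block {
    Storage storage;
    Storage& data() { return storage; }
};

// Outer layer: only forwards to its block.
struct Vector4 {
    Block block;
    Block& impl() { return block; }
};

// After inlining, dst.impl().data().at(i) is just
// dst.block.storage.values[i], so the loop body is one load and one store.
inline void copyCoefficients(Vector4& dst, Vector4& src) {
    for (int i = 0; i < SIZE; i++)
        dst.impl().data().at(i) = src.impl().data().at(i);
}
```

To see what the vectorizer makes of such a loop, GCC 4.x releases offered -ftree-vectorizer-verbose=N and later releases offer -fopt-info-vec; which of the two applies depends on the GCC version in use.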
Re: Auto-vectorization: need to know what to expect
Thanks Richard for the answer. It sounds like it's worth betting on gcc's autovectorizer and submitting bug reports -- so expect to hear again from us :)

Cheers,
Benoît

On Monday 17 March 2008 15:59:21 Richard Guenther wrote:
> On Mon, Mar 17, 2008 at 3:45 PM, Benoît Jacob <[EMAIL PROTECTED]> wrote:
> > Dear All,
> >
> > I am currently (co-)developing a Free (GPL/LGPL) C++ library for
> > vector/matrix math.
> >
> > A major decision that we need to take is, what to do regarding
> > vectorization instructions (SSE). Either we rely on GCC to
> > auto-vectorize, or we control explicitly the vectorization using GCC's
> > special primitives. The latter solution is of course more difficult, and
> > would to some degree obfuscate our source code, so we wish to know
> > whether or not it's really necessary.
> >
> > GCC 4.3.0 does auto-vectorize our loops, but the resulting code has
> > worse performance than a version with unrolled loops and no
> > vectorization. By contrast, ICC auto-vectorizes the same loops in a way
> > that makes them significantly faster than the unrolled-loops
> > non-vectorized version.
> >
> > If you want to know, the loops in question typically look like:
> > for(int i = 0; i < COMPILE_TIME_CONSTANT; i++)
> > {
> > // some abstract c++ code with deep recursive templates and
> > // deep recursive inline functions, but resulting in only a
> > // few assembly instructions
> > a().b().c().d(i) = x().y().z(i);
> > }
> >
> > As said above, it's crucial for us to be able to get an idea of what to
> > expect, because design decisions depend on that. Should we expect large
> > improvements regarding autovectorization in 4.3.x, in 4.4 or 4.5 ?
>
> In general GCC's autovectorization capabilities are quite good; cases
> where we miss opportunities do of course exist. There were improvements
> regarding autovectorization capabilities in every GCC release and I expect
> that to continue for future releases (though I cannot promise anything
> as GCC is a volunteer driven project - but certainly testcases where we
> miss optimizations are welcome - often we don't know of all corner cases).
>
> If you require to get the absolute most out of your CPU I recommend to
> provide special routines tuned for the different CPU families and I
> recommend the use of the standard intrinsics headers (*mmintrin.h) for
> this. Of course this comes at a high cost of maintenance (and initial
> work), so autovectorization might prove good enough. Often tuning the
> source for a given compiler has a similar effect to producing vectorized
> code manually. Looking at GCC tree dumps and knowing a bit about
> GCC internals helps you here ;)
>
> > A roadmap or a GCC developer sharing his thoughts would be very helpful.
>
> Thanks,
> Richard.
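As an illustration of the hand-written route through the intrinsics headers, here is a minimal sketch using the SSE2 header emmintrin.h. The function add4 is a hypothetical name, and the code assumes an x86 target with SSE2 enabled (the default on x86-64). It uses unaligned loads and stores so that it makes no assumption about how the arrays were allocated.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_loadu_pd, ...

// Adds two arrays of 4 doubles, two lanes at a time.
// Assumption: x86 target with SSE2 available.
inline void add4(const double* a, const double* b, double* out) {
    for (int i = 0; i < 4; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // unaligned load of 2 doubles
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  // lane-wise sum
    }
}
```

When the data is known to be 16-byte aligned, the aligned variants _mm_load_pd and _mm_store_pd can be used instead. The maintenance cost mentioned in the quoted reply comes from having to write and tune routines like this per operation and per CPU family.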
Re: Auto-vectorization: need to know what to expect
I have looked more closely at the messages generated by the gcc 4.3 vectorizer, and they seem to fall into two categories:

1) Complaints about alignment. For example:

    Unknown alignment for access: D.33485
    Unknown alignment for access: m

I don't understand this, as all my data is statically allocated doubles (no dynamic memory allocation) and I am using -malign-double. What more can I do?

2) Complaints about "possible dependence" between some data and itself. For example:

    not vectorized, possible dependence between data-refs
    m.m_storage.m_data[D.43225_112] and m.m_storage.m_data[D.43225_112]

I am wondering what to do about all that. Surely there must be documentation about the vectorizer and its messages somewhere, but I can't find it.

Cheers,
Benoit
Re: Auto-vectorization: need to know what to expect
OK. It's nontrivial, as this uses a 2500-line C++ template library, but I'll do my best to come up with something self-contained.

Cheers,
Benoit

On Monday 17 March 2008 18:51:57 Daniel Jacobowitz wrote:
> On Mon, Mar 17, 2008 at 06:33:23PM +0100, Benoît Jacob wrote:
> > I have looked more closely at the messages generated by the gcc 4.3
> > vectorizer and it seems that they fall into two categories:
>
> The absolute best thing you can do in cases like this is to make a
> small program which shows the message, and send that to Bugzilla.
Re: Auto-vectorization: need to know what to expect
Thanks a lot Michael for the detailed help! Thanks also n8tm, and sorry to have posted on the wrong list.

Well, that's a lot of food for thought and it'll keep me busy for some time, so thanks again to all, and bye!

Benoit

On Monday 17 March 2008 20:08:43 Michael Meissner wrote:
> However, SSE instructions need 128-bit alignment, not 64-bit alignment that
> -malign-double would give. You can align the arrays yourself with the
> __attribute__((__aligned__(16))) declaration, or use a union that has an
> element with 16-byte alignment (vector element, such as __m128, __m128d,
> __m128i or long double and -m128bit-long-double). Note, if the arrays are
> auto rather than static, you probably need to use the -mstackrealign and
> -mpreferred-stack-boundary=16 as well.
>
> It might be nice to think about an option that automatically aligns large
> arrays without having to do the declaration (or even have the vectorizer
> override the alignment for statics/auto).
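The alignment advice in the quoted message can be sketched as follows, using the GCC attribute it names. AlignedMatrix, alignedArray and isAligned16 are hypothetical names for illustration. One caveat when copying the flags: the GCC manual describes the argument of -mpreferred-stack-boundary as an exponent of two, so a 16-byte stack boundary corresponds to the value 4.

```cpp
#include <stdint.h>  // uintptr_t

// Hypothetical 3x3 matrix whose storage is 16-byte aligned.
// __attribute__((__aligned__(16))) is the GCC extension named above.
struct AlignedMatrix {
    double data[9] __attribute__((__aligned__(16)));
};

// A static array declared with the same attribute.
static double alignedArray[4] __attribute__((__aligned__(16)));

// True if p sits on a 16-byte boundary, as aligned SSE loads require.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}
```

A check such as isAligned16(m.data) in a debug build is a cheap way to confirm that the attribute took effect before relying on aligned loads.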