------- Additional Comments From hubicka at ucw dot cz  2004-12-24 21:09 -------
Subject: Re:  [4.0 Regression] threefold performance loss, not inlining as much

> 
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-24 20:36 -------
> Reduced testcase:
> const int LMAX = 4;
> const int LMAX41 = 4*LMAX+1;
> const int LMAX12 = (LMAX+1)*(LMAX+2)/2;
> template<int n>
> inline double accu1( const double* p1, const double* p2 )
> {
>      double d = *p1 * *p2;
>      return d + accu1<n-1>( ++p1, ++p2 );
> }
> template <> inline double accu1<0>( const double* p1, const double* p2 )
> {
>     return p1[0] * p2[0]; 
> }
> template <int ny, int nz>
> inline double accu2( const double* py, const double* pz, const double* h )
> {
>     const double d = accu1<nz>( pz, h ) * *py;
>     if( ny == 0 ) return d;
>     return d + accu2<(ny ? ny-1 : -1), (ny ? nz : -1 )>( ++py, pz, ++h );
> }
> template<>
> inline double accu2<-1, -1>( const double* , const double* , const double* )
> {
>     return 0.0;
> }
> template <int ny, int nz>
> inline double accu( const double* py, const double* pz, const double* h )
> {
>     if( ny == 0 ) return accu1<nz>( pz, h );
>     else  if( nz == 0 ) return accu1<ny>( py, h );
>     else if( nz >= ny ) return accu2<ny, nz>( py, pz, h );
>     else return accu2<nz, ny>( pz, py, h );
> }
> template <>
> inline double accu<0,0>( const double* , const double* , const double* h )
> {
>     return *h;
> }
> #define SWYZ( Y, Z ) ((Y+Z) * (Y+Z+1) / 2+Z)
> #define CASA( Y, Z ) case SWYZ( Y, Z ):         \
>         *ap1 = accu<Y, Z>( py, pz, dxb );      \
>     if( z1 == 0 ) break;                        \
>     ++ap1;                                      \
>     z1--; py += LMAX41; pz -= LMAX41;
> #define CAS( Y, Z ) case SWYZ( Y, Z ): *ap1 = accu<Y, Z>( py, pz, dxb ); break
> #define CAS1( Y ) CASA( Y, 1 );  CAS( Y+1, 0 );
> #define CAS2( Y ) CASA( Y, 2 ); CAS1( Y+1 );
> #define CAS3( Y ) CASA( Y, 3 ); CAS2( Y+1 );
> #define CAS4( Y ) CASA( Y, 4 ); CAS3( Y+1 );
> 
> double f(const double *py, const double *pz, double *dxb, double *ap1,
>          int mh_z1234, unsigned int z1)
> {
>   switch( mh_z1234 )
>  {
>     CAS( 0, 0 );
>     CAS1(0);
>     CAS2(0);
>     CAS3(0);
>     CAS4(0);
>   }
> }
> 
> 
> When we do -O3 or -O2, we don't inline accu1<1> into accu1<2> at all, 
> why?????????

Because we inline other functions before we get to this one, and by then
the inline-unit-growth limit has been reached...
Actually, for very small units the inline-unit-growth limit seems to be
a bit too tight, so we might think about bypassing this limit for very
small units, but this won't solve the original testcase anyway....
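
To make the ordering effect concrete, here is a toy model of a greedy
inliner working under a unit growth cap. It is only a sketch under stated
assumptions, not GCC's actual heuristic: the Candidate struct, the
inline_with_budget function and all the sizes are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical call site the inliner might expand (illustration only).
struct Candidate {
    std::string name;  // label for the call site
    int growth;        // estimated size increase if this call is inlined
};

// Walks the candidates in order and inlines each one only while the unit
// stays within unit_size * (1 + max_growth_percent / 100).
// Returns one flag per candidate: true if it was inlined.
std::vector<bool> inline_with_budget(const std::vector<Candidate>& cands,
                                     int unit_size, int max_growth_percent)
{
    assert(unit_size >= 0 && max_growth_percent >= 0);
    const int limit = unit_size + unit_size * max_growth_percent / 100;
    int current = unit_size;
    std::vector<bool> inlined;
    for (const Candidate& c : cands) {
        const bool fits = current + c.growth <= limit;
        if (fits)
            current += c.growth;  // budget is consumed by earlier candidates
        inlined.push_back(fits);
    }
    return inlined;
}
```

With a unit of size 100, a 50% cap and three candidates of growth 20 each,
the first two fit and the third is rejected, although it is no larger than
the others. In a small unit the absolute budget is tiny, so this happens
early.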

The profiling branch seems to get this testcase right and inlines everything
due to slightly different code size estimates, but it does not work
particularly well on the tramp3d testcase (right now it is slightly worse
than mainline without profiling; I have a patch to bring it back to
mainline levels, which are obviously far from optimal...)

I am unsure whether we can come up with a more realistic cost model without
actually trying to inline the function and seeing how much it optimizes, as
suggested by some papers (but apparently not very suitable for a
production compiler, I would say), but I am all ears for ideas ;))
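
As an illustration of why an estimate taken before inlining can be off on
code like the quoted testcase: each template level looks like a call plus
some arithmetic, yet the fully inlined chain is a single straight-line
expression. A minimal sketch, where dot and dot2_flat are hypothetical
names mirroring accu1 from the testcase:

```cpp
// Hypothetical helper with the same recursive shape as accu1<n>.
template <int n>
inline double dot(const double* a, const double* b)
{
    return a[0] * b[0] + dot<n - 1>(a + 1, b + 1);
}

// Base case, like accu1<0>.
template <>
inline double dot<0>(const double* a, const double* b)
{
    return a[0] * b[0];
}

// What dot<2> amounts to once every level is inlined: no calls remain,
// only three multiplications and two additions.
inline double dot2_flat(const double* a, const double* b)
{
    return a[0] * b[0] + (a[1] * b[1] + a[2] * b[2]);
}
```

Both dot<2> and dot2_flat compute the same value; whether a compiler
actually reaches the flat form depends on its inlining decisions, which is
exactly what the cost model has to predict.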

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17863
