100x performance regression between gcc 3.4.5 and gcc 4.X

2006-03-11 Thread Nickolay Kolchin
While analyzing the performance of the "bashmark" memory benchmark, I found a
100x performance regression between gcc 3.4.5 and gcc 4.X.

-- test_cmd.cpp (simplified bashmark memory RW test) ---
#include <stdint.h>
#include <string.h>

// Template parameter types are assumed to be uint32_t.
template <uint32_t Block_Size, uint32_t Loops>
static void int_membench(uint8_t* mb1, uint8_t* mb2)
{
  for(uint32_t i = 0; i < Loops; i+=1)
  {
#define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
T T T T T
T T T T T
#undef T
  }
}

template <uint32_t Buf_Size, uint32_t Loops>
static void membench()
{
  static uint8_t mb1[Buf_Size];
  static uint8_t mb2[Buf_Size];
  for(uint32_t i = 0; i < 1; i+=1)
    int_membench<Buf_Size, Loops>(mb1, mb2);
}

int main()
{
  membench<128, 4000>();
  return 0;
}

---
GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
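
For reference, the slowdown implied by the user times above can be worked out directly. The following is only a small sketch of that arithmetic; the helper name slowdown is hypothetical and not part of bashmark or the test program.

```cpp
#include <cassert>

// Ratio of a slow timing to a fast timing, both in seconds.
// Hypothetical helper, used only to work out the figures quoted above.
double slowdown(double slow_seconds, double fast_seconds)
{
    assert(slow_seconds > 0.0 && fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

With the user times listed, slowdown(34.83, 0.43) is about 81 for GCC 4.0.2 and slowdown(33.86, 0.43) is about 79 for GCC 4.1.0, so the measured gap is roughly 80x, the same order of magnitude as the "100x" in the subject.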

Compiler options:
-march=athlon-xp
-O3
-fomit-frame-pointer
-mfpmath=sse -msse
-ftracer -fweb
-maccumulate-outgoing-args
-ffast-math

I tried various settings (-O2, -O1, without -march, without -ftracer and
-fweb, etc.) and none made a significant difference: GCC4 is always many times
slower than GCC 3.4.5.

Looking at the generated assembly showed that GCC4 doesn't inline the memcpy
and memset calls.

-- test.c (uber simplified problem demonstration) -
#include <string.h>

char* f(char* b)
{
  static char a[64];
  memcpy(a, b, 64);
  memset(a, 0, 64);
  return a;
}


GCC4 will generate calls to memcpy and memset in this example. GCC3 will inline
all calls.
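
To make the distinction concrete, here is a minimal sketch of the two shapes the code can take: one that goes through the library routines and one where the fixed 64-byte operations are written out in place. The names f_calls, f_expanded and same_result are hypothetical, and the expanded version is only an illustration of the idea, not the instruction sequence GCC 3.4.5 actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Library-call form, mirroring f() from test.c.
char* f_calls(const char* b)
{
    static char a[64];
    std::memcpy(a, b, 64);
    std::memset(a, 0, 64);
    return a;
}

// Hand-expanded form: fixed-trip-count loops over the same 64 bytes.
// Illustrative only; a real inline expansion would use wider moves.
char* f_expanded(const char* b)
{
    static char a[64];
    for (std::size_t i = 0; i < 64; ++i) a[i] = b[i];  // stands in for memcpy
    for (std::size_t i = 0; i < 64; ++i) a[i] = 0;     // stands in for memset
    return a;
}

// Both forms must leave identical bytes in their buffers.
bool same_result(const char* b)
{
    assert(b != 0);
    return std::memcmp(f_calls(b), f_expanded(b), 64) == 0;
}
```

Because the size is a compile-time constant, the expanded form avoids the call overhead entirely, which matters when the block is as small as 64 or 128 bytes.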

So it looks like the GCC4 inliner is broken at some point.


Re: 100x performance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Nickolay Kolchin
On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> Inlining of memcpy/memset is architecture dependent (I see calls
> on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
> not worth optimizing for.
>

bashmark (http://bashmark.coders-net.de/) is a benchmark. My code is
just a test to demonstrate the problem and as such can't be stupid. :)

A situation where the compiler generates code for a simple test that
runs 100 times slower than the code from the previous compiler version
is not normal anyway. (And GCC3 generates smaller code, too.)

I thought that this regression was caused by different "max-inline-*"
parameter settings in 4.X.

In any case, memcpy/memset inlining is broken in current GCC, at least
on the athlon architecture.

--
Nickolay


Re: 100x performance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Nickolay Kolchin
On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
>
> It is valid.  We should understand why this behavior has changed so
> drastically.
>

I've attached assembler output from different compiler versions:

3.4.5-athlon-xp: gcc-3.4.5 -O3 -march=athlon-xp
3.4.5-pentium4: gcc-3.4.5 -O3 -march=pentium4
4.1.0-athlon-xp: gcc-4.1.0 -O3 -march=athlon-xp

As you can see, gcc-3.4.5 generates the fastest code for
"-march=athlon-xp". This code should also run faster on any pentium
machine.

gcc-4.1.0 generates the "same" slow code for both the "pentium" and
"athlon" architectures.

--
Nickolay


test_cmd-3.4.5-athlon-xp.s
Description: Binary data


test_cmd-3.4.5-pentium4.s
Description: Binary data


test_cmd-4.1.0-athlon-xp.s
Description: Binary data