On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > During performance analysis of the "bashmark" memory benchmark, I found a
> > 100x performance regression between gcc 3.4.5 and gcc 4.X.
> >
> > -- test_cmd.cpp (simplified bashmark memory RW test) ---
> > #include <stdint.h>
> > #include <string.h>
> >
> > template
> > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > {
> > for(uint32_t i = 0; i < Loops; i+=1)
> > {
> > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > T T T T T
> > T T T T T
> > #undef T
> > }
> > }
> >
> > template
> > static void membench()
> > {
> > static uint8_t mb1[Buf_Size];
> > static uint8_t mb2[Buf_Size];
> > for(uint32_t i = 0; i < 1; i+=1)
> > int_membench(mb1, mb2);
> > }
> >
> > int main()
> > {
> > membench<128, 4000>();
> > return 0;
> > }
> >
> > ---
> > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> >
> > Compiler options:
> > -march=athlon-xp
> > -O3
> > -fomit-frame-pointer
> > -mfpmath=sse -msse
> > -ftracer -fweb
> > -maccumulate-outgoing-args
> > -ffast-math
> >
> > I've played with various settings (-O2, -O1, without -march, without
> > -ftracer and -fweb, etc.) without any serious difference, i.e. GCC4 is
> > always many times slower than GCC 3.4.5.
> >
> > Looking at the generated assembly showed that GCC4 doesn't inline the
> > memcpy and memset calls.
> >
> > -- test.c (uber simplified problem demonstration) -
> > #include <string.h>
> >
> > char* f(char* b)
> > {
> > static char a[64];
> > memcpy(a, b, 64);
> > memset(a, 0, 64);
> > return a;
> > }
> >
> >
> > GCC4 will generate calls to memcpy and memset in this example. GCC3 will
> > inline all calls.
> >
> > So, it looks like the GCC4 inliner is broken at some point.
>
> Inlining of memcpy/memset is architecture dependent (I see calls
> on ppc for gcc 3.4, too). This is a stupid benchmark and as such
> not worth optimizing for.
>
bashmark (http://bashmark.coders-net.de/) is a benchmark. My code is
just a test to demonstrate the problem, and as such can't be stupid. :)
A compiler generating code from a simple test that runs 100 times
slower than the code from the previous compiler version is not normal
in any case. (GCC3 also generates smaller code.)
I thought that this regression was caused by different "max-inline-*"
param settings in 4.X.
In any case, memcpy/memset inlining is broken in current GCC, at least
on the athlon arch.
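For anyone who wants to check this on their own target, here is a minimal sketch of the reduced test as a standalone file. The function name copy_then_clear and the BLOCK_SIZE macro are my own naming for illustration and are not taken from bashmark. The commands in the trailing comment use only the standard -O2 and -S options; add the flags from the original report as needed. Treat the grep result as a hint, since the exact assembly differs between versions and targets.

```c
/* test.c: fixed-size memcpy/memset, candidates for inline expansion. */
#include <string.h>

#define BLOCK_SIZE 64  /* illustrative name; same constant size as the reduced test */

/* Copies BLOCK_SIZE bytes from src into a static buffer, then clears it.
   Both sizes are compile-time constants, so the compiler is free to
   expand the calls inline instead of calling the library routines. */
char *copy_then_clear(const char *src)
{
    static char buf[BLOCK_SIZE];

    memcpy(buf, src, BLOCK_SIZE);
    memset(buf, 0, BLOCK_SIZE);
    return buf;
}

/* To see what a given compiler does with the two calls:
     gcc -O2 -S test.c -o test.s
     grep -E 'call.*(memcpy|memset)' test.s
   No match suggests the calls were expanded inline; a match means the
   compiler emitted real calls to the library functions. */
```

Running the same two commands with each compiler version gives a direct side-by-side comparison without needing the full benchmark.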
--
Nickolay