On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> While analyzing the performance of the "bashmark" memory benchmark, I found a
> 100x performance regression between gcc 3.4.5 and gcc 4.X.
>
> ------ test_cmd.cpp (simplified bashmark memory RW test) -------
> #include <stdint.h>
> #include <cstring>
>
> template <const uint8_t Block_Size, const uint32_t Loops>
> static void int_membench(uint8_t* mb1, uint8_t* mb2)
> {
>   for(uint32_t i = 0; i < Loops; i+=1)
>   {
> #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
>     T T T T T
>     T T T T T
> #undef T
>   }
> }
>
> template <const uint32_t Buf_Size, const uint32_t Loops>
> static void membench()
> {
>   static uint8_t mb1[Buf_Size];
>   static uint8_t mb2[Buf_Size];
>   for(uint32_t i = 0; i < 10000; i+=1)
>     int_membench<Buf_Size, Loops>(mb1, mb2);
> }
>
> int main()
> {
>   membench<128, 4000>();
>   return 0;
> }
>
> ---------------------------------------------------------------
> GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
>
> Compiler options:
>     -march=athlon-xp
>     -O3
>     -fomit-frame-pointer
>     -mfpmath=sse -msse
>     -ftracer -fweb
>     -maccumulate-outgoing-args
>     -ffast-math
>
> I've played with various settings (-O2, -O1, without march, without tracer and
> web, etc.) without any serious difference, i.e. GCC4 is always many times
> slower than GCC 3.4.5.
>
> Looking at the generated assembly showed that GCC4 doesn't inline the memcpy
> and memset calls.
>
> ------ test.c (uber simplified problem demonstration) ---------
> #include <string.h>
>
> char* f(char* b)
> {
>   static char a[64];
>   memcpy(a, b, 64);
>   memset(a, 0, 64);
>   return a;
> }
> ----------------------------------------------------------------
>
> GCC4 will generate calls to memcpy and memset in this example. GCC3 will
> inline all calls.
>
> So, it looks like the GCC4 inliner is broken at some point.

Inlining of memcpy/memset is architecture dependent (I see calls
on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
not worth optimizing for.

Richard.
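
Since whether memcpy/memset get expanded inline depends on the target and
the compiler version, the simplest way to see what a particular setup does
is to look at the generated assembly. Below is a minimal sketch built on the
reduced test case from the original report. f() is taken from the post;
f_loops() is a hypothetical hand-written equivalent added here only for
comparison, and is not something either poster proposed.

```c
#include <stddef.h>
#include <string.h>

/* From the reduced test case: copy 64 bytes into a static buffer,
 * then clear that buffer. The compiler may expand both calls inline
 * or emit real library calls, depending on target and version. */
char *f(char *b)
{
    static char a[64];
    memcpy(a, b, 64);
    memset(a, 0, 64);
    return a;
}

/* Hypothetical comparison version (not from the original post):
 * the same work written as explicit loops, so no library call is
 * requested in the source. */
char *f_loops(const char *b)
{
    static char a[64];
    size_t i;

    for (i = 0; i < sizeof a; i++)
        a[i] = b[i];
    for (i = 0; i < sizeof a; i++)
        a[i] = 0;
    return a;
}
```

Compiling this with something like "gcc -O2 -S test.c" and searching the
resulting test.s for "memcpy" and "memset" shows whether calls were emitted
for f(). Adding -fno-builtin should force the library calls, which gives a
baseline to compare against. The exact output will differ between targets
and GCC versions, so no expected assembly is shown here.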
