A small test case is attached to reproduce it.

Here are the timings for different loop header alignments (the default is 64-byte alignment):
linaro@Linaro-test:~$ gcc test1.c -o t.exe && time ./t.exe

real    0m3.206s
user    0m3.203s
sys     0m0.000s
linaro@Linaro-test:~$ gcc test1.c -DALIGNED_2 -o t.exe && time ./t.exe

real    0m2.898s
user    0m2.875s
sys     0m0.016s
linaro@Linaro-test:~$ gcc test1.c -DALIGNED_4 -o t.exe && time ./t.exe

real    0m2.851s
user    0m2.844s
sys     0m0.008s
linaro@Linaro-test:~$ gcc test1.c -DALIGNED_8 -o t.exe && time ./t.exe

real    0m3.167s
user    0m3.156s
sys     0m0.000s
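
For anyone reproducing this, one rough way to check where the inner loop
header actually lands is GCC's labels-as-values extension. This is a
diagnostic sketch only, not part of test1.c, and the extra printf itself
shifts the layout a little, so objdump on the unmodified binary is the more
faithful check:

#include <stdio.h>

volatile float a, b, c;

int main(void)
{
  int i;

  /* &&label yields the label's address (GNU C extension); only
     indicative for an unoptimised build, since the optimiser may
     move code around. */
  printf("loop header at %p (offset in a 64-byte line: %lu)\n",
         &&loop_head, (unsigned long) &&loop_head % 64);

loop_head:
  for (i = 0; i < 65535000; i++)
    a = b + c;
  return 0;
}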

Thanks!
-Zhenqiang

On 23 August 2012 10:09, Michael Hope <michael.h...@linaro.org> wrote:
> Zhenqiang's been working on the later split 2 patch which causes more
> constants to be built using a movw/movt instead of a constant pool
> load.  There was an unexpected ~10 % regression in one benchmark which
> seems to be due to function alignment.  I think we've tracked down the
> reason but not the action.
>
> Compared to the baseline, the split2 branch took 113 % of the time to
> run, i.e. 13 % longer.  Adding an explicit 16 byte alignment to the
> function changed this to 97 % of the time, i.e. 3 % faster.  The
> reason Zhenqiang and I got different results was the build ID.  He
> used the binary build scripts to make the cross compiler; those
> scripts turn on the build ID, which added an extra 20 bytes ahead of
> .text and happened to align the function to 16 bytes.  cbuild doesn't
> use the build ID (although it should), which happened to leave the
> function on an 8 byte boundary.
>
> The disassembly is identical so I assume the regression is cache or
> fast loop related.  I'm not sure what to do, so let's talk about this
> at the next performance call.
>
> -- Michael
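
For reference, the explicit 16-byte function alignment Michael mentions can
be requested with the same GCC attribute the test case below uses.  This is
an illustrative sketch only; hot_loop is a placeholder name, since the
benchmark function isn't shown in the thread.  The same thing can be asked
for build-wide with -falign-functions=16.

/* Illustrative only: force 16-byte alignment of a function's first
   instruction with GCC's aligned attribute.  hot_loop is a placeholder. */
float
__attribute__ ((aligned (16)))
hot_loop(const float *x, int n)
{
  float s = 0.0f;
  int i;
  for (i = 0; i < n; i++)
    s += x[i];
  return s;
}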

[Attachment: test1.asm (binary data)]

test1.c:

volatile float a,b,c;

int
__attribute__ ((aligned(64)))
main() {
  int i,j;
  for (j = 0; j < 4; j++)
    {
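      /* Padding nops to control where the inner loop header lands. */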
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");

/* With no -D option the inner loop header below is 64-byte aligned;
   each optional block of nops shifts it. */
#ifdef ALIGNED_2
      __asm__ ("nop");
#endif

#ifdef ALIGNED_4
      __asm__ ("nop");
      __asm__ ("nop");
#endif

#ifdef ALIGNED_8
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
      __asm__ ("nop");
#endif

      for (i = j; i < 65535000; i++)
        {
          a = b + c;
        }
    }
  return 0;
}
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
