Performance gain through dereferencing?

2014-04-16 Thread Peter Schneider
I have made a curious performance observation with gcc under 64-bit 
Cygwin on a Core i7. I'm genuinely puzzled and couldn't find any 
information about it. Perhaps this is only indirectly a gcc question, 
so bear with me.


I have two trivial programs which assign a loop variable to a local 
variable 10^8 times. One does it the obvious way, the other one accesses 
the variable through a pointer, which means it must dereference the 
pointer first. This is reflected nicely in the disassembly snippets of 
the respective loop bodies below. Funnily enough, the loop with the extra 
dereferencing runs considerably faster than the loop with the direct 
assignment (>10%). While the issue (indeed the whole program ;-) ) goes 
away with optimization, that may not be so in less trivial scenarios.


My first question is: What makes the smaller code slower?
The gcc question is: Should assignment always be performed through a 
pointer if it is faster? (Probably not, but why not?) A session 
transcript including the compilable source is below.


Here are the disassembled loop bodies:

Direct access
=============
localInt = i;
   1004010e6:   8b 45 fc                mov    -0x4(%rbp),%eax
   1004010e9:   89 45 f8                mov    %eax,-0x8(%rbp)


Pointer access
==============
*localP = i;
   1004010ee:   48 8b 45 f0             mov    -0x10(%rbp),%rax
   1004010f2:   8b 55 fc                mov    -0x4(%rbp),%edx
   1004010f5:   89 10                   mov    %edx,(%rax)

Note the first instruction which moves the address into %rax. The other 
two are similar to the direct assignment above.


Here is a session transcript:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: 
/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure 
--srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 
--prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin 
--libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var 
--sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share 
--docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C 
--build=x86_64-pc-cygwin --host=x86_64-pc-cygwin 
--target=x86_64-pc-cygwin --without-libiconv-prefix 
--without-libintl-prefix --enable-shared --enable-shared-libgcc 
--enable-static --enable-version-specific-runtime-libs 
--enable-bootstrap --disable-__cxa_atexit --with-dwarf2 
--with-tune=generic 
--enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite 
--enable-threads=posix --enable-libatomic --enable-libgomp 
--disable-libitm --enable-libquadmath --enable-libquadmath-support 
--enable-libssp --enable-libada --enable-libgcj-sublibs 
--disable-java-awt --disable-symvers 
--with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as 
--with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix 
--without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib

Thread model: posix
gcc version 4.8.2 (GCC)

peter@peter-lap ~/src/test/obj_vs_ptr
$ cat ./t
#!/bin/bash

cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1


peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}
real    0m0.248s
user    0m0.234s
sys     0m0.015s

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}

real    0m0.215s
user    0m0.203s
sys     0m0.000s



Re: Performance gain through dereferencing?

2014-04-16 Thread Peter Schneider

Hi David,

Sorry, I had included more information in an earlier draft which I 
edited out for brevity.


> You cannot learn useful timing
> information from a single run of a short
> test like this - there are far too many
> other factors that come into play.

I didn't mention that I have run it dozens of times. I know that blunt 
runtime measurements on a non-realtime system tend to be 
non-reproducible, and that they are inadequate for exact measurements. 
But the difference here is so large that the result is highly 
significant, in spite of the "amateurish" setup. The run I am showing 
here is typical. One of my four cores is surely idle at any given 
moment, and there is no I/O, so the variations are small.



> You cannot learn useful timing information from unoptimised code.


I beg to disagree. While in this case the problem (and indeed eventually 
the whole program ;-) ) goes away with optimization, that may not be the 
case in less trivial scenarios. And optimization or not -- I would 
always contend that *p = n is **not faster** than i = n. But it is. 
Something is wrong ;-).


So I'd like to direct our attention to the generated code and its 
performance (because such code conceivably could appear as the result of 
an optimized compiler run as well, in less trivial scenarios). What 
puzzles me is: How can it be that two instructions are slower than a 
very similar pair of instructions plus another one? (And that question 
is totally unrelated to optimization.)



> Otherwise the
> result could be nothing more than a quirk of the way caching worked out.


Could you explain how caching could play a role here if all variables 
and addresses are on the stack and are likely to be in the same memory 
page? (I'm not being sarcastic -- I may miss something obvious).


I can imagine that somehow the processor architecture is better utilized 
by the faster version (e.g. because short inner loops pipeline worse or 
whatever). For what it's worth, the programs were running on an i7-3632QM.


Re: Performance gain through dereferencing?

2014-04-16 Thread Peter Schneider
In order to see what difference a different processor makes, I also tried 
the same code on a fairly old 32-bit "AMD Athlon(tm) XP 3000+" with the 
current stable gcc (4.7.2). The difference is even more striking 
(dereferencing is much faster). I see that the size of the code inside 
the loop for the faster pointer access is exactly 8. No idea whether 
that has any significance.


Here as well I performed several runs with similar results. Statistical 
significance was established around n=2 ;-).


gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' 
--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs 
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr 
--program-suffix=-4.7 --enable-shared --enable-linker-build-id 
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext 
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-libstdcxx-time=yes 
--enable-gnu-unique-object --enable-plugin --enable-objc-gc 
--enable-targets=all --with-arch-32=i586 --with-tune=generic 
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu 
--target=i486-linux-gnu

Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

ppeterr@www:~/src/test/obj-vs-ptr$  cat t
#!/bin/bash
cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1

ppeterr@www:~/src/test/obj-vs-ptr$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}

real    0m0.418s
user    0m0.416s
sys     0m0.004s
ppeterr@www:~/src/test/obj-vs-ptr$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}

real    0m0.243s
user    0m0.240s
sys     0m0.000s

===

The disassembly for the direct access (slower) is:

localInt = i;
 80483eb:       8b 45 fc                mov    -0x4(%ebp),%eax
 80483ee:       89 45 f8                mov    %eax,-0x8(%ebp)

And for the pointer access (faster):

*localP = i;
 80483f1:       8b 45 f8                mov    -0x8(%ebp),%eax
 80483f4:       8b 55 fc                mov    -0x4(%ebp),%edx
 80483f7:       89 10                   mov    %edx,(%eax)