Performance gain through dereferencing?
I have made a curious performance observation with gcc under 64-bit Cygwin on a Core i7. I'm genuinely puzzled and couldn't find any information about it. Perhaps this is only indirectly a gcc question, though; bear with me.

I have two trivial programs which assign a loop variable to a local variable 10^8 times. One does it the obvious way; the other accesses the variable through a pointer, which means it must dereference the pointer first. This is reflected nicely in the disassembly snippets of the respective loop bodies below. Funnily enough, the loop with the extra dereferencing runs considerably faster (>10%) than the loop with the direct assignment.

While the issue (indeed the whole program ;-) ) goes away with optimization, that may not be so in less trivial scenarios.

My first question is: What makes the smaller code slower?

The gcc question is: Should assignment always be performed through a pointer if it is faster? (Probably not, but why not?)

A session transcript including the compilable source is below.

Here are the disassembled loop bodies:

Direct access:

    localInt = i;
    1004010e6:  8b 45 fc     mov  -0x4(%rbp),%eax
    1004010e9:  89 45 f8     mov  %eax,-0x8(%rbp)

Pointer access:

    *localP = i;
    1004010ee:  48 8b 45 f0  mov  -0x10(%rbp),%rax
    1004010f2:  8b 55 fc     mov  -0x4(%rbp),%edx
    1004010f5:  89 10        mov  %edx,(%rax)

Note the first instruction, which moves the address into %rax. The other two are similar to the direct assignment above.

Here is a session transcript:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: /cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var --sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --disable-__cxa_atexit --with-dwarf2 --with-tune=generic --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libgomp --disable-libitm --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-libada --enable-libgcj-sublibs --disable-java-awt --disable-symvers --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib
Thread model: posix
gcc version 4.8.2 (GCC)

peter@peter-lap ~/src/test/obj_vs_ptr
$ cat ./t
#!/bin/bash
cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 1; ++i)
        localInt = i;
    return 0;
}

real    0m0.248s
user    0m0.234s
sys     0m0.015s

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 1; ++i)
        *localP = i;
    return 0;
}

real    0m0.215s
user    0m0.203s
sys     0m0.000s
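
The two loops from the post can also be timed inside a single process, which removes process start-up from the measurement and makes repeated runs cheap. The following is only a minimal sketch under assumptions: the function names `time_direct`, `time_pointer` and `best_of` are hypothetical, `clock()` is used as a coarse CPU-time source, and the stack layout inside these functions will not be identical to that of the original `main`, so the numbers need not match the transcript.

```c
#include <assert.h>
#include <time.h>

/* Direct assignment, as in obj.c. Returns elapsed CPU seconds. */
static double time_direct(int n)
{
    int localInt = 0;
    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        localInt = i;
    clock_t end = clock();
    (void)localInt;                 /* silence "set but not used" */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Assignment through a pointer, as in ptr.c. */
static double time_pointer(int n)
{
    int localInt = 0;
    int *localP = &localInt;
    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        *localP = i;
    clock_t end = clock();
    (void)localInt;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Repeats a measurement and keeps the smallest time, which is the
 * run least disturbed by other activity on the machine. */
static double best_of(double (*fn)(int), int n, int runs)
{
    assert(runs > 0);
    double best = fn(n);
    for (int r = 1; r < runs; ++r) {
        double t = fn(n);
        if (t < best)
            best = t;
    }
    return best;
}
```

Calling `best_of(time_direct, 100000000, 10)` and `best_of(time_pointer, 100000000, 10)` from `main` and building with `-std=c99 -O0`, as in the script above, would give one figure per variant. With optimization enabled the stores are dead and the loops may be removed entirely, so the comparison is only meaningful at `-O0`.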
Re: Performance gain through dereferencing?
Hi David,

Sorry, I had included more information in an earlier draft which I edited out for brevity.

> You cannot learn useful timing
> information from a single run of a short
> test like this - there are far too many
> other factors that come into play.

I didn't mention that I have run it dozens of times. I know that blunt runtime measurements on a non-realtime system tend to be non-reproducible, and that they are inadequate for exact measurements. But the difference here is so large that the result is highly significant, in spite of the "amateurish" setup. The run I am showing here is typical. One of my four cores is surely idle at any given moment, and there is no I/O, so the variations are small.

> You cannot learn useful timing information from unoptimised code.

I beg to disagree. While in this case the problem (and indeed eventually the whole program ;-) ) goes away with optimization, that may not be the case in less trivial scenarios. And optimization or not -- I would always contend that i = n is **not slower** than *p = n. But it is. Something is wrong ;-).

So I'd like to direct our attention to the generated code and its performance (because such code conceivably could appear as the result of an optimized compiler run as well, in less trivial scenarios). What puzzles me is: How can it be that two instructions are slower than a very similar pair of instructions plus another one? (And that question is totally unrelated to optimization.)

> Otherwise the result could be nothing more than a quirk of the way caching worked out.

Could you explain how caching could play a role here if all variables and addresses are on the stack and are likely to be in the same memory page? (I'm not being sarcastic -- I may be missing something obvious.) I can imagine that somehow the processor architecture is better utilized by the faster version (e.g. because short inner loops pipeline worse or whatever).

For what it's worth, the programs were running on an i7-3632QM.
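
The claim that the difference stands out from run-to-run noise can be checked mechanically once several timings per variant have been collected. The sketch below is one simple way to do that, not a rigorous statistical test: it reports the median of each sample and whether the two samples overlap at all. The names `median` and `clearly_separated` are hypothetical, and the timings themselves have to come from real runs.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts t in place and returns its median. */
static double median(double *t, size_t n)
{
    assert(n > 0);
    qsort(t, n, sizeof t[0], cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}

/* Returns 1 if every timing in one sample is smaller than every
 * timing in the other, i.e. the two ranges do not overlap. */
static int clearly_separated(const double *a, size_t na,
                             const double *b, size_t nb)
{
    assert(na > 0 && nb > 0);
    double min_a = a[0], max_a = a[0], min_b = b[0], max_b = b[0];
    for (size_t i = 1; i < na; ++i) {
        if (a[i] < min_a) min_a = a[i];
        if (a[i] > max_a) max_a = a[i];
    }
    for (size_t i = 1; i < nb; ++i) {
        if (b[i] < min_b) min_b = b[i];
        if (b[i] > max_b) max_b = b[i];
    }
    return max_a < min_b || max_b < min_a;
}
```

If dozens of `user` times for each program never overlap, a single "typical" run is a fair summary; if they do overlap, the medians are the safer numbers to compare.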
Re: Performance gain through dereferencing?
In order to see what difference a different processor makes, I also tried the same code on a fairly old 32-bit "AMD Athlon(tm) XP 3000+" with the current stable gcc (4.7.2). The difference is even more striking (dereferencing is much faster).

I see that the size of the code inside the loop for the faster pointer access is exactly 8 bytes. No idea whether that has any significance.

Here as well I performed several runs with similar results. Statistical significance was established around n=2 ;-).

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --enable-targets=all --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

ppeterr@www:~/src/test/obj-vs-ptr$ cat t
#!/bin/bash
cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1

ppeterr@www:~/src/test/obj-vs-ptr$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 1; ++i)
        localInt = i;
    return 0;
}

real    0m0.418s
user    0m0.416s
sys     0m0.004s

ppeterr@www:~/src/test/obj-vs-ptr$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 1; ++i)
        *localP = i;
    return 0;
}

real    0m0.243s
user    0m0.240s
sys     0m0.000s

===

The disassembly for the direct access (slower):

    localInt = i;
    80483eb:  8b 45 fc  mov  -0x4(%ebp),%eax
    80483ee:  89 45 f8  mov  %eax,-0x8(%ebp)

And for the pointer access (faster):

    *localP = i;
    80483f1:  8b 45 f8  mov  -0x8(%ebp),%eax
    80483f4:  8b 55 fc  mov  -0x4(%ebp),%edx
    80483f7:  89 10     mov  %edx,(%eax)
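
One way to probe whether the size of the loop body matters is to pad the slower direct-assignment loop with an instruction that does no work and time it again. The sketch below assumes GCC or Clang (it relies on the `__asm__` extension) and a target whose assembler accepts `nop`; the function name `time_direct_padded` is hypothetical, and a single `nop` does not reproduce the exact byte layout of the pointer version.

```c
#include <assert.h>
#include <time.h>

/* Direct assignment plus one padding instruction per iteration.
 * Returns elapsed CPU seconds as measured by clock(). */
static double time_direct_padded(int n)
{
    int localInt = 0;
    clock_t start = clock();
    for (int i = 0; i < n; ++i) {
        localInt = i;
        __asm__ volatile ("nop");   /* enlarges the loop, does no work */
    }
    clock_t end = clock();
    (void)localInt;                 /* silence "set but not used" */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

If the padded loop, built with `-O0`, turns out no slower than the unpadded one, that would suggest the instruction count alone does not explain the timings; it would not by itself say what does.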