gcc will become the best optimizing x86 compiler

2008-07-23 Thread Agner Fog
Hi, I am doing research on optimization of microprocessors and 
compilers. Some of you already know my optimization manuals 
(www.agner.org/optimize/).


I have tested many different compilers and compared how well they 
optimize C++ code. I have been pleased to observe that gcc has improved 
a lot in the last couple of years. The gcc compiler itself now matches 
the optimizing performance of the Intel compiler, and it beats all other 
compilers I have tested. All you hard-working developers 
deserve credit for this!


I can imagine that gcc might be the compiler of choice for all x86 and 
x86-64 platforms in the future. Actually, the compiler itself is very 
close to being the best, but it appears that the function libraries are 
lagging behind. I have tested a few of the most important functions in 
libc and compared them with other available libraries (MS, Borland, 
Intel, Mac). The comparison does not look good for gnu libc. See my test 
results in http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6. 
The 64-bit version is better than the 32-bit version, though.


The first thing that you can do to improve the performance is to drop 
the builtin versions of memory and string functions. The speed can be 
improved by up to a factor of 5 in some cases by compiling with 
-fno-builtin. The builtin version is never optimal, except for memcpy in 
cases where the count is a small compile-time constant so that it can be 
replaced by simple mov instructions.
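
For example (a minimal sketch; the file and function names are only for 
illustration), you can compare the code generated with and without the 
builtin:

// copytest.cpp - illustrative only
#include <string.h>
#include <stddef.h>

// With plain g++ -O2 the call below may be expanded inline as a builtin
// (e.g. rep movs); with g++ -O2 -fno-builtin it becomes a call to the
// memcpy in libc.
void copybuf(void* dst, const void* src, size_t n) {
    memcpy(dst, src, n);
}

Compile both ways and inspect the output:

g++ -O2 -S copytest.cpp -o with_builtin.s
g++ -O2 -fno-builtin -S copytest.cpp -o with_libcall.s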


Next, the function libraries should have CPU-dispatching and use the 
latest instruction sets where appropriate. You are not even using XMM 
registers for memcpy in 64-bit libc.


I think you can borrow code from the Mac/Darwin/Xnu project. They have 
optimized these functions very carefully for the Intel Core and Core 2 
processors. Of course they have the advantage that they don't need to 
support any other processors, whereas gcc has to support every possible 
Intel and AMD processor. This means more CPU-dispatching.


I have made a few optimized functions myself and published them as a 
multi-platform library (www.agner.org/optimize/asmlib.zip). It is faster 
than most other libraries on an Intel Core2 and up to ten times faster 
than gcc using builtin functions. My library is published under the GPL 
license, but I will allow you to use my code in gnu libc if you wish. 
(Sorry, I don't have the time to work on the gnu project myself, but you 
may contact me for details about the code.)


The Windows version of gcc is not up to date, but I think that when gcc 
gets a reputation as the best compiler, more people will be motivated to 
update cygwin/mingw. A lot of people are actually using it.




Re: gcc will become the best optimizing x86 compiler

2008-07-24 Thread Agner Fog

Dennis Clarke wrote:
>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>UltraSparc beats GCC in almost every single test case that I have
>seen.

This is memcpy on Solaris:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s

It uses exactly the same method as memcpy on gcc libc, with only minor 
differences that have no influence on performance.


>Also, you have provided no data at all.


I have linked to the data rather than copying it here to save space on 
the mailing list. Here is the link again:

http://www.agner.org/optimize/optimizing_cpp.pdf  section 2.6, page 12.


>So your assertions are those of a marketing person at the moment.


Who sounds like a marketing person, you or me? :-)

>Please post some code that can be compiled and then tested with high 
>resolution timers and perhaps we can compare notes.

Here is my code, again:
http://www.agner.org/optimize/asmlib.zip
My test results, referred to above, use the "core clock cycles" 
performance counter on Intel and RDTSC on AMD. It's the highest 
resolution you can get. Feel free to do your own tests; it's as simple 
as linking my library into your test program.
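
If anybody wants to roll their own measurements, here is a minimal RDTSC 
sketch (my own example, not part of the library; pin the thread to one 
core and subtract the timing overhead if you want serious numbers):

// rdtsc_test.cpp - time a 4 KB memcpy, illustrative only
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc() (reasonably recent g++)

int main() {
    char src[4096], dst[4096];
    memset(src, 1, sizeof src);

    unsigned long long best = ~0ULL;
    for (int i = 0; i < 1000; ++i) {          // repeat, keep the best run
        unsigned long long t0 = __rdtsc();
        memcpy(dst, src, sizeof src);
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("best: %llu RDTSC ticks for a 4 KB memcpy\n", best);
    return dst[0];   // use the result so the copy is not optimized away
}

Build with g++ -O2 rdtsc_test.cpp (add -fno-builtin to time the library 
version rather than the inlined builtin).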


Tim Prince wrote:
>you identify the library you tested only as "ubuntu g++ 4.2.3."
Where can I see the libc version?

>The corresponding 64-bit linux will see vastly different levels of 
>performance, depending on the glibc version, as it doesn't use a 
>builtin string move.
Yes, this is exactly what my tests show. 64-bit libc is better than 
32-bit libc, but still 3-4 times slower than the best library for 
unaligned operands on an Intel.


>Certain newer CPUs aim to improve performance of the 32-bit gcc 
>builtin string moves, but don't entirely eliminate the situations 
>where it isn't optimum.

The Intel manuals are not clear about this. Intel Optimization reference 
manual says:
>In most cases, applications should take advantage of the default 
>memory routines provided by Intel compilers.

What excellent advice - the Intel compiler puts in a library with an 
automatic run-slowly-on-AMD feature!

The Intel library does not use rep movs when running on an Intel CPU.

The AMD software optimization guide mentions specific situations where 
rep movs is optimal. However, my tests on an Opteron (K8) show that rep 
movs is never optimal on AMD either. I have no access to test it on the 
new AMD K10, but I expect the XMM register code to run much faster on 
K10 than on K8 because K10 has 128-bit data paths whereas K8 has only 
64-bit paths.


Evidently, the problem with memcpy has been ignored for years, see 
http://softwarecommunity.intel.com/Wiki/Linux/719.htm




Re: gcc will become the best optimizing x86 compiler

2008-07-24 Thread Agner Fog

Joseph S. Myers wrote:

>I don't know if it was proposed in this context, but the ARM EABI has
>various __aeabi_mem* functions for calls known to have particular
>alignment and the idea is relevant to other platforms if you provide such
>functions with the compiler. The compiler could also generate calls to
>different functions depending on the -march options and so save the
>runtime CPU check cost (you could have options to call either generic
>versions, or versions for a particular CPU, depending on whether you are
>building a generic binary for CPU-X-or-newer or a binary just for CPU X).

memcpy in the Intel and Mac libraries, as well as my own code, has 
different branches for different alignments and different CPU 
instruction sets. The runtime cost of this branching is negligible 
compared to the gain, even when the byte count is small. There is no 
need to bother the programmer with different versions.
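
A rough sketch of the idea in C++ (illustrative only, not the actual 
library code; the helper names are made up, and the byte loop merely 
stands in for the tuned branches):

#include <stddef.h>
#include <stdint.h>

// Stand-in for the specialized copy loops (aligned SSE2 moves, shifted
// loads, etc.) that a real library would provide for each branch.
static void* copy_fallback(void* dst, const void* src, size_t n) {
    char* d = (char*)dst;
    const char* s = (const char*)src;
    while (n--) *d++ = *s++;
    return dst;
}
static void* copy_small(void* d, const void* s, size_t n) {
    return copy_fallback(d, s, n);
}
static void* copy_aligned16(void* d, const void* s, size_t n) {
    return copy_fallback(d, s, n);
}
static void* copy_same_misalign(void* d, const void* s, size_t n) {
    return copy_fallback(d, s, n);
}
static void* copy_shifted(void* d, const void* s, size_t n) {
    return copy_fallback(d, s, n);
}

// The branching itself is only a few compare-and-jump instructions.
void* my_memcpy(void* dst, const void* src, size_t n) {
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    if (n < 16)              return copy_small(dst, src, n);         // short count
    if (((d | s) & 15) == 0) return copy_aligned16(dst, src, n);     // both 16-byte aligned
    if (((d ^ s) & 15) == 0) return copy_same_misalign(dst, src, n); // same misalignment
    return copy_shifted(dst, src, n);                                // general case
}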


You can just copy the code from the Mac library, or from me.



Re: gcc will become the best optimizing x86 compiler

2008-07-24 Thread Agner Fog

Basile STARYNKEVITCH wrote:
>At last, at the recent (july 2008) GCC summit, someone (sorry I forgot 
>who, probably someone from SuSE) proposed in a BOFS to have architecture 
>and machine specific hand-tuned (or even hand-written assembly) low 
>level libraries for such basic things as memset etc..

That's exactly what I meant. The most important memory, string and math 
functions should use hand-tuned assembly with CPU dispatching for the 
latest instruction sets. My experiments show that the speed can be 
improved by a factor 3 - 10 for unaligned memcpy on Intel processors 
(http://www.agner.org/optimize/optimizing_cpp.pdf page 12).


There will be more hand-tuning work to do when the 256-bit YMM registers 
become available in a few years - and more to gain in speed.


Re: gcc will become the best optimizing x86 compiler

2008-07-25 Thread Agner Fog

Raksit Ashok wrote:
>There is a more optimized version for 64-bit:
>http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>I think this looks similar to your implementation, Agner.

Yes, it is similar to my code.

Gnu libc could borrow a lot of optimized functions from Opensolaris and 
Mac and other open source projects. They look better than Gnu libc, but 
there is still room for improvement. For example, Opensolaris does not 
use XMM registers for strlen, even though using them is simpler than 
using general purpose registers (see my code 
www.agner.org/optimize/asmlib.zip).
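
To show what I mean, here is a rough SSE2 strlen in intrinsics (my own 
simplified sketch, not the asmlib code):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   // SSE2

// Compare 16 bytes at a time against zero. Starting from an address
// rounded down to 16 means every load is aligned and never crosses a
// page boundary, so reading a few bytes outside the string is harmless.
size_t sse2_strlen(const char* str) {
    const __m128i zero = _mm_setzero_si128();
    uintptr_t misalign = (uintptr_t)str & 15;
    const __m128i* p = (const __m128i*)(str - misalign);

    // First block: discard the mask bits for bytes before the string.
    unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(p), zero));
    mask >>= misalign;
    if (mask) return __builtin_ctz(mask);

    // Following blocks, 16 bytes per iteration.
    size_t len = 16 - misalign;
    for (;;) {
        ++p;
        mask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(p), zero));
        if (mask) return len + __builtin_ctz(mask);
        len += 16;
    }
}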


Re: Length-Changing Prefixes problem with the x86 Backend

2008-07-25 Thread Agner Fog

On Thu, 26 Jun 2008 Uros wrote:

>Please also add a runtime test that can be used to analyze the problem.

I am a temporary guest on the gcc mailing list and I haven't seen your 
mail before. In case your problem hasn't been solved yet, I can inform 
you that I have a disassembler which puts comments into the disassembly 
file in case of length-changing prefixes and other sub-optimal or 
illegal instruction codes. Just compile with -c to get an object file 
and run the disassembler on it:

objconv -fasm yourfile.o yourfile.asm

It supports all x86 instruction sets up to SSE4.2 and SSE5 (but not AVX 
and FMA yet). It may be useful for testing other compiler features as 
well, such as support for new instruction sets.


Get it from www.agner.org/optimize/objconv.zip
This is a cross-platform, multi-purpose tool. The assembly output is in 
MASM format, not AT&T syntax. Use .intel_syntax noprefix in case you 
want to assemble the disassembly with GAS.




Re: gcc will become the best optimizing x86 compiler

2008-07-26 Thread Agner Fog

Michael Meissner wrote:

>On Fri, Jul 25, 2008 at 09:08:42AM +0200, Agner Fog wrote:
>>Gnu libc could borrow a lot of optimized functions from Opensolaris and
>>Mac and other open source projects. They look better than Gnu libc, but
>>there is still room for improvement. For example, Opensolaris does not
>>use XMM registers for strlen, although this is simpler than using
>>general purpose registers (see my code www.agner.org/optimize/asmlib.zip)
>
>Note, glibc can only take code that is appropriately licensed and donated to
>the FSF.  In addition it must meet the coding standards for glibc.

The Mac/Xnu and Opensolaris projects have fairly liberal public 
licenses. If there are legal differences, maybe the copyright owner is 
open to negotiation. My own code is published under the GPL license. The 
fact that I am 
offering my code to you also means, of course, that I am willing to 
grant the necessary license.



>Also note, that it depends on the basic chip level what is fastest for the
>operation (for example, using XMM registers are not faster for current AMD
>platforms).
Indeed. That's why I am talking about CPU dispatching (i.e. different 
branches for different CPUs). The CPU dispatching can be done with just 
a single jump instruction:
At the function entry there is an indirect jump through a pointer to the 
appropriate version. The code pointer initially points to a CPU 
dispatcher. The CPU dispatcher detects which CPU it is running on, and 
replaces the code pointer with a pointer to the appropriate version, 
then jumps through the pointer. The next time the function is called, it 
follows the pointer directly to the right version.
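
In C++ terms the pattern is roughly this (the names are made up; a real 
library does the same thing in assembly with an indirect jmp, and 
__builtin_cpu_supports needs a newer gcc; older code reads CPUID itself):

#include <stddef.h>
#include <string.h>

static void* memcpy_sse2(void* dst, const void* src, size_t n);
static void* memcpy_generic(void* dst, const void* src, size_t n);
static void* memcpy_dispatch(void* dst, const void* src, size_t n);

// The code pointer initially points to the dispatcher.
static void* (*memcpy_ptr)(void*, const void*, size_t) = memcpy_dispatch;

static void* memcpy_dispatch(void* dst, const void* src, size_t n) {
    // Detect the CPU once, replace the pointer, then finish this call.
    memcpy_ptr = __builtin_cpu_supports("sse2") ? memcpy_sse2 : memcpy_generic;
    return memcpy_ptr(dst, src, n);
}

// Placeholder bodies; the real versions would be the tuned copy loops.
static void* memcpy_sse2(void* dst, const void* src, size_t n) {
    return memcpy(dst, src, n);
}
static void* memcpy_generic(void* dst, const void* src, size_t n) {
    return memcpy(dst, src, n);
}

// Every later call goes through one indirect jump to the right version.
void* fast_memcpy(void* dst, const void* src, size_t n) {
    return memcpy_ptr(dst, src, n);
}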


My memcpy runs faster with XMM registers than with 64-bit general 
purpose registers on AMD K8.
My strlen runs slower with XMM registers than with 64-bit general 
purpose registers on AMD K8.


I expect the XMM versions to run much faster on AMD K10, because it has 
full 128-bit execution units and data paths, whereas K8 has only 64-bit 
units. I have not had the chance to test this on AMD K10 yet.


I believe it is best to optimize for the newest processors, because the 
processor that is brand new today will become mainstream in a few years.

>Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
>distribution will provide it is a different question:
>http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I have libc version 2.7. Can't find version 2.8.


Re: gcc will become the best optimizing x86 compiler

2008-07-28 Thread Agner Fog

Michael Meissner wrote:
>Memcpy/memset optimizations were added to glibc 2.8, though when your 
>favorite distribution will provide it is a different question:
>http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I finally got a SUSE system with glibc 2.8. I can see that the 32-bit 
memcpy has been modified with an extra misalignment branch, but there is 
no significant improvement. Glibc 2.8 is NOT faster than glibc 2.7 in my 
tests. It still doesn't use XMM registers.


Glibc 2.8 is still almost 5 times slower than the best function 
libraries for unaligned data on Intel Core 2, and the default builtin 
function is slower than any other implementation I have seen (copies 1 
byte at a time!).


Tarjei Knapstad wrote:
>2008/7/26 Agner Fog <[EMAIL PROTECTED]>:
>>I have libc version 2.7. Can't find version 2.8
>It's in Fedora 9, I have no idea why the source isn't directly
>available from the glibc homepage.

2.8 is not an official final release yet.



Re: gcc will become the best optimizing x86 compiler

2008-07-28 Thread Agner Fog

Michael Matz wrote:

>You must be doing something wrong.  If the compiler decides to inline the 
>string ops it either knows the size or you told it to do it anyway 
>(-minline-all-stringops or -minline-stringops-dynamically).  In both cases 
>will it use wider than byte moves when possible.
g++ (v. 4.2.3) without any options converts memcpy with unknown size to 
rep movsb.

g++ with the option -fno-builtin calls memcpy in libc.

The rep movs, stos, scas, cmps instructions are slower than function 
calls except in rare cases. The compiler should never use the string 
instructions. It is OK to use mov instructions if the size is known, but 
not string instructions.


Re: gcc will become the best optimizing x86 compiler

2008-07-28 Thread Agner Fog

Gerald Pfeifer wrote:

>See how user friendly we in GCC-land are in comparison? ;-)
Since there is no libc mailing list, I thought that the gcc list is the 
place to contact the maintainers of libc. Am I on the wrong list? Or are 
there no maintainers of libc?


Re: gcc will become the best optimizing x86 compiler

2008-07-30 Thread Agner Fog

Denys Vlasenko wrote:

>3164 line source file which implements memcpy().
>You got to be kidding.
>How much of L1 icache it blows away in the process?
>I bet it performs wonderfully on microbenchmarks though.

I agree that the OpenSolaris memcpy is bigger than necessary. However, 
it is necessary to have 16 branches for covering all possible alignments 
modulo 16. This is because, unfortunately, there is no XMM shift 
instruction with a variable count, only with a constant count, so we 
need one branch for each value of the shift count. Since only one of the 
branches is used, it doesn't take much space in the code cache. The 
speed is improved by a factor of 4-5 by this 16-branch algorithm, so it 
is certainly worth the extra complexity.
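
The constraint is visible directly in the intrinsics: PALIGNR (like 
PSLLDQ/PSRLDQ) takes its byte count as an immediate, so a construct like 
the following, with one case per possible misalignment, is hard to avoid 
(illustrative C++ with SSSE3 intrinsics, not the actual library code):

// build with g++ -O2 -mssse3
#include <tmmintrin.h>   // SSSE3: _mm_alignr_epi8 (PALIGNR)

// Return the 16 bytes starting 'count' bytes into 'lo', taking the
// remaining bytes from 'hi'. The shift count must be a compile-time
// constant, hence one branch per value.
__m128i shift_pair(__m128i lo, __m128i hi, unsigned count) {
    switch (count) {
    case 1:  return _mm_alignr_epi8(hi, lo, 1);
    case 2:  return _mm_alignr_epi8(hi, lo, 2);
    case 3:  return _mm_alignr_epi8(hi, lo, 3);
    case 4:  return _mm_alignr_epi8(hi, lo, 4);
    case 5:  return _mm_alignr_epi8(hi, lo, 5);
    case 6:  return _mm_alignr_epi8(hi, lo, 6);
    case 7:  return _mm_alignr_epi8(hi, lo, 7);
    case 8:  return _mm_alignr_epi8(hi, lo, 8);
    case 9:  return _mm_alignr_epi8(hi, lo, 9);
    case 10: return _mm_alignr_epi8(hi, lo, 10);
    case 11: return _mm_alignr_epi8(hi, lo, 11);
    case 12: return _mm_alignr_epi8(hi, lo, 12);
    case 13: return _mm_alignr_epi8(hi, lo, 13);
    case 14: return _mm_alignr_epi8(hi, lo, 14);
    case 15: return _mm_alignr_epi8(hi, lo, 15);
    default: return lo;   // count 0: already aligned
    }
}

(In a real implementation the selection happens once, outside the copy 
loop, i.e. there is one copy loop per shift count, so the per-block 
overhead is zero.)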


The future AMD SSE5 instruction set offers a possibility to join the 
many branches into one, but only on AMD processors. Intel is not going 
to support SSE5, and the future Intel AVX instruction set doesn't have 
an instruction that can be used for this purpose. So we will need 
separate branches for Intel and AMD code in future implementations of 
libc. (Explained in www.agner.org/optimize/asmexamples.zip).



"We unrolled the loop two gazillion times and it's 3% faster now"
is a similarly bad idea.
  
I agree completely. My memcpy code is much smaller than the OpenSolaris 
and Mac implementations and approximately equally fast. Some compilers 
unroll loops way too much in my opinion.


Re: gcc will become the best optimizing x86 compiler

2008-07-30 Thread Agner Fog

Denys Vlasenko wrote:

>I tend to doubt that odd-byte aligned large memcpys are anywhere
>near typical. malloc and mmap both return well-aligned buffers
>(say, 8 byte aligned). Static and on-stack objects are also
>at least word-aligned 99% of the time.
>
>memcpy can just use "relatively simple" code for copies in which
>either src or dst is not word aligned. This cuts possibilities down
>from 16 to 4 (or even 2?).
The XMM code is still more than 3 times faster than rep movsl when data 
are aligned by 4 or 8, but not by 16. Even if odd addresses are rare, 
they must be supported, but we can put the most common cases first.

strcpy and strcat can be implemented efficiently simply by calling 
strlen and memcpy, since both strlen and memcpy can be optimized very 
well. This can give unaligned addresses.
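
A trivial sketch of that point (with the plain library strlen and memcpy 
standing in for the optimized versions):

#include <string.h>

// strcpy and strcat built on strlen and memcpy, so they inherit whatever
// optimization those two functions have.
char* my_strcpy(char* dst, const char* src) {
    memcpy(dst, src, strlen(src) + 1);     // +1 copies the terminating zero
    return dst;
}

char* my_strcat(char* dst, const char* src) {
    my_strcpy(dst + strlen(dst), src);     // destination is often unaligned here
    return dst;
}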


Dennis Clarke wrote:

>You forgot to look at PowerPC :
>
>http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s
>
>is that nice and small ?

... and slow. Why doesn't it use AltiVec?