gcc will become the best optimizing x86 compiler
Hi, I am doing research on optimization of microprocessors and compilers. Some of you already know my optimization manuals (www.agner.org/optimize/). I have tested many different compilers and compared how well they optimize C++ code. I have been pleased to observe that gcc has improved a lot in the last couple of years. The gcc compiler itself now matches the optimizing performance of the Intel compiler, and it beats all other compilers I have tested. All you hard-working developers deserve credit for this! I can imagine that gcc might become the compiler of choice for all x86 and x86-64 platforms in the future.

The compiler itself is very close to being the best, but it appears that the function libraries are lagging behind. I have tested a few of the most important functions in libc and compared them with other available libraries (MS, Borland, Intel, Mac). The comparison does not look good for gnu libc. See my test results in http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6. The 64-bit version is better than the 32-bit version, though.

The first thing you can do to improve performance is to drop the builtin versions of the memory and string functions. The speed can be improved by up to a factor of 5 in some cases by compiling with -fno-builtin. The builtin version is never optimal, except for memcpy in cases where the count is a small compile-time constant, so that the call can be replaced by simple mov instructions.

Next, the function libraries should have CPU dispatching and use the latest instruction sets where appropriate. You are not even using XMM registers for memcpy in 64-bit libc. I think you can borrow code from the Mac/Darwin/Xnu project. They have optimized these functions very carefully for the Intel Core and Core 2 processors. Of course they have the advantage that they don't need to support any other processors, whereas gcc has to support every possible Intel and AMD processor. This means more CPU dispatching.
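To see the -fno-builtin difference on your own machine, a minimal harness like the one below can be compiled once with default options and once with -fno-builtin (this is my own sketch, not the code behind the published test results; the buffer size, repeat count and deliberate misalignment are arbitrary choices):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `size` bytes `reps` times from a deliberately misaligned source,
   print the elapsed CPU time, and return 1 if the copy is correct. */
static int bench_memcpy(size_t size, int reps) {
    char *buf = malloc(size + 16);
    char *src = buf + 1;                 /* misaligned on purpose */
    char *dst = malloc(size);
    for (size_t i = 0; i < size; i++) src[i] = (char)i;

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        memcpy(dst, src, size);          /* builtin or libc, per -fno-builtin */
    clock_t t1 = clock();

    int ok = memcmp(dst, src, size) == 0;
    printf("%.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(buf); free(dst);
    return ok;
}
```

Since the count is not a compile-time constant here, gcc cannot replace the call with mov instructions, so the two builds exercise the two different memcpy paths.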
I have made a few optimized functions myself and published them as a multi-platform library (www.agner.org/optimize/asmlib.zip). It is faster than most other libraries on an Intel Core 2 and up to ten times faster than gcc using builtin functions. My library is published under the GPL, but I will allow you to use my code in gnu libc if you wish. (Sorry, I don't have the time to work on the gnu project myself, but you may contact me for details about the code.)

The Windows version of gcc is not up to date, but I think that when gcc gets a reputation as the best compiler, more people will be motivated to update cygwin/mingw. A lot of people are actually using it.
Re: gcc will become the best optimizing x86 compiler
Dennis Clarke wrote:
>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>UltraSparc beats GCC in almost every single test case that I have
>seen.

This is memcpy on Solaris:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s
It uses exactly the same method as memcpy in gcc libc, with only minor differences that have no influence on performance.

>Also, you have provided no data at all.

I have linked to the data rather than copying it here to save space on the mailing list. Here is the link again: http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6, page 12.

>So your assertions are those of a marketing person at the moment.

Who sounds like a marketing person, you or me? :-)

>Please post some code that can be compiled and then tested with high resolution timers and perhaps
>we can compare notes.

Here is my code, again: http://www.agner.org/optimize/asmlib.zip
My test results, referred to above, use the "core clock cycles" performance counter on Intel and RDTSC on AMD. That is the highest resolution you can get. Feel free to do your own tests; it's as simple as linking my library into your test program.

Tim Prince wrote:
>you identify the library you tested only as "ubuntu g++ 4.2.3."

Where can I see the libc version?

>The corresponding 64-bit linux will see vastly different levels of performance, depending on the
>glibc version, as it doesn't use a builtin string move.

Yes, this is exactly what my tests show. 64-bit libc is better than 32-bit libc, but still 3-4 times slower than the best library for unaligned operands on an Intel.

>Certain newer CPUs aim to improve performance of the 32-bit gcc builtin string moves, but don't
>entirely eliminate the situations where it isn't optimum.

The Intel manuals are not clear about this. The Intel Optimization Reference Manual says:
>In most cases, applications should take advantage of the default memory routines provided by Intel compilers.
What excellent advice: the Intel compiler puts in a library with an automatic run-slowly-on-AMD feature! The Intel library does not use rep movs when running on an Intel CPU.

The AMD software optimization guide mentions specific situations where rep movs is optimal. However, my tests on an Opteron (K8) show that rep movs is never optimal on AMD either. I have no access to the new AMD K10 for testing, but I expect the XMM register code to run much faster on K10 than on K8, because K10 has 128-bit data paths where K8 has only 64-bit. Evidently, the problem with memcpy has been ignored for years; see http://softwarecommunity.intel.com/Wiki/Linux/719.htm
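For reference, the RDTSC timing mentioned above can be sketched in C with the __rdtsc() intrinsic (my own sketch, x86-only, and without the serializing CPUID fences a careful measurement would add; on newer CPUs RDTSC counts reference clocks rather than core clocks, which is why my tests use the "core clock cycles" counter on Intel instead):

```c
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */
#include <string.h>

/* Time one memcpy call in time-stamp counts. */
static unsigned long long time_memcpy(char *dst, const char *src, size_t n) {
    unsigned long long start = __rdtsc();
    memcpy(dst, src, n);
    return __rdtsc() - start;
}

/* Small self-check: the copy must be correct. */
static int rdtsc_demo(void) {
    char a[256], b[256];
    memset(b, 1, sizeof b);
    unsigned long long t = time_memcpy(a, b, sizeof a);
    (void)t;   /* a single sample is noisy; take the minimum of many runs */
    return memcmp(a, b, sizeof a) == 0;
}
```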
Re: gcc will become the best optimizing x86 compiler
Joseph S. Myers wrote:
>I don't know if it was proposed in this context, but the ARM EABI has
>various __aeabi_mem* functions for calls known to have particular
>alignment and the idea is relevant to other platforms if you provide such
>functions with the compiler. The compiler could also generate calls to
>different functions depending on the -march options and so save the
>runtime CPU check cost (you could have options to call either generic
>versions, or versions for a particular CPU, depending on whether you are
>building a generic binary for CPU-X-or-newer or a binary just for CPU X).

memcpy in the Intel and Mac libraries, as well as my own code, has different branches for different alignments and different CPU instruction sets. The runtime cost of this branching is negligible compared to the gain, even when the byte count is small. There is no need to bother the programmer with different versions. You can just copy the code from the Mac library, or from me.
Re: gcc will become the best optimizing x86 compiler
Basile STARYNKEVITCH wrote:
>At last, at the recent (july 2008) GCC summit, someone (sorry I forgot who, probably someone from SuSE)
>proposed in a BOFS to have architecture and machine specific hand-tuned (or even hand-written assembly) low
>level libraries for such basic things as memset etc.

That's exactly what I meant. The most important memory, string and math functions should use hand-tuned assembly with CPU dispatching for the latest instruction sets. My experiments show that the speed can be improved by a factor of 3 to 10 for unaligned memcpy on Intel processors (http://www.agner.org/optimize/optimizing_cpp.pdf page 12). There will be more hand-tuning work to do when the 256-bit YMM registers become available in a few years, and more to gain in speed.
Re: gcc will become the best optimizing x86 compiler
Raksit Ashok wrote:
>There is a more optimized version for 64-bit:
>http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
>I think this looks similar to your implementation, Agner.

Yes, it is similar to my code. Gnu libc could borrow a lot of optimized functions from OpenSolaris, Mac and other open source projects. They look better than gnu libc, but there is still room for improvement. For example, OpenSolaris does not use XMM registers for strlen, although this is simpler than using general purpose registers (see my code at www.agner.org/optimize/asmlib.zip).
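To illustrate the XMM strlen idea (this is my own C-intrinsics sketch, not the asmlib code): compare 16 bytes at a time against zero with pcmpeqb/pmovmskb. The function reads whole aligned 16-byte blocks, which may run past the terminating zero but never across a page boundary:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* SSE2 strlen sketch. Reads only aligned 16-byte blocks, so it never
   crosses a page boundary, although it may read past the terminator. */
static size_t strlen_sse2(const char *s) {
    const char *aligned = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned off = (unsigned)((uintptr_t)s - (uintptr_t)aligned);
    __m128i zero = _mm_setzero_si128();
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)aligned), zero));
    mask >>= off;                        /* drop bytes before the string */
    if (mask) return (size_t)__builtin_ctz(mask);
    size_t i = 16 - off;                 /* s + i is now 16-byte aligned */
    for (;;) {
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)(s + i)), zero));
        if (mask) return i + (size_t)__builtin_ctz(mask);
        i += 16;
    }
}

/* Self-check inside one padded buffer so all reads stay in bounds. */
static int strlen_demo(void) {
    char buf[64] __attribute__((aligned(16)));
    memset(buf, 'x', sizeof buf);
    buf[5] = 0;
    if (strlen_sse2(buf) != 5) return 0;
    buf[5] = 'x'; buf[0] = 0;
    if (strlen_sse2(buf) != 0) return 0;
    buf[0] = 'x'; buf[37] = 0;           /* misaligned start, later block */
    return strlen_sse2(buf + 1) == 36;
}
```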
Re: Length-Changing Prefixes problem with the x86 Backend
On Thu, 26 Jun 2008 Uros wrote: >Please also add a runtime test that can be used to analyze the problem. I am a temporary guest on the gcc mailing list and I haven't seen your mail before. In case your problem hasn't been solved yet, I can inform you that I have a disassembler which puts comments into the disassembly file in case of length-changing prefixes and other sub-optimal or illegal instruction codes. Just compile with -c to get an object file and run it on the disassembler: objconv -fasm yourfile.o yourfile.asm It supports all x86 instruction sets up to SSE4.2 and SSE5 (but not AVX and FMA yet). It may be useful for testing other compiler features as well, such as support for new instruction sets. Get it from www.agner.org/optimize/objconv.zip This is a cross-platform multi-purpose tool. The assembly output is in MASM format, not AT&T. Use .intel_syntax noprefix in case you want to assemble the disassembly on GAS.
Re: gcc will become the best optimizing x86 compiler
Michael Meissner wrote:
>On Fri, Jul 25, 2008 at 09:08:42AM +0200, Agner Fog wrote:
>>Gnu libc could borrow a lot of optimized functions from Opensolaris and Mac and other open
>>source projects. They look better than Gnu libc, but there is still room for improvement. For
>>example, Opensolaris does not use XMM registers for strlen, although this is simpler than using
>>general purpose registers (see my code www.agner.org/optimize/asmlib.zip)
>Note, glibc can only take code that is appropriately licensed and donated to the FSF. In
>addition it must meet the coding standards for glibc.

The Mac/Xnu and OpenSolaris projects have fairly liberal public licenses. If there are legal differences, maybe the copyright owners are open to negotiation. My own code has a GPL license. The fact that I am offering my code to you also means, of course, that I am willing to grant the necessary license.

>Also note, that it depends on the basic chip level what is fastest for the operation (for
>example, using XMM registers are not faster for current AMD platforms).

Indeed. That's why I am talking about CPU dispatching (i.e. different branches for different CPUs). The CPU dispatching can be done with just a single jump instruction: at the function entry there is an indirect jump through a pointer to the appropriate version. The pointer initially points to a CPU dispatcher. The CPU dispatcher detects which CPU it is running on, replaces the pointer with a pointer to the appropriate version, and then jumps through the pointer. The next time the function is called, it jumps directly to the right version.

My memcpy runs faster with XMM registers than with 64-bit registers on AMD K8. My strlen runs slower with XMM registers than with 64-bit registers on AMD K8. I expect the XMM versions to run much faster on AMD K10, because it has full 128-bit execution units and data paths, where K8 has only 64 bits. I have not had the chance to test this on AMD K10 yet.
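The jump-through-pointer dispatching described above can be sketched in C (a hedged sketch: the function names and the SSE2 check are placeholders of my own, and both branches fall back to libc memcpy here; in a real library each branch would be a different hand-tuned implementation):

```c
#include <string.h>
#include <stddef.h>

/* Two hypothetical implementations; real code would differ per CPU. */
static void *memcpy_generic(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *memcpy_sse2(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

static void *memcpy_dispatch(void *d, const void *s, size_t n);

/* The pointer initially targets the dispatcher. */
static void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy_dispatch;

/* Runs once: detect the CPU, overwrite the pointer, finish the call. */
static void *memcpy_dispatch(void *d, const void *s, size_t n) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    memcpy_ptr = __builtin_cpu_supports("sse2") ? memcpy_sse2 : memcpy_generic;
#else
    memcpy_ptr = memcpy_generic;
#endif
    return memcpy_ptr(d, s, n);
}

/* Public entry: after the first call this is one indirect jump. */
void *fast_memcpy(void *d, const void *s, size_t n) { return memcpy_ptr(d, s, n); }

static int dispatch_demo(void) {
    char src[64], dst[64];
    for (int i = 0; i < 64; i++) src[i] = (char)i;
    fast_memcpy(dst, src, 64);   /* first call goes through the dispatcher */
    fast_memcpy(dst, src, 64);   /* second call goes straight to the chosen version */
    return memcmp(dst, src, 64) == 0 && memcpy_ptr != memcpy_dispatch;
}
```

Glibc nowadays uses the same idea in the form of IFUNC resolvers, where the dynamic linker fills in the pointer at load time instead of on first call.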
I believe it is best to optimize for the newest processors, because the processor that is brand new today will become mainstream in a few years.

>Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
>distribution will provide it is a different question:
>http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I have libc version 2.7. I can't find version 2.8.
Re: gcc will become the best optimizing x86 compiler
Michael Meissner wrote:
>Memcpy/memset optimizations were added to glibc 2.8, though when your favorite
>distribution will provide it is a different question:
>http://sourceware.org/ml/libc-alpha/2008-04/msg00050.html

I finally got a SUSE system with glibc 2.8. I can see that 32-bit memcpy has been modified with an extra misalignment branch, but there is no significant improvement. Glibc 2.8 is NOT faster than glibc 2.7 in my tests. It still doesn't use XMM registers. Glibc 2.8 is still almost 5 times slower than the best function libraries for unaligned data on an Intel Core 2, and the default builtin function is slower than any other implementation I have seen (it copies 1 byte at a time!).

Tarjei Knapstad wrote:
>2008/7/26 Agner Fog <[EMAIL PROTECTED]>:
>>I have libc version 2.7. Can't find version 2.8
>It's in Fedora 9, I have no idea why the source isn't directly
>available from the glibc homepage.

2.8 is not an official final release yet.
Re: gcc will become the best optimizing x86 compiler
Michael Matz wrote:
>You must be doing something wrong. If the compiler decides to inline the string ops it either
>knows the size or you told it to do it anyway (-minline-all-stringops or
>-minline-stringops-dynamically). In both cases will it use wider than byte moves when possible.

g++ (v. 4.2.3) without any options converts memcpy with unknown size to rep movsb.
g++ with option -fno-builtin calls memcpy in libc.

The rep movs, stos, scas and cmps instructions are slower than function calls except in rare cases. The compiler should never use the string instructions. It is OK to use mov instructions if the size is known, but not string instructions.
Re: gcc will become the best optimizing x86 compiler
Gerald Pfeifer wrote:
>See how user friendly we in GCC-land are in comparison? ;-)

Since there is no libc mailing list, I thought that the gcc list is the place to contact the maintainers of libc. Am I on the wrong list? Or are there no maintainers of libc?
Re: gcc will become the best optimizing x86 compiler
Denys Vlasenko wrote:
>3164 line source file which implements memcpy(). You got to be kidding. How much of L1 icache
>it blows away in the process? I bet it performs wonderfully on microbenchmarks though.

I agree that the OpenSolaris memcpy is bigger than necessary. However, it is necessary to have 16 branches to cover all possible alignments modulo 16. This is because, unfortunately, there is no XMM shift instruction with a variable count, only with a constant count, so we need one branch for each value of the shift count. Since only one of the branches is used for a given call, it doesn't take much space in the code cache. The speed is improved by a factor of 4-5 by this 16-branch algorithm, so it is certainly worth the extra complexity.

The future AMD SSE5 instruction set offers a possibility to join the many branches into one, but only on AMD processors. Intel is not going to support SSE5, and the future Intel AVX instruction set doesn't have an instruction that can be used for this purpose. So we will need separate branches for Intel and AMD code in future implementations of libc. (Explained in www.agner.org/optimize/asmexamples.zip.)

>"We unrolled the loop two gazillion times and it's 3% faster now" is a similarly bad idea.

I agree completely. My memcpy code is much smaller than the OpenSolaris and Mac implementations and approximately equally fast. Some compilers unroll loops way too much in my opinion.
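The 16-branch layout can be sketched in C (structure only, my own sketch: each case stands for a copy loop built around palignr or pslldq/psrldq with that constant shift count; the bodies are stubbed with plain memcpy here so the sketch stays portable):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Stub standing in for a hand-tuned loop with a fixed shift count. */
static void copy_shift_k(void *d, const void *s, size_t n) { memcpy(d, s, n); }

/* One branch per source misalignment, because the XMM byte shift only
   takes a compile-time constant count. */
static void memcpy_by_alignment(void *dst, const void *src, size_t n) {
    switch ((uintptr_t)src & 15) {
    case 0:  copy_shift_k(dst, src, n); break;  /* aligned: plain movdqa loop */
    case 1:  copy_shift_k(dst, src, n); break;  /* palignr with count 1 */
    case 2:  copy_shift_k(dst, src, n); break;  /* palignr with count 2 */
    /* cases 3..14 are analogous, one branch per constant shift count */
    default: copy_shift_k(dst, src, n); break;  /* palignr with count 15 */
    }
}

/* Self-check: every misalignment must still copy correctly. */
static int alignment_demo(void) {
    char buf[64], out[48];
    for (int i = 0; i < 64; i++) buf[i] = (char)i;
    for (int off = 0; off < 16; off++) {
        memcpy_by_alignment(out, buf + off, 48);
        if (memcmp(out, buf + off, 48) != 0) return 0;
    }
    return 1;
}
```

Only the branch matching the actual misalignment is executed, which is why the large source file costs little instruction cache in practice.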
Re: gcc will become the best optimizing x86 compiler
Denys Vlasenko wrote:
>I tend to doubt that odd-byte aligned large memcpys are anywhere near typical. malloc and mmap
>both return well-aligned buffers (say, 8 byte aligned). Static and on-stack objects are also at
>least word-aligned 99% of the time. memcpy can just use "relatively simple" code for copies in
>which either src or dst is not word aligned. This cuts possibilities down from 16 to 4 (or even 2?).

The XMM code is still more than 3 times faster than rep movsl when data are aligned by 4 or 8, but not by 16. Even if odd addresses are rare, they must be supported; but we can put the most common cases first. strcpy and strcat can be implemented efficiently simply by calling strlen and memcpy, since both strlen and memcpy can be optimized very well. This can give unaligned addresses.

Dennis Clarke wrote:
>You forgot to look at PowerPC:
>http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s
>is that nice and small?

.. and slow. Why doesn't it use Altivec?
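The strcpy-from-strlen-and-memcpy construction mentioned above can be sketched as (my own illustration; a real library version would call its optimized strlen and memcpy implementations directly):

```c
#include <string.h>

/* strcpy built from strlen + memcpy: both callees can use the
   optimized XMM paths, so the combination inherits their speed. */
static char *strcpy_via_memcpy(char *dst, const char *src) {
    size_t n = strlen(src) + 1;     /* include the terminating NUL */
    memcpy(dst, src, n);
    return dst;
}

/* Self-check on a small buffer. */
static int strcpy_demo(void) {
    char buf[32];
    return strcmp(strcpy_via_memcpy(buf, "hello world"), "hello world") == 0;
}
```

Note that because strlen already found the length, the copy length is known exactly, and dst + n can land at any alignment, which is one way unaligned memcpy calls arise even from well-aligned allocations.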