Re: gcc will become the best optimizing x86 compiler

Agner Fog Wed, 30 Jul 2008 10:15:26 -0700

Denys Vlasenko wrote:

> 3164 line source file which implements memcpy().
> You got to be kidding.
> How much of L1 icache it blows away in the process?
> I bet it performs wonderfully on microbenchmarks though.

I agree that the OpenSolaris memcpy is bigger than necessary. However, it is necessary to have 16 branches to cover all possible alignments modulo 16. This is because, unfortunately, there is no XMM shift instruction with a variable count, only with a constant count, so we need one branch for each value of the shift count. Since only one of the branches is used in any given call, it doesn't take much space in the code cache. The speed is improved by a factor of 4-5 by this 16-branch algorithm, so it is certainly worth the extra complexity.
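
To illustrate the point about constant shift counts, here is a minimal sketch in C with SSE2 intrinsics. It is not the OpenSolaris code or my own library code, and the names copy_shifted and SHIFT_CASE are made up for this example. It assumes an x86 target with SSE2, a 16-byte aligned destination, and a 16-byte aligned source buffer that holds at least blocks+1 blocks, so that reading one block ahead stays inside the buffer. The intrinsics _mm_srli_si128 and _mm_slli_si128 only accept a compile-time constant count, which is why every misalignment gets its own loop:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* One copy loop per shift count. N must be a literal constant because the
   byte-shift instructions behind these intrinsics take an immediate operand.
   Each iteration joins two aligned source blocks into one aligned store. */
#define SHIFT_CASE(N)                                                    \
    case N:                                                              \
        for (i = 0; i < blocks; i++) {                                   \
            __m128i lo = _mm_load_si128(s + i);                          \
            __m128i hi = _mm_load_si128(s + i + 1);                      \
            _mm_store_si128(d + i,                                       \
                _mm_or_si128(_mm_srli_si128(lo, (N)),                    \
                             _mm_slli_si128(hi, 16 - (N))));             \
        }                                                                \
        break;

/* Hypothetical helper: copy blocks*16 bytes from src_aligned + offset to dst,
   using only aligned loads and stores. offset is taken modulo 16. */
static void copy_shifted(void *dst, const void *src_aligned,
                         unsigned offset, size_t blocks)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src_aligned;
    size_t i;

    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src_aligned & 15) == 0);

    switch (offset & 15) {
    case 0:  /* no shift needed, plain aligned copy */
        for (i = 0; i < blocks; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
        break;
    SHIFT_CASE(1)  SHIFT_CASE(2)  SHIFT_CASE(3)  SHIFT_CASE(4)
    SHIFT_CASE(5)  SHIFT_CASE(6)  SHIFT_CASE(7)  SHIFT_CASE(8)
    SHIFT_CASE(9)  SHIFT_CASE(10) SHIFT_CASE(11) SHIFT_CASE(12)
    SHIFT_CASE(13) SHIFT_CASE(14) SHIFT_CASE(15)
    }
}
```

Only the loop selected by the switch runs in a given call. On processors with SSSE3 the two shifts and the OR could be replaced by a single _mm_alignr_epi8, but that also takes a constant count, so the 16 branches remain.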

The future AMD SSE5 instruction set offers a possibility to join the many branches into one, but only on AMD processors. Intel is not going to support SSE5, and the future Intel AVX instruction set doesn't have an instruction that can be used for this purpose. So we will need separate branches for Intel and AMD code in future implementations of libc. (Explained in www.agner.org/optimize/asmexamples.zip).

> "We unrolled the loop two gazillion times and it's 3% faster now"
> is a similarly bad idea.

I agree completely. My memcpy code is much smaller than the OpenSolaris and Mac implementations and approximately equally fast. Some compilers unroll loops way too much in my opinion.
