Denys Vlasenko wrote:
3164 line source file which implements memcpy().
You got to be kidding.
How much of L1 icache it blows away in the process?
I bet it performs wonderfully on microbenchmarks though.
I agree that the OpenSolaris memcpy is bigger than necessary. However, 16 branches are needed to cover all possible alignments modulo 16. Unfortunately, there is no XMM shift instruction with a variable count, only with a constant count, so we need one branch for each value of the shift count. Since only one of the branches is used in any given call, it doesn't take much space in the code cache. This 16-branch algorithm improves the speed by a factor of 4-5, so it is certainly worth the extra complexity.
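
To make the one-branch-per-shift-count point concrete, here is a minimal C sketch using SSE2 intrinsics. It is not the OpenSolaris code or my own library code; the function name copy_misaligned_src and the macro COPY_SHIFTED are made up for illustration. It assumes the destination is 16-byte aligned, the length is a multiple of 16, and every aligned 16-byte block that overlaps the source range is readable (the loop does aligned loads that can extend up to 15 bytes before and after the source range). The byte shifts _mm_srli_si128 and _mm_slli_si128 only accept compile-time constants, which is why the switch needs a separate loop for each misalignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* One loop per shift count K (1..15). The shift counts must be
 * compile-time constants, so the loop body cannot be shared. */
#define COPY_SHIFTED(K)                                               \
    case K:                                                           \
        for (i = 0; i < nblocks; i++) {                               \
            __m128i hi = _mm_load_si128(s + i + 1);                   \
            __m128i out = _mm_or_si128(_mm_srli_si128(lo, K),         \
                                       _mm_slli_si128(hi, 16 - (K))); \
            _mm_store_si128(d + i, out);                              \
            lo = hi;                                                  \
        }                                                             \
        break;

/* Hypothetical helper: copy n bytes to an aligned dst from a src of
 * any alignment, using only aligned loads and stores.
 * Assumptions: dst is 16-byte aligned, n is a multiple of 16, and the
 * aligned blocks around [src, src + n) are readable. */
void copy_misaligned_src(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);

    size_t shift = (uintptr_t)src & 15;   /* misalignment modulo 16 */
    const __m128i *s = (const __m128i *)((const char *)src - shift);
    __m128i *d = (__m128i *)dst;
    size_t nblocks = n / 16, i;

    if (nblocks == 0)
        return;

    __m128i lo = _mm_load_si128(s);       /* first aligned source block */

    switch (shift) {
    case 0:                               /* already aligned: plain copy */
        for (i = 0; i < nblocks; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
        break;
    COPY_SHIFTED(1)  COPY_SHIFTED(2)  COPY_SHIFTED(3)  COPY_SHIFTED(4)
    COPY_SHIFTED(5)  COPY_SHIFTED(6)  COPY_SHIFTED(7)  COPY_SHIFTED(8)
    COPY_SHIFTED(9)  COPY_SHIFTED(10) COPY_SHIFTED(11) COPY_SHIFTED(12)
    COPY_SHIFTED(13) COPY_SHIFTED(14) COPY_SHIFTED(15)
    }
}
```

Only the branch selected by the switch is executed in a given call, so the other 15 loops occupy space in the binary but not necessarily in the code cache. A real memcpy would also need to handle an unaligned destination and a length that is not a multiple of 16, which this sketch leaves out.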

The future AMD SSE5 instruction set offers a possibility to join the many branches into one, but only on AMD processors. Intel is not going to support SSE5, and the future Intel AVX instruction set doesn't have an instruction that can be used for this purpose. So we will need separate branches for Intel and AMD code in future implementations of libc (explained in www.agner.org/optimize/asmexamples.zip).

"We unrolled the loop two gazillion times and it's 3% faster now"
is a similarly bad idea.
I agree completely. My memcpy code is much smaller than the OpenSolaris and Mac implementations and approximately equally fast. Some compilers unroll loops way too much in my opinion.
