On 2013-01-23 17:32, Luigi Rizzo wrote:
> Probably our compiler folks have some ideas on this...
> When doing netmap I found that on FreeBSD memcpy/bcopy was expensive,
> __builtin_memcpy() was even worse,
Which compilation flags did you use to test this? When I compiled your
testcase program with clang 3.2, gcc 4.2 and gcc 4.7 at -O2, with all
other settings at their defaults, all three compilers just called libc's
memcpy() for the __builtin_memcpy tests.
For example, with gcc 4.7, the loop in test_builtin_memcpy becomes:
.L116:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rsi
movq %r12, %rdx
call memcpy
movq 24(%rbp), %rax
movq 0(%rbp), %rdi
addq $1, %rax
cmpq %rbx, 4096(%rdi)
movq %rax, 24(%rbp)
jg .L116
The other routines are emitted as similar code. For test_bcopy() the
loop becomes:
.L123:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rsi
movq %r12, %rdx
call bcopy
movq 24(%rbp), %rax
movq 0(%rbp), %rdi
addq $1, %rax
cmpq %rbx, 4096(%rdi)
movq %rax, 24(%rbp)
jg .L123
and similarly, for test_memcpy() it becomes:
.L109:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rdi
movq %r12, %rdx
call memcpy
movq 24(%rbp), %rax
movq 0(%rbp), %rsi
addq $1, %rax
cmpq %rbx, 4096(%rsi)
movq %rax, 24(%rbp)
jg .L109
In our libc, bcopy and memcpy are implemented from the same source file,
with just the arguments swapped around. So I fail to see what could
cause the performance difference between __builtin_memcpy, memcpy and
bcopy that you are seeing.
Also, on amd64, this is implemented in lib/libc/amd64/string/bcopy.S, so
the compiler does not have any influence on its performance. Note the
routine uses "rep movsq" as its main loop, which is apparently not the
best way on modern CPUs. Maybe you have found another instance where
hand-rolled assembly is slower than compiler-optimized code... :-)
With gcc 4.7, your fast_bcopy() gets inlined to this:
.L131:
movq (%rax), %rdx
subl $64, %ecx
movq %rdx, (%rsi)
movq 8(%rax), %rdx
movq %rdx, 8(%rsi)
movq 16(%rax), %rdx
movq %rdx, 16(%rsi)
movq 24(%rax), %rdx
movq %rdx, 24(%rsi)
movq 32(%rax), %rdx
movq %rdx, 32(%rsi)
movq 40(%rax), %rdx
movq %rdx, 40(%rsi)
movq 48(%rax), %r9
movq %r9, 48(%rsi)
movq 56(%rax), %r9
addq $64, %rax
movq %r9, 56(%rsi)
addq $64, %rsi
testl %ecx, %ecx
jg .L131
while clang 3.2 produces:
.LBB14_5:
movq (%rdi), %rcx
movq %rcx, (%rsi)
movq 8(%rdi), %rcx
movq %rcx, 8(%rsi)
movq 16(%rdi), %rcx
movq %rcx, 16(%rsi)
addl $-64, %eax
movq 24(%rdi), %rcx
movq %rcx, 24(%rsi)
testl %eax, %eax
movq 32(%rdi), %rcx
movq %rcx, 32(%rsi)
movq 40(%rdi), %rcx
movq %rcx, 40(%rsi)
movq 48(%rdi), %rcx
movq %rcx, 48(%rsi)
movq 56(%rdi), %rcx
leaq 64(%rdi), %rdi
movq %rcx, 56(%rsi)
leaq 64(%rsi), %rsi
jg .LBB14_5
Both are most likely faster than the "rep movsq" logic in bcopy.S.
> and so I ended up writing
> my custom routine (called pkt_copy() in the program below).
> This happens with gcc 4.2.1, clang, gcc 4.6.4.
> I was then surprised to notice that on a recent Ubuntu using
> gcc 4.6.2 (if that matters) the __builtin_memcpy beats other
> methods by a large factor.
On Ubuntu, I see the same thing as on FreeBSD; __builtin_memcpy just
calls the regular memcpy. However, eglibc's memcpy looks to be more
highly optimized; there are several CPU-specific implementations, for
example for i386 and amd64 arches:
sysdeps/i386/i586/memcpy_chk.S
sysdeps/i386/i586/memcpy.S
sysdeps/i386/i686/memcpy_chk.S
sysdeps/i386/i686/memcpy.S
sysdeps/i386/i686/multiarch/memcpy_chk.S
sysdeps/i386/i686/multiarch/memcpy.S
sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S
sysdeps/i386/i686/multiarch/memcpy-ssse3.S
sysdeps/x86_64/memcpy_chk.S
sysdeps/x86_64/memcpy.S
sysdeps/x86_64/multiarch/memcpy_chk.S
sysdeps/x86_64/multiarch/memcpy.S
sysdeps/x86_64/multiarch/memcpy-ssse3-back.S
sysdeps/x86_64/multiarch/memcpy-ssse3.S
Most likely, your test program on Ubuntu is calling the ssse3 version,
which should be much faster than any of the above loops.
> Here are the numbers in millions of calls per second. Is the test
> program flawed, or is the compiler built with different options?
I think the test program looks fine after lightly skimming it.
FreeBSD's memcpy is probably just slower for the CPUs you have been
testing on.
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"