On 2013-01-23 17:32, Luigi Rizzo wrote:
> Probably our compiler folks have some ideas on this...
> When doing netmap I found that on FreeBSD memcpy/bcopy was expensive,
> __builtin_memcpy() was even worse,
Which compilation flags did you use to test this? When I compiled your
testcase program with clang 3.2, gcc 4.2 and gcc 4.7 at -O2, with all
other settings at their defaults, all three compilers just called libc's
memcpy() for the __builtin_memcpy tests.
For example, with gcc 4.7, the loop in test_builtin_memcpy becomes:
.L116:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rsi
movq %r12, %rdx
call memcpy
movq 24(%rbp), %rax
movq 0(%rbp), %rdi
addq $1, %rax
cmpq %rbx, 4096(%rdi)
movq %rax, 24(%rbp)
jg .L116
The other routines are emitted as similar code. For test_bcopy() the
loop becomes:
.L123:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rsi
movq %r12, %rdx
call bcopy
movq 24(%rbp), %rax
movq 0(%rbp), %rdi
addq $1, %rax
cmpq %rbx, 4096(%rdi)
movq %rax, 24(%rbp)
jg .L123
and similarly, for test_memcpy() it becomes:
.L109:
movq %rbx, %rax
addq $1, %rbx
andl $262143, %eax
movq %rax, %rdx
salq $12, %rax
salq $8, %rdx
leaq huge(%rdx,%rax), %rdi
movq %r12, %rdx
call memcpy
movq 24(%rbp), %rax
movq 0(%rbp), %rsi
addq $1, %rax
cmpq %rbx, 4096(%rsi)
movq %rax, 24(%rbp)
jg .L109
In our libc, bcopy and memcpy are implemented from the same source file,
with just the arguments swapped around. So I fail to see what could
cause the performance difference between __builtin_memcpy, memcpy and
bcopy that you are seeing.
Also, on amd64, this is implemented in lib/libc/amd64/string/bcopy.S, so
the compiler does not have any influence on its performance. Note the
routine uses "rep movsq" as its main loop, which is apparently not the
best way on modern CPUs. Maybe you have found another instance where
hand-rolled assembly is slower than compiler-optimized code... :-)
With gcc 4.7, your fast_bcopy() gets inlined to this:
.L131:
movq (%rax), %rdx
subl $64, %ecx
movq %rdx, (%rsi)
movq 8(%rax), %rdx
movq %rdx, 8(%rsi)
movq 16(%rax), %rdx
movq %rdx, 16(%rsi)
movq 24(%rax), %rdx
movq %rdx, 24(%rsi)
movq 32(%rax), %rdx
movq %rdx, 32(%rsi)
movq 40(%rax), %rdx
movq %rdx, 40(%rsi)
movq 48(%rax), %r9
movq %r9, 48(%rsi)
movq 56(%rax), %r9
addq $64, %rax
movq %r9, 56(%rsi)
addq $64, %rsi
testl %ecx, %ecx
jg .L131
while clang 3.2 produces:
.LBB14_5:
movq (%rdi), %rcx
movq %rcx, (%rsi)
movq 8(%rdi), %rcx
movq %rcx, 8(%rsi)
movq 16(%rdi), %rcx
movq %rcx, 16(%rsi)
addl $-64, %eax
movq 24(%rdi), %rcx
movq %rcx, 24(%rsi)
testl %eax, %eax
movq 32(%rdi), %rcx
movq %rcx, 32(%rsi)
movq 40(%rdi), %rcx
movq %rcx, 40(%rsi)
movq 48(%rdi), %rcx
movq %rcx, 48(%rsi)
movq 56(%rdi), %rcx
leaq 64(%rdi), %rdi
movq %rcx, 56(%rsi)
leaq 64(%rsi), %rsi
jg .LBB14_5
Both are most likely faster than the "rep movsq" logic in bcopy.S.
> and so I ended up writing
> my custom routine (called pkt_copy() in the program below).
> This happens with gcc 4.2.1, clang, gcc 4.6.4.
> I was then surprised to notice that on a recent Ubuntu using
> gcc 4.6.2 (if that matters) the __builtin_memcpy beats other
> methods by a large factor.
On Ubuntu, I see the same thing as on FreeBSD; __builtin_memcpy just
calls the regular memcpy. However, eglibc's memcpy looks to be more
highly optimized; there are several CPU-specific implementations, for
example for i386 and amd64 arches:
sysdeps/i386/i586/memcpy_chk.S
sysdeps/i386/i586/memcpy.S
sysdeps/i386/i686/memcpy_chk.S
sysdeps/i386/i686/memcpy.S
sysdeps/i386/i686/multiarch/memcpy_chk.S
sysdeps/i386/i686/multiarch/memcpy.S
sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S
sysdeps/i386/i686/multiarch/memcpy-ssse3.S
sysdeps/x86_64/memcpy_chk.S
sysdeps/x86_64/memcpy.S
sysdeps/x86_64/multiarch/memcpy_chk.S
sysdeps/x86_64/multiarch/memcpy.S
sysdeps/x86_64/multiarch/memcpy-ssse3-back.S
sysdeps/x86_64/multiarch/memcpy-ssse3.S
Most likely, your test program on Ubuntu is calling the ssse3 version,
which should be much faster than any of the above loops.
> Here are the numbers in millions of calls per second. Is the test
> program flawed, or is the compiler built with different options?
I think the test program looks fine after lightly skimming it.
FreeBSD's memcpy is probably just slower for the CPUs you have been
testing on.
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"