Hey Andrew,
On 2011/8/22 18:37, Andrew Pinski wrote:
On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky <oleg.smol...@riverbed.com> wrote:
On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted code
looks very similar. I am puzzled as to why one of these loops is
significantly slower than the other. I've attached disassembled listings -
perhaps someone could have a look please? (the body of the loop starts at
0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
The difference, theoretically, should be due to the inner loop:
v4.6:
.text:0000000000400DA0 loc_400DA0:
.text:0000000000400DA0 add eax, 0Ah
.text:0000000000400DA3 add al, [rdx]
.text:0000000000400DA5 add rdx, 1
.text:0000000000400DA9 cmp rdx, 5034E0h
.text:0000000000400DB0 jnz short loc_400DA0
v4.1:
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0 movzx eax, ds:data8[rdx]
.text:0000000000400FE7 add rdx, 1
.text:0000000000400FEB add eax, 0Ah
.text:0000000000400FEE cmp rdx, 1F40h
.text:0000000000400FF5 lea ecx, [rax+rcx]
.text:0000000000400FF8 jnz short loc_400FE0
However, I cannot see how the first version would be slow... The custom
templated "shifter" degenerates into "add 0xa", which is the point of the
test... Hmm...
It is slower because of the subregister dependency between eax and al.
Hmm... it is a little difficult to reason about these fragments, as they
are not equivalent in functionality: the g++ 4.1 version discards the
result, while the other version (correctly) accumulates it. Oh, I've just
realized that I grabbed the first iteration of the inner loop, which had
been factored out (perhaps due to unrolling?). Oops, my apologies.
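For context, the kernel under test boils down to roughly the following
(a sketch from memory; the names and constants are approximate, not the
exact benchmark source):

#include <cstdint>
#include <cstdio>

// Approximate shape of the digested test (not the exact benchmark source).
const int SIZE = 8000;           // inner trip count (0x1F40 in the 4.1 listing)
const int ITERATIONS = 200000;   // outer trip count
static int8_t data8[SIZE];

// The templated "shifter": once SHIFT is folded in, it is just "add 0xa".
template <typename T, int SHIFT>
struct constant_add {
    static T apply(T v) { return (T)(v + SHIFT); }
};

int main() {
    for (int n = 0; n < SIZE; ++n)
        data8[n] = (int8_t)n;        // arbitrary initialization

    int64_t check = 0;
    for (int i = 0; i < ITERATIONS; ++i) {
        int8_t result = 0;           // 8-bit accumulator; 4.6 keeps it in al
        for (int n = 0; n < SIZE; ++n)
            result += constant_add<int8_t, 10>::apply(data8[n]);
        check += result;             // the movsx/add pair in the outer loop
    }
    printf("%lld\n", (long long)check);
    return 0;
}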
Here are the complete loops, taken from a further digested version of the test:
g++ 4.1 (1.35 sec, 1185M ops/s):
.text:0000000000400FDB loc_400FDB:
.text:0000000000400FDB xor ecx, ecx
.text:0000000000400FDD xor edx, edx
.text:0000000000400FDF nop
.text:0000000000400FE0
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0 movzx eax, ds:data8[rdx]
.text:0000000000400FE7 add rdx, 1
.text:0000000000400FEB add eax, 0Ah
.text:0000000000400FEE cmp rdx, 1F40h
.text:0000000000400FF5 lea ecx, [rax+rcx]
.text:0000000000400FF8 jnz short loc_400FE0
.text:0000000000400FFA movsx eax, cl
.text:0000000000400FFD add esi, 1
.text:0000000000401000 add ebx, eax
.text:0000000000401002 cmp esi, edi
.text:0000000000401004 jnz short loc_400FDB
g++ 4.6 (2.86 sec, 563M ops/s):
.text:0000000000400D80 loc_400D80:
.text:0000000000400D80 mov edx, offset data8
.text:0000000000400D85 xor eax, eax
.text:0000000000400D87 db 66h, 66h
.text:0000000000400D87 nop
.text:0000000000400D8A db 66h, 66h
.text:0000000000400D8A nop
.text:0000000000400D8D db 66h, 66h
.text:0000000000400D8D nop
.text:0000000000400D90
.text:0000000000400D90 loc_400D90:
.text:0000000000400D90 add eax, 0Ah
.text:0000000000400D93 add al, [rdx]
.text:0000000000400D95 add rdx, 1
.text:0000000000400D99 cmp rdx, 503480h
.text:0000000000400DA0 jnz short loc_400D90
.text:0000000000400DA2 movsx eax, al
.text:0000000000400DA5 add ecx, 1
.text:0000000000400DA8 add ebx, eax
.text:0000000000400DAA cmp ecx, esi
.text:0000000000400DAC jnz short loc_400D80
Your observation still holds: there are two back-to-back instructions
that operate on (parts of) the same register. So, I manually patched the
inner loop of the 4.6 binary to the following:
.text:0000000000400D90 add al, [rdx]
.text:0000000000400D92 add rdx, 1
.text:0000000000400D96 add eax, 0Ah
.text:0000000000400D99 cmp rdx, 503480h
.text:0000000000400DA0 jnz short loc_400D90
and that made no significant difference in performance.
Is this dependency really a performance issue? BTW, the outer loop
executes 200,000 times...
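If it is, then a source-level variant that keeps the sum in a full-width
register should sidestep the al/eax merging entirely. Something along
these lines (a hypothetical variant, not the actual benchmark code):

// Hypothetical variant: accumulate into a full-width int so the compiler
// has no reason to keep the running sum in the 8-bit subregister al.
// The low 8 bits of the sum come out the same either way, since the
// additions are congruent mod 256.
int8_t sum_widened(const int8_t* data, int n) {
    int result = 0;                 // 32-bit accumulator
    for (int i = 0; i < n; ++i)
        result += data[i] + 10;     // same "add 0xa" per element
    return (int8_t)result;          // narrow once, at the end
}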
Thanks!
Oleg.
P.S. GDB disassembles the v4.6 emitted padding as:
0x0000000000400d87 <+231>: data32 xchg ax,ax
0x0000000000400d8a <+234>: data32 xchg ax,ax
0x0000000000400d8d <+237>: data32 xchg ax,ax