Hey Andrew,

On 2011/8/22 18:37, Andrew Pinski wrote:
On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky<oleg.smol...@riverbed.com>  wrote:
On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted code
looks very similar. I am puzzled as to why one of these loops is
significantly slower than the other. I've attached disassembled listings -
perhaps someone could have a look please? (the body of the loop starts at
0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
The difference, theoretically, should be due to the inner loop:

v4.6:
.text:0000000000400DA0 loc_400DA0:
.text:0000000000400DA0                 add     eax, 0Ah
.text:0000000000400DA3                 add     al, [rdx]
.text:0000000000400DA5                 add     rdx, 1
.text:0000000000400DA9                 cmp     rdx, 5034E0h
.text:0000000000400DB0                 jnz     short loc_400DA0

v4.1:
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
.text:0000000000400FE7                 add     rdx, 1
.text:0000000000400FEB                 add     eax, 0Ah
.text:0000000000400FEE                 cmp     rdx, 1F40h
.text:0000000000400FF5                 lea     ecx, [rax+rcx]
.text:0000000000400FF8                 jnz     short loc_400FE0

However, I cannot see how the first version would be slow... The custom
templated "shifter" degenerates into "add 0xa", which is the point of the
test... Hmm...
It is slower because of the subregister dependency between eax and al.

Hmm... it is a little difficult to reason about these fragments as they are not equivalent in functionality. The g++ 4.1 version discards the result while the other version (correctly) accumulates. Oh, I've just realized that I grabbed the first iteration of the inner loop, which was factored out (perhaps due to unrolling?). Oops, my apologies.

Here are complete loops, out of a further digested test:

g++ 4.1 (1.35 sec, 1185M ops/s):

.text:0000000000400FDB loc_400FDB:
.text:0000000000400FDB                 xor     ecx, ecx
.text:0000000000400FDD                 xor     edx, edx
.text:0000000000400FDF                 nop
.text:0000000000400FE0
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
.text:0000000000400FE7                 add     rdx, 1
.text:0000000000400FEB                 add     eax, 0Ah
.text:0000000000400FEE                 cmp     rdx, 1F40h
.text:0000000000400FF5                 lea     ecx, [rax+rcx]
.text:0000000000400FF8                 jnz     short loc_400FE0
.text:0000000000400FFA                 movsx   eax, cl
.text:0000000000400FFD                 add     esi, 1
.text:0000000000401000                 add     ebx, eax
.text:0000000000401002                 cmp     esi, edi
.text:0000000000401004                 jnz     short loc_400FDB

g++ 4.6 (2.86 sec, 563M ops/s):

.text:0000000000400D80 loc_400D80:
.text:0000000000400D80                 mov     edx, offset data8
.text:0000000000400D85                 xor     eax, eax
.text:0000000000400D87                 db      66h, 66h
.text:0000000000400D87                 nop
.text:0000000000400D8A                 db      66h, 66h
.text:0000000000400D8A                 nop
.text:0000000000400D8D                 db      66h, 66h
.text:0000000000400D8D                 nop
.text:0000000000400D90
.text:0000000000400D90 loc_400D90:
.text:0000000000400D90                 add     eax, 0Ah
.text:0000000000400D93                 add     al, [rdx]
.text:0000000000400D95                 add     rdx, 1
.text:0000000000400D99                 cmp     rdx, 503480h
.text:0000000000400DA0                 jnz     short loc_400D90
.text:0000000000400DA2                 movsx   eax, al
.text:0000000000400DA5                 add     ecx, 1
.text:0000000000400DA8                 add     ebx, eax
.text:0000000000400DAA                 cmp     ecx, esi
.text:0000000000400DAC                 jnz     short loc_400D80

Your observation still holds - there are two sequential instructions that operate on the same register. So, I manually patched the 4.6 binary's inner loop to the following:

.text:0000000000400D90                 add     al, [rdx]
.text:0000000000400D92                 add     rdx, 1
.text:0000000000400D96                 add     eax, 0Ah
.text:0000000000400D99                 cmp     rdx, 503480h
.text:0000000000400DA0                 jnz     short loc_400D90

and that made no significant difference in performance.
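
For reference, here is a rough C++ sketch of what the two complete inner loops above appear to compute. It is a reading of the disassembly, not the benchmark source: the function names and the test data pattern are made up for illustration. The point is that both forms yield the same 8-bit result, since only the low byte survives the final movsx; the 4.6 form just keeps the running sum in al while also adding to eax.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed shape of the g++ 4.1 loop: zero-extend each byte (movzx), add 10,
// accumulate in a 32-bit register (lea ecx, [rax+rcx]), truncate at the end
// (movsx eax, cl).
std::int8_t sum_wide(const std::vector<std::uint8_t>& data) {
    std::uint32_t acc = 0;
    for (std::uint8_t b : data) {
        acc += static_cast<std::uint32_t>(b) + 10u;
    }
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(acc));
}

// Assumed shape of the g++ 4.6 loop: "add eax, 0Ah" followed by
// "add al, [rdx]". Only the low byte is read afterwards (movsx eax, al),
// so modelling the accumulator as 8 bits gives the same answer.
std::int8_t sum_narrow(const std::vector<std::uint8_t>& data) {
    std::uint8_t acc = 0;
    for (std::uint8_t b : data) {
        acc = static_cast<std::uint8_t>(acc + 10u);  // add eax, 0Ah (low byte)
        acc = static_cast<std::uint8_t>(acc + b);    // add al, [rdx]
    }
    return static_cast<std::int8_t>(acc);
}

// Hypothetical helper: compare both forms on n bytes of arbitrary data.
bool loops_agree(std::size_t n) {
    std::vector<std::uint8_t> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<std::uint8_t>((i * 7u + 3u) & 0xFFu);
    }
    return sum_wide(data) == sum_narrow(data);
}

// 0x1F40 (8000) is the element count taken from "cmp rdx, 1F40h".
void check_loops_agree() {
    assert(loops_agree(0x1F40));
}
```

So the two versions differ in how the accumulation is scheduled across eax/al versus eax/ecx, not in what they compute.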

Is this dependency really a performance issue? BTW, the outer loop executes 200,000 times...
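
As a sanity check on the ops/s figures quoted above, here is a small C++ sketch of the arithmetic. It assumes one "op" per inner-loop iteration, 0x1F40 (8000) elements per pass as in "cmp rdx, 1F40h", and 200,000 outer passes; the constant and function names are mine, not from the benchmark.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumption: one "op" is one inner-loop iteration.
constexpr std::uint64_t kElements = 0x1F40;     // 8000, from "cmp rdx, 1F40h"
constexpr std::uint64_t kOuterPasses = 200000;  // outer loop count from this mail

constexpr std::uint64_t total_ops() {
    return kElements * kOuterPasses;            // 1.6e9 inner iterations
}

// Millions of ops per second for a given wall-clock time in seconds.
double mops_per_sec(double seconds) {
    return static_cast<double>(total_ops()) / seconds / 1e6;
}

// True if the computed rate is within tol (in M ops/s) of a reported figure.
bool close_to_reported(double seconds, double reported_mops, double tol) {
    return std::fabs(mops_per_sec(seconds) - reported_mops) < tol;
}

void check_reported_figures() {
    assert(close_to_reported(1.35, 1185.0, 1.0));  // g++ 4.1 timing
    assert(close_to_reported(2.86, 563.0, 5.0));   // g++ 4.6 timing, rounded
}
```

Under those assumptions the g++ 4.1 figure reproduces almost exactly, and the g++ 4.6 figure comes out within about 1% of the reported one, which is consistent with the timings being rounded.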

Thanks!

Oleg.

P.S. GDB disassembles the v4.6 emitted padding as:

   0x0000000000400d87 <+231>:   data32 xchg ax,ax
   0x0000000000400d8a <+234>:   data32 xchg ax,ax
   0x0000000000400d8d <+237>:   data32 xchg ax,ax
