On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky <oleg.smol...@riverbed.com> wrote:
> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>
>> Both compilers fully inline the templated function and the emitted code
>> looks very similar. I am puzzled as to why one of these loops is
>> significantly slower than the other. I've attached disassembled listings -
>> perhaps someone could have a look please? (the body of the loop starts at
>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>
> The difference, theoretically, should be due to the inner loop:
>
> v4.6:
> .text:0000000000400DA0 loc_400DA0:
> .text:0000000000400DA0                 add     eax, 0Ah
> .text:0000000000400DA3                 add     al, [rdx]
> .text:0000000000400DA5                 add     rdx, 1
> .text:0000000000400DA9                 cmp     rdx, 5034E0h
> .text:0000000000400DB0                 jnz     short loc_400DA0
>
> v4.1:
> .text:0000000000400FE0 loc_400FE0:
> .text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
> .text:0000000000400FE7                 add     rdx, 1
> .text:0000000000400FEB                 add     eax, 0Ah
> .text:0000000000400FEE                 cmp     rdx, 1F40h
> .text:0000000000400FF5                 lea     ecx, [rax+rcx]
> .text:0000000000400FF8                 jnz     short loc_400FE0
>
> However, I cannot see how the first version would be slow... The custom
> templated "shifter" degenerates into "add 0xa", which is the point of the
> test... Hmm...

It is slower because of the subregister dependency between eax and al.
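
For reference, here is a minimal sketch of the kind of loop that could produce the listings above. It is not the original benchmark source: the names add_constant and sum_shifted are hypothetical, and the 8-bit element type and the constant 10 are inferred from "data8" and "add 0Ah" in the disassembly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the custom templated "shifter" that
// degenerates into "add 0xa" once inlined.
template <typename T>
struct add_constant {
    static T do_shift(T v) { return static_cast<T>(v + 10); }
};

// Hypothetical reduction loop: accumulates shifted elements into a T.
// With T = std::uint8_t the accumulator is only 8 bits wide, so the sum
// wraps modulo 256.
template <typename T, typename Shifter>
T sum_shifted(const T* first, std::size_t count) {
    T result = 0;
    for (std::size_t n = 0; n < count; ++n) {
        result = static_cast<T>(result + Shifter::do_shift(first[n]));
    }
    return result;
}
```

In the 4.6 listing both "add eax, 0Ah" and "add al, [rdx]" operate on the accumulator, so every iteration writes al and then reads the full eax again. In the 4.1 listing the load and the "add eax, 0Ah" do not depend on the previous iteration, and only the "lea ecx, [rax+rcx]" is carried around the loop. How much the eax/al mix costs depends on the microarchitecture.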

Thanks,
Andrew Pinski
