On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky <oleg.smol...@riverbed.com> wrote:
> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>
>> Both compilers fully inline the templated function and the emitted code
>> looks very similar. I am puzzled as to why one of these loops is
>> significantly slower than the other. I've attached disassembled listings -
>> perhaps someone could have a look please? (the body of the loop starts at
>> 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
>
> The difference, theoretically, should be due to the inner loop:
>
> v4.6:
> .text:0000000000400DA0 loc_400DA0:
> .text:0000000000400DA0                 add     eax, 0Ah
> .text:0000000000400DA3                 add     al, [rdx]
> .text:0000000000400DA5                 add     rdx, 1
> .text:0000000000400DA9                 cmp     rdx, 5034E0h
> .text:0000000000400DB0                 jnz     short loc_400DA0
>
> v4.1:
> .text:0000000000400FE0 loc_400FE0:
> .text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
> .text:0000000000400FE7                 add     rdx, 1
> .text:0000000000400FEB                 add     eax, 0Ah
> .text:0000000000400FEE                 cmp     rdx, 1F40h
> .text:0000000000400FF5                 lea     ecx, [rax+rcx]
> .text:0000000000400FF8                 jnz     short loc_400FE0
>
> However, I cannot see how the first version would be slow... The custom
> templated "shifter" degenerates into "add 0xa", which is the point of the
> test... Hmm...

It is slower because of the subregister dependency between eax and al.

Thanks,
Andrew Pinski
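
To make the dependency concrete: in the v4.6 loop, "add al, [rdx]" writes only the low 8 bits of eax, and the next "add eax, 0Ah" then reads the full 32-bit register, so every iteration mixes a byte write and a dword read on the same register. In the v4.1 loop, "movzx" rewrites all of eax each iteration and the running sum lives in ecx. The sketch below is not the original benchmark. It is a minimal guess at two source shapes that can lead to those two loop forms; the function names are hypothetical, and only data8 and the 8000 (1F40h) element count come from the listing. Whether a given compiler actually emits the partial-register form depends on its version and flags.

```cpp
#include <cstddef>
#include <cstdint>

// Element count taken from the v4.1 listing (cmp rdx, 1F40h == 8000).
constexpr std::size_t SIZE = 8000;

// Stand-in for the benchmark's byte array (called data8 in the listing).
std::uint8_t data8[SIZE];

// 8-bit accumulator (hypothetical name). Every update is byte-wide, so a
// compiler is free to keep the sum in al and combine it with dword
// operations on eax, as in the v4.6 loop.
std::uint8_t sum_byte_acc(const std::uint8_t* first, const std::uint8_t* last) {
    std::uint8_t acc = 0;
    for (; first != last; ++first)
        acc = static_cast<std::uint8_t>(acc + *first + 10);
    return acc;
}

// Full-width accumulator (hypothetical name). Each byte is zero-extended
// before the add and the sum is truncated once at the end, which matches
// the movzx / lea shape of the v4.1 loop.
std::uint8_t sum_wide_acc(const std::uint8_t* first, const std::uint8_t* last) {
    unsigned acc = 0;
    for (; first != last; ++first)
        acc += *first + 10u;
    return static_cast<std::uint8_t>(acc);
}
```

Both functions return the same value modulo 256, so comparing the code generated for sum_byte_acc(data8, data8 + SIZE) and sum_wide_acc(data8, data8 + SIZE) isolates the effect of the accumulator width.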