On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted
code looks very similar. I am puzzled as to why one of these loops is
significantly slower than the other. I've attached disassembled
listings. Could someone have a look, please? (The body of the
loop starts at 0000000000400FD for gcc41 and at 0000000000400D90 for
gcc46.)
The difference, theoretically, should be due to the inner loop:
v4.6:
.text:0000000000400DA0 loc_400DA0:
.text:0000000000400DA0 add eax, 0Ah
.text:0000000000400DA3 add al, [rdx]
.text:0000000000400DA5 add rdx, 1
.text:0000000000400DA9 cmp rdx, 5034E0h
.text:0000000000400DB0 jnz short loc_400DA0
v4.1:
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0 movzx eax, ds:data8[rdx]
.text:0000000000400FE7 add rdx, 1
.text:0000000000400FEB add eax, 0Ah
.text:0000000000400FEE cmp rdx, 1F40h
.text:0000000000400FF5 lea ecx, [rax+rcx]
.text:0000000000400FF8 jnz short loc_400FE0
However, I cannot see why the first version (v4.6) would be slow. The
custom templated "shifter" degenerates into "add 0xa", which is the
point of the test... Hmm...
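
For reference, here is a minimal sketch of the kind of source loop that
could compile down to listings like the two above. It is a
reconstruction, not the actual benchmark source: the names add_constant
and accumulate_shifted are hypothetical, and the 8-bit element type is
only inferred from the "add al" and "movzx" instructions. The array name
data8 and the element count 8000 (1F40h) are taken from the v4.1 listing.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "shifter": adds a compile-time constant N to its argument.
// With N = 10 the optimizer can reduce the call to a single "add 0xa".
template <typename T, int N>
struct add_constant {
    static T apply(T x) { return static_cast<T>(x + N); }
};

// Hypothetical test loop: sums Shifter::apply(element) over the array.
// The accumulator has the element type, so the sum wraps modulo 2^bits.
template <typename T, typename Shifter>
T accumulate_shifted(const T* first, std::size_t count) {
    T result = 0;
    for (std::size_t i = 0; i < count; ++i)
        result = static_cast<T>(result + Shifter::apply(first[i]));
    return result;
}

// 8000 == 0x1F40, the loop bound visible in the v4.1 listing.
const std::size_t SIZE = 8000;

// Assumed to be an array of 8-bit unsigned values.
std::uint8_t data8[SIZE];
```

Calling accumulate_shifted<std::uint8_t, add_constant<std::uint8_t, 10> >(data8, SIZE)
exercises the loop; once everything is inlined, only the load, the
constant add, and the accumulation should remain per iteration.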
Oleg.