On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted code looks very similar. I am puzzled as to why one of these loops is significantly slower than the other. I've attached disassembled listings - perhaps someone could have a look please? (the body of the loop starts at 0000000000400FD for gcc41 and at 0000000000400D90 for gcc46)
The difference, theoretically, should be due to the inner loop:

v4.6:
.text:0000000000400DA0 loc_400DA0:
.text:0000000000400DA0                 add     eax, 0Ah
.text:0000000000400DA3                 add     al, [rdx]
.text:0000000000400DA5                 add     rdx, 1
.text:0000000000400DA9                 cmp     rdx, 5034E0h
.text:0000000000400DB0                 jnz     short loc_400DA0

v4.1:
.text:0000000000400FE0 loc_400FE0:
.text:0000000000400FE0                 movzx   eax, ds:data8[rdx]
.text:0000000000400FE7                 add     rdx, 1
.text:0000000000400FEB                 add     eax, 0Ah
.text:0000000000400FEE                 cmp     rdx, 1F40h
.text:0000000000400FF5                 lea     ecx, [rax+rcx]
.text:0000000000400FF8                 jnz     short loc_400FE0

However, I cannot see why the first version (v4.6) would be slow... The custom templated "shifter" degenerates into "add 0xa" (the "add eax, 0Ah" in both listings), which is the point of the test... Hmm...
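
For reference, here is a rough C++ sketch of what the two listings appear to compute. It is my reading of the disassembly, not the actual benchmark source: the names AddTen, sum_narrow and sum_wide are made up for illustration, and the 8-bit result type is an assumption based on the "add al, [rdx]" in the v4.6 loop. The only difference between the two functions is the width of the accumulator; the low byte of the result is the same.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the templated "shifter": it just adds 10,
// which is what both compilers reduce it to ("add eax, 0Ah").
struct AddTen {
    template <typename T>
    static T apply(T v) { return static_cast<T>(v + 10); }
};

// Roughly the shape of the v4.6 loop: one running sum, and each element
// is folded into its low byte only (compare "add al, [rdx]").
template <typename Shifter>
std::uint8_t sum_narrow(const std::uint8_t* first, std::size_t n) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc = static_cast<std::uint8_t>(acc + Shifter::apply(first[i]));
    return acc;
}

// Roughly the shape of the v4.1 loop: zero-extend the element (movzx),
// add 10, accumulate in a full 32-bit register (lea ecx, [rax+rcx]),
// and truncate only once at the end.
template <typename Shifter>
std::uint8_t sum_wide(const std::uint8_t* first, std::size_t n) {
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += Shifter::apply(static_cast<std::uint32_t>(first[i]));
    return static_cast<std::uint8_t>(acc);
}
```

Calling sum_narrow<AddTen>(data8, 8000) and sum_wide<AddTen>(data8, 8000) on the same buffer (8000 = 1F40h, the bound in the v4.1 loop) should return the same byte, so any timing difference would come from how the accumulation is encoded, not from the work done.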

Oleg.
