On 2011/8/23 11:38, Xinliang David Li wrote:
> A partial register stall happens when a partial register write is
> followed by a read of the full 32-bit register. In your case, the stall
> probably happens in the next iteration when 'add eax, 0Ah' executes, so
> your manual patch does not work. Try changing
>     add al, [dx]
> into two instructions (assuming esi is available here):
>     movzx esi, ds:data8[dx]
>     add eax, esi

I patched the code to use "movzx edi", but the result is a little clumsy
because the loop is driven by the virtual address rather than an index.
The new sequence is also a bit longer, so I had to spill the patch into
the preceding padding:
.text:0000000000400D80 loc_400D80:
.text:0000000000400D80                 mov     edx, offset data8
.text:0000000000400D85                 xor     eax, eax
.text:0000000000400D87                 nop
.text:0000000000400D88                 nop
.text:0000000000400D89                 nop
.text:0000000000400D8A                 nop
.text:0000000000400D8B                 nop
.text:0000000000400D8C
.text:0000000000400D8C loc_400D8C:
.text:0000000000400D8C                 movzx   edi, byte ptr [rdx+0]
.text:0000000000400D90                 add     eax, edi
.text:0000000000400D92                 add     eax, 0Ah
.text:0000000000400D95                 add     rdx, 1
.text:0000000000400D99                 cmp     rdx, 503480h
.text:0000000000400DA0                 jnz     short loc_400D8C
.text:0000000000400DA2                 movsx   eax, al
.text:0000000000400DA5                 add     ecx, 1
.text:0000000000400DA8                 add     ebx, eax
.text:0000000000400DAA                 cmp     ecx, esi
.text:0000000000400DAC                 jnz     short loc_400D80
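
For anyone wondering why the widened sequence is safe: only the low byte of eax survives the final 'movsx eax, al', so accumulating in 32 bits and truncating at the end should give the same answer as the byte-wide add. Below is a rough C++ sketch of that equivalence. The benchmark source is not shown in this thread, so the function names (sum_narrow, sum_wide) and the exact loop shape are my assumptions, reconstructed from the listing above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the original inner loop: only the low 8 bits of
// the accumulator are kept on each step (the 'add al, [mem]' pattern).
std::int8_t sum_narrow(const std::vector<std::int8_t>& data) {
    std::uint8_t acc = 0;
    for (std::int8_t v : data) {
        // Wraps modulo 256, like arithmetic on an 8-bit register.
        acc = static_cast<std::uint8_t>(acc + static_cast<std::uint8_t>(v) + 10u);
    }
    // Reinterpret the low byte as signed (modular conversion on
    // mainstream compilers, guaranteed since C++20).
    return static_cast<std::int8_t>(acc);
}

// Hypothetical model of the patched inner loop: zero-extend each byte
// (movzx), accumulate in 32 bits (add eax, edi / add eax, 0Ah), and keep
// only the low byte at the end (movsx eax, al).
std::int8_t sum_wide(const std::vector<std::int8_t>& data) {
    std::uint32_t acc = 0;
    for (std::int8_t v : data) {
        std::uint32_t wide = static_cast<std::uint8_t>(v);  // movzx
        acc += wide;                                        // add eax, edi
        acc += 10u;                                         // add eax, 0Ah
    }
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(acc));
}
```

Both functions should agree for any input, because addition modulo 2^32 followed by truncation to 8 bits is the same as addition modulo 2^8.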
The run time improved from 2.84 sec (563.38 M ops/s) to 1.51 sec
(1059.60 M ops/s), which is now close to the code emitted by g++ 4.1.
Very funky!
This is just one test out of the suite, though; many of the others
degraded too. Are you guys interested in looking at the other ones? Or is
there something to be fixed in the register allocation logic?
Oleg.