On Fri, Apr 26, 2019 at 02:43:44PM +0800, Kewen.Lin wrote:
> > Does it create worse code now? What we have before your patch isn't
> > so super either (it has an sldi in the loop, it has two mtctr too).
> > Maybe you can show the generated code?
>
> It's a good question! From the generated codes for the core loop, the
> code after my patch doesn't have bdnz to leverage hardware CTR, it has
> extra cmpld and branch instead, looks worse. But I wrote a tiny case
> to invoke the foo and evaluated the running time, they are equal.
>
> * Measured time:
> After:
> real 199.47
> user 198.35
> sys 1.11
> Before:
> real 199.19
> user 198.56
> sys 0.62
Before:
> .L3: # core loop
> stw 10,0(8)
> addi 8,8,-1024
> bdnz .L3
So it didn't use an update instruction here, although it could. Not that
that changes anything: it would still be three cycles per iteration (that's
the minimum for any loop: instruction fetch is the bottleneck).
After:
> .L3: # core loop
> stw 8,0(9)
> addi 9,9,-1024
> cmpld 0,9,10 # cmp
> beqlr 0 # if eq, return
> stw 8,0(9)
> addi 9,9,-1024
> cmpld 0,9,10 # cmp again
> bne 0,.L3 # if ne, jump to L3.
This is unrolled by a factor of 2. It should be faster, but unfortunately it
updates r9 twice per unrolled loop iteration, making the dependency chain too long.
The bdnz loop could be like
0:
stw 10,-1024(8)
bdzlr
stwu 10,-2048(8)
bdnz 0b
or similar. There are multiple problems to solve before we can get that :-)
(The important one is that the pointer (r8 here) should be updated only
once per unrolled loop iteration; just like in the version without bdnz.
Using bdzlr and stwu is just a nicety compared to that.)
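To show the pointer-update point in C rather than assembly, here is a rough
sketch. The names fill_two_updates, fill_one_update and STRIDE are made up for
illustration, and the remainder handling is simplified compared to the
beqlr/bdzlr in the loops above. A compiler is free to turn either form into the
other, so this only shows the shape of the dependency chain, not what GCC will
actually emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 256 words of 4 bytes = 1024 bytes, matching "addi x,x,-1024". */
enum { STRIDE = 256 };

/* Precondition for both functions (an assumption of this sketch):
   p - n * STRIDE must still point into the same array as p. */

/* Unrolled by 2, pointer updated twice per unrolled iteration.
   The second store has to wait for the first pointer update. */
static void fill_two_updates(uint32_t *p, uint32_t val, size_t n)
{
    while (n >= 2) {
        *p = val;
        p -= STRIDE;
        *p = val;
        p -= STRIDE;
        n -= 2;
    }
    if (n)
        *p = val;               /* leftover store for odd n */
}

/* Unrolled by 2, pointer updated once per unrolled iteration.
   Both stores use constant offsets from the same pointer value. */
static void fill_one_update(uint32_t *p, uint32_t val, size_t n)
{
    while (n >= 2) {
        p[0] = val;
        p[-STRIDE] = val;
        p -= 2 * STRIDE;        /* the only pointer update in the loop body */
        n -= 2;
    }
    if (n)
        p[0] = val;             /* leftover store for odd n */
}
```

Both functions write the same n locations; only the number of pointer updates
on the loop-carried chain differs.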
> I practiced whether we can adjust the decision made in ivopts.
[ snip ]
> Need more investigation.
Yeah.
Segher