On Mon, Sep 17, 2018 at 10:50:58AM +0200, Ilya Leoshkevich wrote:
> > Am 14.09.2018 um 23:35 schrieb Segher Boessenkool
> > <[email protected]>:
> > Could you please show generated code before and after this patch?
> > I mean generated assembler code. What -S gives you.
>
> Before:
>
> foo4:
> .LFB0:
> .cfi_startproc
> lt %r1,0(%r2)
> jne .L2
> lhi %r3,1
> cs %r1,%r3,0(%r2)
> .L2:
> jne .L5
> br %r14
> .L5:
> jg bar
>
> After:
>
> foo4:
> .LFB0:
> .cfi_startproc
> lt %r1,0(%r2)
> jne .L4
> lhi %r3,1
> cs %r1,%r3,0(%r2)
> jne .L4
> br %r14
> .L4:
> jg bar
Ah. And a compiler of some weeks old gives
foo4:
.LFB0:
.cfi_startproc
lhi %r3,0
lhi %r4,1
cs %r3,%r4,0(%r2)
jne .L4
br %r14
.L4:
jg bar
so this is all caused by the recent optimisation that avoids the "cs" if
it can.
> > Why does the existing jump threading not work for you; should it happen
> > at another time?
>
> We call cleanup_cfg (CLEANUP_THREADING) only once - during the „jump“
> pass, which happens before combine. There is also „jump2“ pass, which
> happens afterwards, and after discussion with Ulrich Weigand I tried to
> move jump threading there. While this change had the desired effect on
> the testcase, the code got worse in another places. Example from
> GemsFDTD:
>
> Before:
>
> 103e96c: b9 04 00 31 lgr %r3,%r1
> 103e970: a7 18 00 00 lhi %r1,0
> 103e974: e3 20 f0 d0 00 0d dsg %r2,208(%r15)
> 103e97a: b9 20 00 c3 cgr %r12,%r3
> 103e97e: a7 29 ff ff lghi %r2,-1
> 103e982: ec 12 00 01 00 42 lochih %r1,1
> 103e988: e3 20 f0 f8 00 82 xg %r2,248(%r15)
> 103e98e: 1a 15 ar %r1,%r5
> 103e990: b9 e9 c0 72 sgrk %r7,%r2,%r12
> 103e994: b3 c1 00 e2 ldgr %f14,%r2
> 103e998: b9 f8 a0 21 ark %r2,%r1,%r10
> 103e99c: 12 99 ltr %r9,%r9
> 103e99e: a7 18 00 00 lhi %r1,0
> 103e9a2: ec 1c 00 01 00 42 lochile %r1,1
> 103e9a8: 50 10 f1 28 st %r1,296(%r15)
> 103e9ac: c2 9d 00 00 00 00 cfi %r9,0
> 103e9b2: c0 c4 00 01 28 4e jgle 1063a4e
> 103e9b8: e5 4c f1 08 00 00 mvhi 264(%r15),0
> 103e9be: eb 14 00 03 00 0d sllg %r1,%r4,3
>
> After:
>
> 103e96c: b9 04 00 31 lgr %r3,%r1
> 103e970: e5 4c f1 18 00 01 mvhi 280(%r15),1
> 103e976: a7 18 00 00 lhi %r1,0
> 103e97a: e3 20 f0 d0 00 0d dsg %r2,208(%r15)
> 103e980: b9 20 00 c3 cgr %r12,%r3
> 103e984: a7 29 ff ff lghi %r2,-1
> 103e988: ec 12 00 01 00 42 lochih %r1,1
> 103e98e: e3 20 f0 f8 00 82 xg %r2,248(%r15)
> 103e994: 1a 15 ar %r1,%r5
> 103e996: b3 c1 00 e2 ldgr %f14,%r2
> 103e99a: b9 e9 c0 72 sgrk %r7,%r2,%r12
> 103e99e: b9 f8 a0 21 ark %r2,%r1,%r10
> 103e9a2: c2 9d 00 00 00 00 cfi %r9,0
> 103e9a8: c0 c4 00 01 28 51 jgle 1063a4a
> 103e9ae: e5 4c f1 18 00 00 mvhi 280(%r15),0
> 103e9b4: e5 4c f1 08 00 00 mvhi 264(%r15),0
> 103e9ba: eb 14 00 03 00 0d sllg %r1,%r4,3
>
> Diff:
>
> lgr %rn,%rn
> - mvhi m(%rx),1
> lhi %rn,0
>
> ar %rn,%rn
> - ldgr %fn,%rn
> sgrk %rn,%rn,%rn
> + ldgr %fn,%rn
> ark %rn,%rn,%rn
> + ltr %rn,%rn
> + lhi %rn,0
> + lochile %rn,1
> + st %rn,m(%rx)
> cfi %rn,0
>
> mvhi m(%rx),0
> - mvhi m(%rx),0
> sllg %rn,%rn,3
>
> The code used to compare %r9 with 0, jump based on the comparison
> result, and set the local variable in the branch. Now it compares %r9
> with 0 twice - I didn’t look at the details yet, but apparently we’re
> missing some optimization because jump threading is done too late.
Yeah, jump2 is quite late. I think you should do it before postreload_cse
or similar.
Btw, what percentage of files (or subroutines) does jump threading run
after combine, with your patch?
Segher