On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guent...@gmail.com> wrote: > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni > <prathamesh.kulka...@linaro.org> wrote: > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guent...@gmail.com> > > wrote: > > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener > > > > <richard.guent...@gmail.com> wrote: > > > > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni > > > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener > > > > > > <richard.guent...@gmail.com> wrote: > > > > > > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni > > > > > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni > > > > > > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni > > > > > > > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov > > > > > > > > > > <amona...@ispras.ru> wrote: > > > > > > > > > > > > > > > > > > > > > > > I obtained perf stat results for following benchmark > > > > > > > > > > > > runs: > > > > > > > > > > > > > > > > > > > > > > > > -O2: > > > > > > > > > > > > > > > > > > > > > > > > 7856832.692380 task-clock (msec) # > > > > > > > > > > > > 1.000 CPUs utilized > > > > > > > > > > > > 3758 context-switches > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 40 cpu-migrations > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 40847 page-faults > > > > > > > > > > > > # 0.005 K/sec > > > > > > > > > > > > 7856782413676 cycles > > > > > > > > > > > > # 1.000 GHz > > > > > > > > > > > > 6034510093417 instructions > > > > > > > > > > > > # 0.77 insn per cycle > > > > > > > > > > > > 363937274287 branches > > > > > > > > > > > > # 46.321 M/sec > > > > > > > > > > > > 48557110132 branch-misses # > > > > > > > > > > > > 13.34% of all branches > > > > > > > > > > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile > > > > > > > > > > > over a minute should be > > > > > > > > > > > enough for this kind of code) > > > > > > > > > > > > > > > > > > > > > > > -O2 with orthonl inlined: > > > > > > > > > > > > > > > > > > > > > > > > 8319643.114380 task-clock (msec) # > > > > > > > > > > > > 1.000 CPUs utilized > > > > > > > > > > > > 4285 context-switches > > > > > > > > > > > > # 0.001 K/sec > > > > > > > > > > > > 28 cpu-migrations > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 40843 page-faults > > > > > > > > > > > > # 0.005 K/sec > > > > > > > > > > > > 8319591038295 cycles > > > > > > > > > > > > # 1.000 GHz > > > > > > > > > > > > 6276338800377 instructions # > > > > > > > > > > > > 0.75 insn per cycle > > > > > > > > > > > > 467400726106 branches > > > > > > > > > > > > # 56.180 M/sec > > > > > > > > > > > > 45986364011 branch-misses # > > > > > > > > > > > > 9.84% of all branches > > > > > > > > > > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 > > > > > > > > > > > cycles, probably implying > > > > > > > > > > > that extra instructions are appearing in this loop nest, > > > > > > > > > > > but not in the innermost > > > > > > > > > > > loop. 
As a reminder for others, the innermost loop has > > > > > > > > > > > only 3 iterations. > > > > > > > > > > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes > > > > > > > > > > > > the extra branches): > > > > > > > > > > > > > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 > > > > > > > > > > > > CPUs utilized > > > > > > > > > > > > 2266 context-switches # > > > > > > > > > > > > 0.000 K/sec > > > > > > > > > > > > 32 cpu-migrations > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 40846 page-faults > > > > > > > > > > > > # 0.005 K/sec > > > > > > > > > > > > 8207292032467 cycles # > > > > > > > > > > > > 1.000 GHz > > > > > > > > > > > > 6035724436440 instructions # > > > > > > > > > > > > 0.74 insn per cycle > > > > > > > > > > > > 364415440156 branches # > > > > > > > > > > > > 44.401 M/sec > > > > > > > > > > > > 53138327276 branch-misses # > > > > > > > > > > > > 14.58% of all branches > > > > > > > > > > > > > > > > > > > > > > This seems to match baseline in terms of instruction > > > > > > > > > > > count, but without PRE > > > > > > > > > > > the loop nest may be carrying some dependencies over > > > > > > > > > > > memory. I would simply > > > > > > > > > > > check the assembly for the entire 6-level loop nest in > > > > > > > > > > > question, I hope it's > > > > > > > > > > > not very complicated (though Fortran array addressing...). > > > > > > > > > > > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled: > > > > > > > > > > > > > > > > > > > > > > > > 7797265.206850 task-clock (msec) # > > > > > > > > > > > > 1.000 CPUs utilized > > > > > > > > > > > > 3139 context-switches > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 20 cpu-migrations > > > > > > > > > > > > # 0.000 K/sec > > > > > > > > > > > > 40846 page-faults > > > > > > > > > > > > # 0.005 K/sec > > > > > > > > > > > > 7797221351467 cycles > > > > > > > > > > > > # 1.000 GHz > > > > > > > > > > > > 6187348757324 instructions # > > > > > > > > > > > > 0.79 insn per cycle > > > > > > > > > > > > 461840800061 branches > > > > > > > > > > > > # 59.231 M/sec > > > > > > > > > > > > 26920311761 branch-misses # > > > > > > > > > > > > 5.83% of all branches > > > > > > > > > > > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 > > > > > > > > > > > reduction in cycle count. > > > > > > > > > > > I don't think the former fully covers the latter (there's > > > > > > > > > > > also a 90e9 reduction > > > > > > > > > > > in insn count). > > > > > > > > > > > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main > > > > > > > > > > > suggestion is to > > > > > > > > > > > consider how the profile for the entire loop nest looks > > > > > > > > > > > like (it's 6 loops deep, > > > > > > > > > > > each iterating only 3 times). 
> > > > > > > > > > > > > > > > > > > > > > > Perf profiles for > > > > > > > > > > > > -O2 -fno-code-hoisting and inlined orthonl: > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data > > > > > > > > > > > > > > > > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248] > > > > > > > > > > > > 216348301800│ add w0, w0, #0x1 > > > > > > > > > > > > 985098 | add x2, x2, #0x18 > > > > > > > > > > > > 216215999206│ add x1, x1, #0x48 > > > > > > > > > > > > 215630376504│ fmul d1, d5, d1 > > > > > > > > > > > > 863829148015│ fmul d1, d1, d6 > > > > > > > > > > > > 864228353526│ fmul d0, d1, d0 > > > > > > > > > > > > 864568163014│ fmadd d2, d0, d16, d2 > > > > > > > > > > > > │ cmp w0, #0x4 > > > > > > > > > > > > 216125427594│ ↓ b.eq 1f34 > > > > > > > > > > > > 15010377│ ldur d0, [x2, #-8] > > > > > > > > > > > > 143753737468│ ↑ b 1f04 > > > > > > > > > > > > > > > > > > > > > > > > -O2 with inlined orthonl: > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data > > > > > > > > > > > > > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248] > > > > > > > > > > > > 144055883055│ add w0, w0, #0x1 > > > > > > > > > > > > 72262104254│ add x2, x2, #0x18 > > > > > > > > > > > > 143991169721│ add x1, x1, #0x48 > > > > > > > > > > > > 288648917780│ fmul d15, d17, d15 > > > > > > > > > > > > 864665644756│ fmul d15, d15, d18 > > > > > > > > > > > > 863868426387│ fmul d14, d15, d14 > > > > > > > > > > > > 865228159813│ fmadd d16, d14, d31, d16 > > > > > > > > > > > > 245967│ cmp w0, #0x4 > > > > > > > > > > > > 215396760545│ ↓ b.eq 1f28 > > > > > > > > > > > > 704732365│ ldur d14, [x2, #-8] > > > > > > > > > > > > 143775979620│ ↑ b 1ef8 > > > > > > > > > > > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of > > > > > > > > > > > overall time. > > > > > > > > > > > > > > > > > > > > > > High count on the initial ldur instruction could be > > > > > > > > > > > explained if the loop > > > > > > > > > > > is not entered by "fallthru" from the preceding block, or > > > > > > > > > > > if its backedge > > > > > > > > > > > is mispredicted. Sampling mispredictions should be > > > > > > > > > > > possible with perf record, > > > > > > > > > > > and you may be able to check if loop entry is fallthrough > > > > > > > > > > > by inspecting > > > > > > > > > > > assembly. > > > > > > > > > > > > > > > > > > > > > > It may also be possible to check if code alignment > > > > > > > > > > > matters, by compiling with > > > > > > > > > > > -falign-loops=32. > > > > > > > > > > Hi, > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for > > > > > > > > > > late response. > > > > > > > > > > > > > > > > > > > > The hoisting region is: > > > > > > > > > > if(mattyp.eq.1) then > > > > > > > > > > 4 loops > > > > > > > > > > elseif(mattyp.eq.2) then > > > > > > > > > > { > > > > > > > > > > orthonl inlined into basic block; > > > > > > > > > > loads w[0] .. 
w[8] > > > > > > > > > > } > > > > > > > > > > else > > > > > > > > > > 6 loops // load anisox > > > > > > > > > > > > > > > > > > > > followed by basic block: > > > > > > > > > > > > > > > > > > > > senergy= > > > > > > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1)) > > > > > > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2) > > > > > > > > > > & > > > > > > > > > > +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight > > > > > > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy > > > > > > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy > > > > > > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy > > > > > > > > > > > > > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy > > > > > > > > > > block, > > > > > > > > > > right in block 181, which is: > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193> > > > > > > > > > > > > > > > > > > > > which is then further hoisted to block 173: > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181> > > > > > > > > > > > > > > > > > > > > From block 181, we have two paths towards senergy block (bb > > > > > > > > > > 194): > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block) > > > > > > > > > > AND > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194 > > > > > > > > > > which has a path length of around 18 blocks. > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173). > > > > > > > > > > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181 > > > > > > > > > > (simply avoid inserting pre_expr if pre_expr->kind == > > > > > > > > > > REFERENCE), > > > > > > > > > > avoid hoisting of 'w' array and brings back most of > > > > > > > > > > performance. Which > > > > > > > > > > verifies that it is hoisting of the > > > > > > > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ? > > > > > > > > > > > > > > > > > > > > I obtained perf profiles for full hoisting, and disabled > > > > > > > > > > hoisting of > > > > > > > > > > 'w' array for the 6 loops, and the most drastic difference > > > > > > > > > > was > > > > > > > > > > for ldur instruction: > > > > > > > > > > > > > > > > > > > > With full hoisting: > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248] > > > > > > > > > > > > > > > > > > > > Without full hoisting: > > > > > > > > > > 3441224 │1edc: ldur d1, [x1, #-248] > > > > > > > > > > > > > > > > > > > > (The loop entry seems to be fall thru in both cases. I have > > > > > > > > > > attached > > > > > > > > > > profiles for both cases). > > > > > > > > > > > > > > > > > > > > IIUC, the instruction seems to be loading the first element > > > > > > > > > > from anisox array, > > > > > > > > > > which makes me wonder if the issue was with data-cache miss > > > > > > > > > > for slower version. > > > > > > > > > > I ran perf script on perf data for L1-dcache-load-misses > > > > > > > > > > with period = 1million, > > > > > > > > > > and it reported two cache misses on the ldur instruction in > > > > > > > > > > full hoisting case, > > > > > > > > > > while it reported zero for the disabled load hoisting case. > > > > > > > > > > So I wonder if the slowdown happens because hoisting of 'w' > > > > > > > > > > array > > > > > > > > > > possibly results > > > > > > > > > > in eviction of anisox thus causing a cache miss inside the > > > > > > > > > > inner loop > > > > > > > > > > and making load slower ? > > > > > > > > > > > > > > > > > > > > Hoisting also seems to improve the number of overall cache > > > > > > > > > > misses tho. 
> > > > > > > > > > For disabled hoisting of 'w' array case, there were a > > > > > > > > > > total of 463 > > > > > > > > > > cache misses, while with full hoisting there were 357 cache > > > > > > > > > > misses > > > > > > > > > > (with period = 1 million). > > > > > > > > > > Does that happen because hoisting probably reduces cache > > > > > > > > > > misses along > > > > > > > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ? > > > > > > > > > Hi, > > > > > > > > > In general I feel for this or PR80155 case, the issues come > > > > > > > > > with long > > > > > > > > > range hoistings, inside a large CFG, since we don't have an > > > > > > > > > accurate > > > > > > > > > way to model target resources (register pressure in PR80155 > > > > > > > > > case / or > > > > > > > > > possibly cache pressure in this case?) at tree level and we > > > > > > > > > end up > > > > > > > > > with register spill or cache miss inside loops, which may > > > > > > > > > offset the > > > > > > > > > benefit of hoisting. As previously discussed the right way to > > > > > > > > > go is a > > > > > > > > > live range splitting pass, at GIMPLE -> RTL border which can > > > > > > > > > also help > > > > > > > > > with other code-movement optimizations (or if the source had > > > > > > > > > variables > > > > > > > > > with long live ranges). > > > > > > > > > > > > > > > > > > I was wondering tho as a cheap workaround, would it make > > > > > > > > > sense to > > > > > > > > > check if we are hoisting across a "large" region of nested > > > > > > > > > loops, and > > > > > > > > > avoid in that case since hoisting may exert resource pressure > > > > > > > > > inside > > > > > > > > > loop region ? (Especially, in the cases where hoisted > > > > > > > > > expressions were > > > > > > > > > not originally AVAIL in any of the loop blocks, and the loop > > > > > > > > > region > > > > > > > > > doesn't benefit from hoisting). > > > > > > > > > > > > > > > > > > For instance: > > > > > > > > > FOR_EACH_EDGE (e, ei, block) > > > > > > > > > { > > > > > > > > > /* Avoid hoisting across more than 3 nested loops */ > > > > > > > > > if (e->dest is a loop pre-header or loop header > > > > > > > > > && nesting depth of loop is > 3) > > > > > > > > > return false; > > > > > > > > > } > > > > > > > > > > > > > > > > > > I think this would work for resolving the calculix issue > > > > > > > > > because it > > > > > > > > > hoists across one region of 6 loops and another of 4 loops > > > > > > > > > (didn' test > > > > > > > > > yet). > > > > > > > > > It's not bulletproof in that it will miss detecting cases > > > > > > > > > where loop > > > > > > > > > header (or pre-header) isn't a successor of candidate block > > > > > > > > > (checking > > > > > > > > > for > > > > > > > > > that might get expensive tho?). I will test it on gcc suite > > > > > > > > > and SPEC > > > > > > > > > for any regressions. > > > > > > > > > Does this sound like a reasonable heuristic ? > > > > > > > > Hi, > > > > > > > > The attached patch implements the above heuristic. > > > > > > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions. > > > > > > > > And it brings back most of performance for calculix on par with > > > > > > > > O2 > > > > > > > > (without inlining orthonl). 
> > > > > > > > I verified that with patch there is no cache miss happening on > > > > > > > > load > > > > > > > > insn inside loop > > > > > > > > (with perf report -e L1-dcache-load-misses/period=1000000/) > > > > > > > > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for > > > > > > > > SPEC for > > > > > > > > speed and will report numbers > > > > > > > > in couple of days. (If required, we could parametrize number of > > > > > > > > nested > > > > > > > > loops, hardcoded (arbitrarily to) 3 in this patch, > > > > > > > > and set it in backend to not affect other targets). > > > > > > > > > > > > > > I don't think this patch captures the case in a sensible way - it > > > > > > > will simply > > > > > > > never hoist computations out of loop header blocks with depth > 3 > > > > > > > which > > > > > > > is certainly not what you want. Also the pre-header check is odd > > > > > > > - we're > > > > > > > looking for computations in successors of BLOCK but clearly a > > > > > > > computation > > > > > > > in a pre-header is not at the same loop level as one in the > > > > > > > header itself. > > > > > > Well, my intent was to check if we are hoisting across a region, > > > > > > which has multiple nested loops, and in that case, avoid hoisting > > > > > > expressions > > > > > > that do not belong to any loop blocks, because that may increase > > > > > > resource pressure > > > > > > inside loops. For instance, in the calculix issue we hoist 'w' array > > > > > > from post-dom and neither > > > > > > loop region has any uses of 'w'. I agree checking just for loop > > > > > > level > > > > > > is too coarse. > > > > > > The check with pre-header was essentially the same to see if we are > > > > > > hoisting across a loop region, > > > > > > not necessarily from within the loops. > > > > > > > > > > But it will fail to hoist *p in > > > > > > > > > > if (x) > > > > > { > > > > > a = *p; > > > > > } > > > > > else > > > > > { > > > > > b = *p; > > > > > } > > > > > > > > > > <make distance large> > > > > > pdom: > > > > > c = *p; > > > > > > > > > > so it isn't what matters either. What happens at the immediate > > > > > post-dominator > > > > > isn't directly relevant - what matters would be if the pdom is the > > > > > one making > > > > > the value antic on one of the outgoing edges. In that case we're > > > > > also going > > > > > to PRE *p into the arm not computing *p (albeit in a later position). > > > > > But > > > > > that property is impossible to compute from the sets itself (not even > > > > > mentioning > > > > > the arbitrary CFG that can be inbetween the block and its pdom or the > > > > > weird > > > > > pdoms we compute for regions not having a path to exit, like infinite > > > > > loops). > > > > > > > > > > You could eventually look at the pdom predecessors and if *p is not > > > > > AVAIL_OUT > > > > > in each of them we _might_ have the situation you want to protect > > > > > against. > > > > > But as said PRE insertion will likely have made sure it _is_ > > > > > AVAIL_OUT in each > > > > > of them ... > > > > Hi Richard, > > > > Thanks for the suggestions. Right, the issue seems to be here that > > > > post-dom block is making expressions ANTIC. Before doing insert, could > > > > we simply copy AVAIL_OUT sets of each block into another set say > > > > ORIG_AVAIL_OUT, > > > > as a guard against PRE eventually inserting expressions in pred blocks > > > > of pdom and making them available? 
> > > > And during hoisting, we could check if each expr that is ANTIC_IN > > > > (pdom) is ORIG_AVAIL_OUT in each pred of pdom, > > > > if the distance is "large". > > > > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is > > > very large (it's actually quadratic in size of the CFG * # values), in > > > particular > > > we're inserting in RPO and update AVAIL_OUT only up to the current block > > > (from dominators) so the PDOM block should have the original AVAIL_OUT > > > (but from the last iteration - we do iterate insertion). > > > > > > Note I'm still not happy with pulling off this kind of heuristics. > > > What the suggestion > > > means is that for > > > > > > if (x) > > > y = *p; > > > z = *p; > > > > > > we'll end up with > > > > > > if (x) > > > y = *p; > > > else > > > z = *p; > > > > > > instead of > > > > > > tem = *p; > > > if (x) > > > y = tem; > > > else > > > z = tem; > > > > > > that is, we get the runtime benefit but not the code-size one > > > (hoisting mostly helps code size plus allows if-conversion as followup > > > in some cases). Now, if we iterate (like if we'd have a second hoisting > > > pass) > > > then the above would still cause hoisting - so the heuristic isn't > > > sustainable. > > > Again, sth like "distance" isn't really available. > > Hi Richard, > > It doesn't work without copying AVAIL_OUT. > > I tried for small example with attached patch: > > > > int foo(int cond, int x, int y) > > { > > int t; > > void f(int); > > > > if (cond) > > t = x + y; > > else > > t = x - y; > > > > f (t); > > int t2 = (x + y) * 10; > > return t2; > > } > > > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom, > > it does not hoist in first pass, but then PRE inserts x + y in the "else > > block", > > and we eventually hoist before if (cond). Similarly for e_c3d > > hoistings in calculix. > > > > IIUC, we want hoisting to be: > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block) > > to ensure that hoisted expressions are along all paths from block to > > post-dom ? > > If copying AVAIL_OUT sets is too large, could we keep another set that > > precomputes intersection of AVAIL_OUT of pdom preds > > for each block and then use this info during hoisting ? > > > > For computing "distance", I implemented a simple DFS walk from block > > to post-dom, that gives up if depth crosses > > threshold before reaching post-dom. I am not sure tho, how expensive > > that can get. > > As written it is quadratic in the CFG size. > > You can optimize away the > > + FOR_EACH_EDGE (e, ei, pdom_bb->preds) > + bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values); > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty. > > As said, I don't think this is the way to go - trying to avoid code > hoisting isn't > what we'd want to do - your quoted assembly instead points to a loop > with a non-empty latch which is usually caused by PRE and avoided with -O3 > because it also harms vectorization. But disabling PRE (which removes non empty latch), only results in marginal performance improvement, while disabling hoisting of 'w' array, with non-empty latch, seems to gain most of the performance. AFAIU, that was happening, because after disabling hoisting of 'w', there wasn't a cache miss (as seen with perf -e L1-dcache-load-misses), on the load instruction inside the inner loop.
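For reference, the depth-bounded walk I mentioned above (the DFS from the candidate block towards its post-dominator that gives up once the depth crosses a threshold) has roughly the following shape. This is only a standalone toy sketch with made-up types and names (toy_bb, pdom_within_depth, MAX_HOIST_DEPTH), not the code from the attached patch and not GCC's basic_block/edge API:

#include <stdbool.h>
#include <stddef.h>

#define MAX_HOIST_DEPTH 16   /* arbitrary threshold; would likely be a param */

struct toy_bb
{
  struct toy_bb **succs;     /* successor blocks */
  size_t n_succs;
};

/* Return true if PDOM is reached from BLOCK within DEPTH edges on every
   explored path; give up (return false) once the budget runs out, i.e.
   treat the post-dominator as "too far" for the heuristic.
   NB: no visited set, so blocks can be re-visited; the depth cap keeps the
   walk finite, but it is still potentially expensive per candidate block,
   which is the complexity concern raised above.  */
static bool
pdom_within_depth_1 (struct toy_bb *block, struct toy_bb *pdom, int depth)
{
  if (block == pdom)
    return true;
  if (depth == 0)
    return false;
  if (block->n_succs == 0)
    return false;            /* reached exit without seeing PDOM */
  for (size_t i = 0; i < block->n_succs; i++)
    if (!pdom_within_depth_1 (block->succs[i], pdom, depth - 1))
      return false;
  return true;
}

static bool
pdom_within_depth (struct toy_bb *block, struct toy_bb *pdom)
{
  return pdom_within_depth_1 (block, pdom, MAX_HOIST_DEPTH);
}

Capping the depth keeps the walk from blowing up on large CFGs, but as you point out, tracking distance while computing the ANTIC sets would avoid walking the CFG per candidate altogether.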
For the pdom heuristic, I guess we cannot copy the AVAIL_OUT sets per node, since that's quadratic in the size of the CFG.

Would it make sense to break the interaction between PRE and hoisting only for the case of inserting into the preds of the pdom? I tried doing that in the attached patch, where insert runs in two phases:
(a) PRE and hoisting, where hoisting marks blocks for which PRE insertion should be skipped.
(b) A second phase, which runs only PRE, on all blocks.
This (expectedly) regresses ssa-hoist-3.c.

If that heuristic isn't acceptable, I suppose encoding the distance of each expr within the ANTIC sets during compute_antic would be the right way to fix this? So ANTIC_IN (block) would contain the anticipated expressions and, for each antic expr, the "distance" to the furthest block it is computed in? Could you please elaborate a bit on how we could go about encoding distance during compute_antic? (A rough sketch of what I have in mind follows the quoted text below.)

Thanks,
Prathamesh

> > Richard. > > > Thanks, > > Prathamesh > > > > > > Richard. > > > > > > > Thanks, > > > > Prathamesh > > > > > > > > > > > > > > > > > > > > > > > Note the difficulty to capture "distance" is that the distance is > > > > > > > simply not > > > > > > > available at this point - it is the anticipated values from the > > > > > > > successors > > > > > > > that do _not_ compute the value itself that are the issue. To > > > > > > > capture > > > > > > > "distance" you'd need to somehow "age" anticipated value when > > > > > > > propagating them upwards during compute_antic (where it then > > > > > > > would also apply to PRE in general). > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an > > > > > > expression in the block, > > > > > > if it's ANTIC in post-dom block as a trade-off between extending > > > > > > live > > > > > > range and hoisting > > > > > > if the "distance" between block and post-dom is "too far" ? In > > > > > > general, as you point out, we'd need to compute, > > > > > > distance info for successors block during compute_antic, but special > > > > > > casing for post-dom should be easy enough > > > > > > during do_hoist_insertion, and hoisting an expr that is ANTIC in > > > > > > post-dom could be potentially "long range", if the region is large. > > > > > > It's still a coarse heuristic tho. I tried it in the attached patch. > > > > > > > > > > > > Thanks, > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > > > > > > > > > As with all other heuristics the only place one could do hackish > > > > > > > attempts > > > > > > > with at least some reasoning is the elimination phase where > > > > > > > we make use of the (hoist) insertions - of course for hoisting we > > > > > > > already > > > > > > > know we'll get the "close" use in one of the successors so I fear > > > > > > > even > > > > > > > there it will be impossible to do something sensible. > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > Thanks, > > > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Prathamesh > > > > > > > > > > > > > > > > > > > > > > Alexander
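To make the compute_antic question above concrete, here is the rough "aging" shape I imagine: each anticipated expression carries a distance counter that grows every time the value is propagated upward into a predecessor's ANTIC_IN, and insertion refuses "long range" candidates. This is purely a toy standalone model with made-up names (antic_set, propagate_antic, HOIST_DISTANCE_THRESHOLD), not the real bitmap_set_t machinery in tree-ssa-pre.c:

#include <limits.h>
#include <stddef.h>

#define MAX_EXPRS 1024
#define HOIST_DISTANCE_THRESHOLD 16   /* arbitrary cut-off */

/* dist[i] == INT_MAX means expression i is not anticipated here;
   otherwise dist[i] is the number of blocks to the nearest point
   where i is actually computed.  */
struct antic_set
{
  int dist[MAX_EXPRS];
};

/* Compute BLOCK's anticipated set from one successor: locally computed
   expressions get distance 0, values anticipated in the successor are
   one block further away when seen from BLOCK.  With several successors
   the sets would be intersected and the distances merged, e.g. by taking
   the maximum ("furthest block it is computed in").  */
static void
propagate_antic (struct antic_set *block_antic,
                 const struct antic_set *succ_antic,
                 const int *computed_locally /* 0/1 per expression id */)
{
  for (size_t i = 0; i < MAX_EXPRS; i++)
    {
      if (computed_locally[i])
        block_antic->dist[i] = 0;
      else if (succ_antic->dist[i] != INT_MAX)
        block_antic->dist[i] = succ_antic->dist[i] + 1;
      else
        block_antic->dist[i] = INT_MAX;
    }
}

/* Hoist insertion could then skip candidates whose recorded distance
   exceeds the threshold.  */
static int
too_far_to_hoist (const struct antic_set *block_antic, size_t expr_id)
{
  return block_antic->dist[expr_id] > HOIST_DISTANCE_THRESHOLD;
}

The merge rule at blocks with multiple successors (and what threshold to use) is exactly the part I'm unsure about, hence the question above.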
gnu-659-pdom-5.diff