On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
>
> On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guent...@gmail.com> wrote:
> >
> > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > >
> > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guent...@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > >
> > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > >
> > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -O2:
> > > > > > > > > > > > >
> > > > > > > > > > > > >     7856832.692380   task-clock (msec)   #  1.000 CPUs utilized
> > > > > > > > > > > > >               3758   context-switches    #  0.000 K/sec
> > > > > > > > > > > > >                 40   cpu-migrations      #  0.000 K/sec
> > > > > > > > > > > > >              40847   page-faults         #  0.005 K/sec
> > > > > > > > > > > > >      7856782413676   cycles              #  1.000 GHz
> > > > > > > > > > > > >      6034510093417   instructions        #  0.77  insn per cycle
> > > > > > > > > > > > >       363937274287   branches            #  46.321 M/sec
> > > > > > > > > > > > >        48557110132   branch-misses       #  13.34% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be enough for this kind of code)
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > >
> > > > > > > > > > > > >     8319643.114380   task-clock (msec)   #  1.000 CPUs utilized
> > > > > > > > > > > > >               4285   context-switches    #  0.001 K/sec
> > > > > > > > > > > > >                 28   cpu-migrations      #  0.000 K/sec
> > > > > > > > > > > > >              40843   page-faults         #  0.005 K/sec
> > > > > > > > > > > > >      8319591038295   cycles              #  1.000 GHz
> > > > > > > > > > > > >      6276338800377   instructions        #  0.75  insn per cycle
> > > > > > > > > > > > >       467400726106   branches            #  56.180 M/sec
> > > > > > > > > > > > >        45986364011   branch-misses       #  9.84% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying that extra instructions are appearing in this loop nest, but not in the innermost loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > >
> > > > > > > > > > > > >     8207331.088040   task-clock (msec)   #  1.000 CPUs utilized
> > > > > > > > > > > > >               2266   context-switches    #  0.000 K/sec
> > > > > > > > > > > > >                 32   cpu-migrations      #  0.000 K/sec
> > > > > > > > > > > > >              40846   page-faults         #  0.005 K/sec
> > > > > > > > > > > > >      8207292032467   cycles              #  1.000 GHz
> > > > > > > > > > > > >      6035724436440   instructions        #  0.74  insn per cycle
> > > > > > > > > > > > >       364415440156   branches            #  44.401 M/sec
> > > > > > > > > > > > >        53138327276   branch-misses       #  14.58% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE the loop nest may be carrying some dependencies over memory. I would simply check the assembly for the entire 6-level loop nest in question, I hope it's not very complicated (though Fortran array addressing...).
> > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > >
> > > > > > > > > > > > >     7797265.206850   task-clock (msec)   #  1.000 CPUs utilized
> > > > > > > > > > > > >               3139   context-switches    #  0.000 K/sec
> > > > > > > > > > > > >                 20   cpu-migrations      #  0.000 K/sec
> > > > > > > > > > > > >              40846   page-faults         #  0.005 K/sec
> > > > > > > > > > > > >      7797221351467   cycles              #  1.000 GHz
> > > > > > > > > > > > >      6187348757324   instructions        #  0.79  insn per cycle
> > > > > > > > > > > > >       461840800061   branches            #  59.231 M/sec
> > > > > > > > > > > > >        26920311761   branch-misses       #  5.83% of all branches
> > > > > > > > > > > >
> > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count. I don't think the former fully covers the latter (there's also a 90e9 reduction in insn count).
> > > > > > > > > > > >
> > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to consider how the profile for the entire loop nest looks like (it's 6 loops deep, each iterating only 3 times).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Perf profiles for -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > >
> > > > > > > > > > > > >        3196866 │1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > >   216348301800 │        add    w0, w0, #0x1
> > > > > > > > > > > > >         985098 │        add    x2, x2, #0x18
> > > > > > > > > > > > >   216215999206 │        add    x1, x1, #0x48
> > > > > > > > > > > > >   215630376504 │        fmul   d1, d5, d1
> > > > > > > > > > > > >   863829148015 │        fmul   d1, d1, d6
> > > > > > > > > > > > >   864228353526 │        fmul   d0, d1, d0
> > > > > > > > > > > > >   864568163014 │        fmadd  d2, d0, d16, d2
> > > > > > > > > > > > >                │        cmp    w0, #0x4
> > > > > > > > > > > > >   216125427594 │      ↓ b.eq   1f34
> > > > > > > > > > > > >       15010377 │        ldur   d0, [x2, #-8]
> > > > > > > > > > > > >   143753737468 │      ↑ b      1f04
> > > > > > > > > > > > >
> > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > >
> > > > > > > > > > > > >   359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > >   144055883055 │        add    w0, w0, #0x1
> > > > > > > > > > > > >    72262104254 │        add    x2, x2, #0x18
> > > > > > > > > > > > >   143991169721 │        add    x1, x1, #0x48
> > > > > > > > > > > > >   288648917780 │        fmul   d15, d17, d15
> > > > > > > > > > > > >   864665644756 │        fmul   d15, d15, d18
> > > > > > > > > > > > >   863868426387 │        fmul   d14, d15, d14
> > > > > > > > > > > > >   865228159813 │        fmadd  d16, d14, d31, d16
> > > > > > > > > > > > >         245967 │        cmp    w0, #0x4
> > > > > > > > > > > > >   215396760545 │      ↓ b.eq   1f28
> > > > > > > > > > > > >      704732365 │        ldur   d14, [x2, #-8]
> > > > > > > > > > > > >   143775979620 │      ↑ b      1ef8
> > > > > > > > > > > >
> > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > > > > > > >
> > > > > > > > > > > > High count on the initial ldur instruction could be explained if the loop is not entered by "fallthru" from the preceding block, or if its backedge is mispredicted. Sampling mispredictions should be possible with perf record, and you may be able to check if loop entry is fallthrough by inspecting assembly.
> > > > > > > > > > > >
> > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with -falign-loops=32.
> > > > > > > > > > > Hi,
> > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > > >
> > > > > > > > > > > The hoisting region is:
> > > > > > > > > > >
> > > > > > > > > > > if(mattyp.eq.1) then
> > > > > > > > > > >    4 loops
> > > > > > > > > > > elseif(mattyp.eq.2) then
> > > > > > > > > > > {
> > > > > > > > > > >    orthonl inlined into basic block;
> > > > > > > > > > >    loads w[0] .. w[8]
> > > > > > > > > > > }
> > > > > > > > > > > else
> > > > > > > > > > >    6 loops  // load anisox
> > > > > > > > > > >
> > > > > > > > > > > followed by basic block:
> > > > > > > > > > >
> > > > > > > > > > >  senergy=
> > > > > > > > > > > &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > >  s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > >  s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > >  s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > >
> > > > > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks right into block 181, which is:
> > > > > > > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > >
> > > > > > > > > > > which is then further hoisted to block 173:
> > > > > > > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > >
> > > > > > > > > > > From block 181, we have two paths towards the senergy block (bb 194):
> > > > > > > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > AND
> > > > > > > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > which has a path length of around 18 blocks.
> > > > > > > > > > > (bb 194 post-dominates bb 181 and bb 173.)
> > > > > > > > > > >
> > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181 (simply avoid inserting a pre_expr if pre_expr->kind == REFERENCE) avoids hoisting of the 'w' array and brings back most of the performance, which verifies that it is hoisting of the 'w' array (w[0] ... w[8]) that is causing the slowdown?
> > > > > > > > > > >
> > > > > > > > > > > I obtained perf profiles for full hoisting, and for hoisting of the 'w' array disabled for the 6 loops, and the most drastic difference was for the ldur instruction:
> > > > > > > > > > >
> > > > > > > > > > > With full hoisting:
> > > > > > > > > > >   359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > >
> > > > > > > > > > > Without full hoisting:
> > > > > > > > > > >        3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > >
> > > > > > > > > > > (The loop entry seems to be fall-thru in both cases. I have attached profiles for both cases.)
> > > > > > > > > > >
> > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the anisox array, which makes me wonder if the issue was a data-cache miss in the slower version. I ran perf script on perf data for L1-dcache-load-misses with period = 1 million, and it reported two cache misses on the ldur instruction in the full hoisting case, while it reported zero for the disabled load hoisting case.
> > > > > > > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array possibly results in eviction of anisox, thus causing a cache miss inside the inner loop and making the load slower?
> > > > > > > > > > >
> > > > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho. For the disabled hoisting of 'w' array case, there were a total of 463 cache misses, while with full hoisting there were 357 cache misses (with period = 1 million). Does that happen because hoisting probably reduces cache misses along the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > Hi,
> > > > > > > > > > In general I feel for this or the PR80155 case, the issues come with long range hoistings inside a large CFG, since we don't have an accurate way to model target resources (register pressure in the PR80155 case / or possibly cache pressure in this case?) at tree level and we end up with register spills or cache misses inside loops, which may offset the benefit of hoisting. As previously discussed the right way to go is a live range splitting pass, at the GIMPLE -> RTL border, which can also help with other code-movement optimizations (or if the source had variables with long live ranges).
> > > > > > > > > >
> > > > > > > > > > I was wondering tho, as a cheap workaround, would it make sense to check if we are hoisting across a "large" region of nested loops, and avoid it in that case, since hoisting may exert resource pressure inside the loop region? (Especially in the cases where the hoisted expressions were not originally AVAIL in any of the loop blocks, and the loop region doesn't benefit from hoisting.)
> > > > > > > > > >
> > > > > > > > > > For instance:
> > > > > > > > > >
> > > > > > > > > > FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > >   {
> > > > > > > > > >     /* Avoid hoisting across more than 3 nested loops */
> > > > > > > > > >     if (e->dest is a loop pre-header or loop header
> > > > > > > > > >         && nesting depth of loop is > 3)
> > > > > > > > > >       return false;
> > > > > > > > > >   }
> > > > > > > > > >
> > > > > > > > > > I think this would work for resolving the calculix issue because it hoists across one region of 6 loops and another of 4 loops (didn't test yet).
> > > > > > > > > > It's not bulletproof in that it will miss detecting cases where the loop header (or pre-header) isn't a successor of the candidate block (checking for that might get expensive tho?). I will test it on the gcc suite and SPEC for any regressions.
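(For concreteness only: the check sketched above might be spelled roughly like this with the real loop API from cfgloop.h - loop_depth, bb_loop_header_p, single_succ - as an early-out in a do_hoist_insertion-style predicate. The placement and the hard-coded depth cut-off are assumptions for illustration, not the attached patch.)

  /* Hypothetical illustration: refuse to hoist into BLOCK when one of its
     successor edges enters a loop nested more than three deep.  Assumes
     loop info is available in the pass.  */
  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, block->succs)
    {
      basic_block dest = e->dest;
      /* Treat a block whose single successor is a loop header as a
         pre-header.  */
      if (single_succ_p (dest) && bb_loop_header_p (single_succ (dest)))
        dest = single_succ (dest);
      if (bb_loop_header_p (dest) && loop_depth (dest->loop_father) > 3)
        return false;
    }
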
> > > > > > > > > > Does this sound like a reasonable heuristic?
> > > > > > > > > Hi,
> > > > > > > > > The attached patch implements the above heuristic. Bootstrapped + tested on x86_64-linux-gnu with no regressions. And it brings back most of the performance for calculix, on par with O2 (without inlining orthonl). I verified that with the patch there is no cache miss happening on the load insn inside the loop (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > >
> > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for speed and will report numbers in a couple of days. (If required, we could parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in this patch, and set it in the backend to not affect other targets.)
> > > > > > > >
> > > > > > > > I don't think this patch captures the case in a sensible way - it will simply never hoist computations out of loop header blocks with depth > 3 which is certainly not what you want. Also the pre-header check is odd - we're looking for computations in successors of BLOCK but clearly a computation in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > Well, my intent was to check if we are hoisting across a region which has multiple nested loops, and in that case avoid hoisting expressions that do not belong to any loop blocks, because that may increase resource pressure inside the loops. For instance, in the calculix issue we hoist the 'w' array from the post-dom and neither loop region has any uses of 'w'. I agree checking just for loop level is too coarse.
> > > > > > > The check with the pre-header was essentially the same, to see if we are hoisting across a loop region, not necessarily from within the loops.
> > > > > >
> > > > > > But it will fail to hoist *p in
> > > > > >
> > > > > >   if (x)
> > > > > >     {
> > > > > >       a = *p;
> > > > > >     }
> > > > > >   else
> > > > > >     {
> > > > > >       b = *p;
> > > > > >     }
> > > > > >
> > > > > >   <make distance large>
> > > > > >   pdom:
> > > > > >     c = *p;
> > > > > >
> > > > > > so it isn't what matters either. What happens at the immediate post-dominator isn't directly relevant - what matters would be if the pdom is the one making the value antic on one of the outgoing edges. In that case we're also going to PRE *p into the arm not computing *p (albeit in a later position). But that property is impossible to compute from the sets itself (not even mentioning the arbitrary CFG that can be inbetween the block and its pdom or the weird pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > >
> > > > > > You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT in each of them we _might_ have the situation you want to protect against.
> > > > > > But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each of them ...
> > > > > Hi Richard,
> > > > > Thanks for the suggestions. Right, the issue seems to be here that post-dom block is making expressions ANTIC. Before doing insert, could we simply copy AVAIL_OUT sets of each block into another set say ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions in pred blocks of pdom and making them available? And during hoisting, we could check if each expr that is ANTIC_IN (pdom) is ORIG_AVAIL_OUT in each pred of pdom, if the distance is "large".
> > > >
> > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is very large (it's actually quadratic in size of the CFG * # values), in particular we're inserting in RPO and update AVAIL_OUT only up to the current block (from dominators) so the PDOM block should have the original AVAIL_OUT (but from the last iteration - we do iterate insertion).
> > > >
> > > > Note I'm still not happy with pulling off this kind of heuristics. What the suggestion means is that for
> > > >
> > > >   if (x)
> > > >     y = *p;
> > > >   z = *p;
> > > >
> > > > we'll end up with
> > > >
> > > >   if (x)
> > > >     y = *p;
> > > >   else
> > > >     z = *p;
> > > >
> > > > instead of
> > > >
> > > >   tem = *p;
> > > >   if (x)
> > > >     y = tem;
> > > >   else
> > > >     z = tem;
> > > >
> > > > that is, we get the runtime benefit but not the code-size one (hoisting mostly helps code size plus allows if-conversion as followup in some cases). Now, if we iterate (like if we'd have a second hoisting pass) then the above would still cause hoisting - so the heuristic isn't sustainable. Again, sth like "distance" isn't really available.
> > > Hi Richard,
> > > It doesn't work without copying AVAIL_OUT. I tried for small example with attached patch:
> > >
> > > int foo(int cond, int x, int y)
> > > {
> > >   int t;
> > >   void f(int);
> > >
> > >   if (cond)
> > >     t = x + y;
> > >   else
> > >     t = x - y;
> > >
> > >   f (t);
> > >   int t2 = (x + y) * 10;
> > >   return t2;
> > > }
> > >
> > > By intersecting availout_in_some with AVAIL_OUT of preds of pdom, it does not hoist in first pass, but then PRE inserts x + y in the "else block", and we eventually hoist before if (cond). Similarly for e_c3d hoistings in calculix.
> > >
> > > IIUC, we want hoisting to be:
> > > (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > to ensure that hoisted expressions are along all paths from block to post-dom?
> > > If copying AVAIL_OUT sets is too large, could we keep another set that precomputes intersection of AVAIL_OUT of pdom preds for each block and then use this info during hoisting?
> > >
> > > For computing "distance", I implemented a simple DFS walk from block to post-dom, that gives up if depth crosses threshold before reaching post-dom. I am not sure tho, how expensive that can get.
> >
> > As written it is quadratic in the CFG size.
> >
> > You can optimize away the
> >
> > +  FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > +    bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> >
> > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
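(For reference, one reading of that short-circuit, reusing the patch's pdom_bb / availout_in_some locals and the bitmap_intersect_p predicate from bitmap.h - an illustrative sketch of the idea, not the actual patch:)

  /* Only restrict the hoisting candidates by the pdom predecessors'
     AVAIL_OUT when at least one candidate value is also anticipated in
     the pdom block; otherwise skip the predecessor walk entirely.  */
  if (bitmap_intersect_p (&availout_in_some, &ANTIC_IN (pdom_bb)->values))
    {
      edge e;
      edge_iterator ei;
      FOR_EACH_EDGE (e, ei, pdom_bb->preds)
        bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
    }
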
> >
> > As said, I don't think this is the way to go - trying to avoid code hoisting isn't what we'd want to do - your quoted assembly instead points to a loop with a non-empty latch, which is usually caused by PRE and avoided with -O3 because it also harms vectorization.
> But disabling PRE (which removes the non-empty latch) only results in a marginal performance improvement, while disabling hoisting of the 'w' array, with the non-empty latch still present, seems to gain most of the performance.
> AFAIU, that was happening because after disabling hoisting of 'w' there wasn't a cache miss (as seen with perf -e L1-dcache-load-misses) on the load instruction inside the inner loop.
But that doesn't make much sense then. If code generation isn't an issue I don't see how the hoisted loads should cause an L1 dcache load miss for data that is accessed in the respective loop as well (though not hoisted from it, since at -O2 it is not sufficiently unrolled).

> For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per node, since that's quadratic in terms of CFG size.
> Would it make sense to break the interaction between PRE and hoisting, only for the case when inserting into the preds of the pdom? I tried doing that in the attached patch, where insert runs in two phases:
> (a) PRE and hoisting, where hoisting marks blocks to not do PRE for.
> (b) A second phase, which only runs PRE on all blocks.
> This (expectedly) regresses ssa-hoist-3.c.
>
> If the heuristic isn't acceptable, I suppose encoding the distance of an expr within the ANTIC sets during compute_antic would be the right way to fix this? So ANTIC_IN (block) contains the anticipated expressions, and for each antic expr, the "distance" from the furthest block it's computed in? Could you please elaborate a bit on how we could go about encoding distance during compute_antic?

But the distance in this case is just one CFG node ... we have

  if (mattyp.eq.1) then
    ... use of w but not with constant indices
  else if (mattyp.eq.2)
    .. inlined orthonl with constant index w() accesses, single BB
  else
    ... use of w but not with constant indices - the actual relevant loop of calculix
  endif
  ... constant index w() accesses, single BB

so the CFG distance is one node - unless you want to compute the maximum distance?

Btw, I only see 9 loads hoisted. I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark. And indeed this case is exactly one where hoisting is superior to PRE, which would happily insert the 9 loads into the two variable-access predecessors to get rid of the redundancy wrt the mattyp.eq.1 path. In .optimized I see

  pretmp_5573 = w[0];
  pretmp_5574 = w[3];
  pretmp_5575 = w[6];
  pretmp_5576 = w[1];
  pretmp_5577 = w[4];
  pretmp_5578 = w[7];
  pretmp_5579 = w[2];
  pretmp_5580 = w[5];
  pretmp_5581 = w[8];
  if (mattyp.157_742 == 1)

I do remember talks/patches about ordering such sequences of loads to make them prefetch-happier. Are the loads actually emitted in order for arm, thus w[0]...w[8], rather than with some random permutes in between as seen above? On x86 they are emitted in random order (they are also spilled immediately).

Richard.

> Thanks,
> Prathamesh
> >
> Thanks,
> Prathamesh
> >
> > Richard.
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > > > >
> > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not available at this point - it is the anticipated values from the successors that do _not_ compute the value itself that are the issue. To capture "distance" you'd need to somehow "age" anticipated values when propagating them upwards during compute_antic (where it then would also apply to PRE in general).
> > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an expression in the block, if it's ANTIC in the post-dom block, as a trade-off between extending live range and hoisting, if the "distance" between block and post-dom is "too far"? In general, as you point out, we'd need to compute distance info for successor blocks during compute_antic, but special casing for the post-dom should be easy enough during do_hoist_insertion, and hoisting an expr that is ANTIC in the post-dom could be potentially "long range", if the region is large. It's still a coarse heuristic tho. I tried it in the attached patch.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > >
> > > > > > > > As with all other heuristics the only place one could do hackish attempts with at least some reasoning is the elimination phase where we make use of the (hoist) insertions - of course for hoisting we already know we'll get the "close" use in one of the successors so I fear even there it will be impossible to do something sensible.
> > > > > > > >
> > > > > > > > Richard.
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > >
> > > > > > > > > > > > Alexander