On Thu, 24 Sep 2020 at 16:44, Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> >
> > On Wed, 23 Sep 2020 at 16:40, Richard Biener <richard.guent...@gmail.com> wrote:
> > >
> > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guent...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     7856832.692380      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > >               3758      context-switches    #    0.000 K/sec
> > > > > > > > > > > > > >                 40      cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > >              40847      page-faults         #    0.005 K/sec
> > > > > > > > > > > > > >      7856782413676      cycles              #    1.000 GHz
> > > > > > > > > > > > > >      6034510093417      instructions        #    0.77  insn per cycle
> > > > > > > > > > > > > >       363937274287      branches            #   46.321 M/sec
> > > > > > > > > > > > > >        48557110132      branch-misses       #   13.34% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be enough for this kind of code)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     8319643.114380      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > >               4285      context-switches    #    0.001 K/sec
> > > > > > > > > > > > > >                 28      cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > >              40843      page-faults         #    0.005 K/sec
> > > > > > > > > > > > > >      8319591038295      cycles              #    1.000 GHz
> > > > > > > > > > > > > >      6276338800377      instructions        #    0.75  insn per cycle
> > > > > > > > > > > > > >       467400726106      branches            #   56.180 M/sec
> > > > > > > > > > > > > >        45986364011      branch-misses       #    9.84% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying that extra instructions are appearing in this loop nest, but not in the innermost loop. As a reminder for others, the innermost loop has only 3 iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > >               2266      context-switches    #    0.000 K/sec
> > > > > > > > > > > > > >                 32      cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > >              40846      page-faults         #    0.005 K/sec
> > > > > > > > > > > > > >      8207292032467      cycles              #    1.000 GHz
> > > > > > > > > > > > > >      6035724436440      instructions        #    0.74  insn per cycle
> > > > > > > > > > > > > >       364415440156      branches            #   44.401 M/sec
> > > > > > > > > > > > > >        53138327276      branch-misses       #   14.58% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > This seems to match baseline in terms of instruction count, but without PRE the loop nest may be carrying some dependencies over memory. I would simply check the assembly for the entire 6-level loop nest in question, I hope it's not very complicated (though Fortran array addressing...).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     7797265.206850      task-clock (msec)   #    1.000 CPUs utilized
> > > > > > > > > > > > > >               3139      context-switches    #    0.000 K/sec
> > > > > > > > > > > > > >                 20      cpu-migrations      #    0.000 K/sec
> > > > > > > > > > > > > >              40846      page-faults         #    0.005 K/sec
> > > > > > > > > > > > > >      7797221351467      cycles              #    1.000 GHz
> > > > > > > > > > > > > >      6187348757324      instructions        #    0.79  insn per cycle
> > > > > > > > > > > > > >       461840800061      branches            #   59.231 M/sec
> > > > > > > > > > > > > >        26920311761      branch-misses       #    5.83% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count. I don't think the former fully covers the latter (there's also a 90e9 reduction in insn count).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to consider what the profile for the entire loop nest looks like (it's 6 loops deep, each iterating only 3 times).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Perf profiles for -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >        3196866 │1f04:   ldur   d1, [x1, #-248]
> > > > > > > > > > > > > >   216348301800 │        add    w0, w0, #0x1
> > > > > > > > > > > > > >         985098 │        add    x2, x2, #0x18
> > > > > > > > > > > > > >   216215999206 │        add    x1, x1, #0x48
> > > > > > > > > > > > > >   215630376504 │        fmul   d1, d5, d1
> > > > > > > > > > > > > >   863829148015 │        fmul   d1, d1, d6
> > > > > > > > > > > > > >   864228353526 │        fmul   d0, d1, d0
> > > > > > > > > > > > > >   864568163014 │        fmadd  d2, d0, d16, d2
> > > > > > > > > > > > > >                │        cmp    w0, #0x4
> > > > > > > > > > > > > >   216125427594 │      ↓ b.eq   1f34
> > > > > > > > > > > > > >       15010377 │        ldur   d0, [x2, #-8]
> > > > > > > > > > > > > >   143753737468 │      ↑ b      1f04
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with inlined orthonl:
> > > > > > > > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > > > >   144055883055 │        add    w0, w0, #0x1
> > > > > > > > > > > > > >    72262104254 │        add    x2, x2, #0x18
> > > > > > > > > > > > > >   143991169721 │        add    x1, x1, #0x48
> > > > > > > > > > > > > >   288648917780 │        fmul   d15, d17, d15
> > > > > > > > > > > > > >   864665644756 │        fmul   d15, d15, d18
> > > > > > > > > > > > > >   863868426387 │        fmul   d14, d15, d14
> > > > > > > > > > > > > >   865228159813 │        fmadd  d16, d14, d31, d16
> > > > > > > > > > > > > >         245967 │        cmp    w0, #0x4
> > > > > > > > > > > > > >   215396760545 │      ↓ b.eq   1f28
> > > > > > > > > > > > > >      704732365 │        ldur   d14, [x2, #-8]
> > > > > > > > > > > > > >   143775979620 │      ↑ b      1ef8
> > > > > > > > > > > > >
> > > > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time. A high count on the initial ldur instruction could be explained if the loop is not entered by "fallthru" from the preceding block, or if its backedge is mispredicted. Sampling mispredictions should be possible with perf record, and you may be able to check if loop entry is fallthrough by inspecting assembly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It may also be possible to check if code alignment matters, by compiling with -falign-loops=32.
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > > > > > > > >
> > > > > > > > > > > > The hoisting region is:
> > > > > > > > > > > >
> > > > > > > > > > > >   if(mattyp.eq.1) then
> > > > > > > > > > > >      4 loops
> > > > > > > > > > > >   elseif(mattyp.eq.2) then
> > > > > > > > > > > >      {
> > > > > > > > > > > >        orthonl inlined into basic block;
> > > > > > > > > > > >        loads w[0] .. w[8]
> > > > > > > > > > > >      }
> > > > > > > > > > > >   else
> > > > > > > > > > > >      6 loops  // load anisox
> > > > > > > > > > > >
> > > > > > > > > > > > followed by the basic block:
> > > > > > > > > > > >
> > > > > > > > > > > >   senergy=
> > > > > > > > > > > >  &         (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > > > > > > >  &         +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > > > > > > >  &         +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > > > > > > >   s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > > > > > > >   s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > > > > > > >   s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks right into block 181, which is:
> > > > > > > > > > > >   if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > > > > > > > > and these are then further hoisted to block 173:
> > > > > > > > > > > >   if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > > > > > > > >
> > > > > > > > > > > > From block 181, we have two paths towards the senergy block (bb 194):
> > > > > > > > > > > >   bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > > > > > > > AND
> > > > > > > > > > > >   bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > > > > > > > which has a path length of around 18 blocks. (bb 194 post-dominates bb 181 and bb 173.)
> > > > > > > > > > > >
> > > > > > > > > > > > Disabling only load hoisting within blocks 173 and 181 (by simply not inserting a pre_expr if pre_expr->kind == REFERENCE) avoids hoisting of the 'w' array and brings back most of the performance, which verifies that it is the hoisting of the 'w' array (w[0] ... w[8]) that causes the slowdown.
> > > > > > > > > > > >
> > > > > > > > > > > > I obtained perf profiles with full hoisting, and with hoisting of the 'w' array disabled for the 6 loops, and the most drastic difference was for the ldur instruction:
> > > > > > > > > > > >
> > > > > > > > > > > > With full hoisting:
> > > > > > > > > > > >   359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > Without full hoisting:
> > > > > > > > > > > >        3441224 │1edc:   ldur   d1, [x1, #-248]
> > > > > > > > > > > >
> > > > > > > > > > > > (The loop entry seems to be fall-through in both cases. I have attached profiles for both cases.)
> > > > > > > > > > > >
> > > > > > > > > > > > IIUC, the instruction seems to be loading the first element from the anisox array, which makes me wonder if the issue is a data-cache miss in the slower version.
> > > > > > > > > > > > I ran perf script on the perf data for L1-dcache-load-misses with period = 1 million, and it reported two cache misses on the ldur instruction in the full-hoisting case, while it reported zero for the disabled-load-hoisting case. So I wonder if the slowdown happens because hoisting of the 'w' array possibly results in eviction of anisox, thus causing a cache miss inside the inner loop and making the load slower?
> > > > > > > > > > > >
> > > > > > > > > > > > Hoisting also seems to reduce the overall number of cache misses, though. With hoisting of the 'w' array disabled there were a total of 463 cache misses, while with full hoisting there were 357 cache misses (with period = 1 million). Does that happen because hoisting reduces cache misses along the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > > > > > > > Hi,
> > > > > > > > > > > In general I feel that for this case, or the PR80155 case, the issues come with long-range hoisting inside a large CFG, since we don't have an accurate way to model target resources (register pressure in the PR80155 case, or possibly cache pressure in this case?) at the tree level, and we end up with a register spill or a cache miss inside loops, which may offset the benefit of hoisting. As previously discussed, the right way to go is a live-range-splitting pass at the GIMPLE -> RTL border, which can also help with other code-movement optimizations (or when the source has variables with long live ranges).
> > > > > > > > > > >
> > > > > > > > > > > I was wondering, though, as a cheap workaround: would it make sense to check whether we are hoisting across a "large" region of nested loops, and avoid hoisting in that case, since it may exert resource pressure inside the loop region? (Especially in cases where the hoisted expressions were not originally AVAIL in any of the loop blocks, so the loop region doesn't benefit from the hoisting.)
> > > > > > > > > > >
> > > > > > > > > > > For instance:
> > > > > > > > > > >
> > > > > > > > > > >   FOR_EACH_EDGE (e, ei, block)
> > > > > > > > > > >     {
> > > > > > > > > > >       /* Avoid hoisting across more than 3 nested loops.  */
> > > > > > > > > > >       if (e->dest is a loop pre-header or loop header
> > > > > > > > > > >           && nesting depth of loop is > 3)
> > > > > > > > > > >         return false;
> > > > > > > > > > >     }
> > > > > > > > > > >
> > > > > > > > > > > I think this would work for resolving the calculix issue, because it hoists across one region of 6 loops and another of 4 loops (didn't test yet). It's not bulletproof in that it will miss detecting cases where the loop header (or pre-header) isn't a successor of the candidate block (checking for that might get expensive, though?). I will test it on the gcc testsuite and SPEC for any regressions. Does this sound like a reasonable heuristic?
> > > > > > > > > > Hi,
> > > > > > > > > > The attached patch implements the above heuristic. Bootstrapped + tested on x86_64-linux-gnu with no regressions. And it brings back most of the performance for calculix, on par with O2 (without inlining orthonl). I verified that with the patch there is no cache miss happening on the load insn inside the loop (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > > > > > > > >
> > > > > > > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for speed and will report numbers in a couple of days. (If required, we could parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in this patch, and set it in the backend so it does not affect other targets.)
> > > > > > > > >
> > > > > > > > > I don't think this patch captures the case in a sensible way - it will simply never hoist computations out of loop header blocks with depth > 3, which is certainly not what you want. Also the pre-header check is odd - we're looking for computations in successors of BLOCK, but clearly a computation in a pre-header is not at the same loop level as one in the header itself.
> > > > > > > > Well, my intent was to check if we are hoisting across a region which has multiple nested loops, and in that case avoid hoisting expressions that do not belong to any of the loop blocks, because that may increase resource pressure inside the loops. For instance, in the calculix issue we hoist the 'w' array from the post-dominator, and neither loop region has any uses of 'w'. I agree that checking just for loop level is too coarse. The check with the pre-header was essentially the same: to see if we are hoisting across a loop region, not necessarily from within the loops.
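As an aside, the pseudocode quoted above, written out against GCC's cfgloop API, would look roughly like the sketch below. This is only a sketch: the helper name is mine and the details differ from the attached patch, and as Richard points out above the check is too coarse on its own.

  /* Rough sketch only (assumes cfgloop.h); the attached patch differs.
     Refuse hoisting into BLOCK when a successor edge enters a loop,
     via its header or a pre-header, nested more than 3 levels deep.  */

  static bool
  hoisting_crosses_deep_loop_nest_p (basic_block block)
  {
    edge e;
    edge_iterator ei;

    FOR_EACH_EDGE (e, ei, block->succs)
      {
        /* e->dest is a loop header if it heads its own loop_father.  */
        class loop *loop = e->dest->loop_father;
        if (e->dest == loop->header && loop_depth (loop) > 3)
          return true;

        /* Treat e->dest as a pre-header if its single successor heads
           a more deeply nested loop.  */
        if (single_succ_p (e->dest))
          {
            class loop *inner = single_succ (e->dest)->loop_father;
            if (single_succ (e->dest) == inner->header
                && loop_depth (inner) > 3)
              return true;
          }
      }

    return false;
  }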
> > > > > > >
> > > > > > > But it will fail to hoist *p in
> > > > > > >
> > > > > > >   if (x)
> > > > > > >     {
> > > > > > >       a = *p;
> > > > > > >     }
> > > > > > >   else
> > > > > > >     {
> > > > > > >       b = *p;
> > > > > > >     }
> > > > > > >
> > > > > > >   <make distance large>
> > > > > > > pdom:
> > > > > > >   c = *p;
> > > > > > >
> > > > > > > so it isn't what matters either. What happens at the immediate post-dominator isn't directly relevant - what matters would be if the pdom is the one making the value antic on one of the outgoing edges. In that case we're also going to PRE *p into the arm not computing *p (albeit in a later position). But that property is impossible to compute from the sets themselves (not even mentioning the arbitrary CFG that can be in between the block and its pdom, or the weird pdoms we compute for regions not having a path to exit, like infinite loops).
> > > > > > >
> > > > > > > You could eventually look at the pdom predecessors, and if *p is not AVAIL_OUT in each of them we _might_ have the situation you want to protect against. But as said, PRE insertion will likely have made sure it _is_ AVAIL_OUT in each of them ...
> > > > > > Hi Richard,
> > > > > > Thanks for the suggestions. Right, the issue seems to be that the post-dom block is making expressions ANTIC. Before doing insert, could we simply copy the AVAIL_OUT set of each block into another set, say ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions in the pred blocks of the pdom and making them available? And during hoisting, we could check whether each expr that is in ANTIC_IN (pdom) is in ORIG_AVAIL_OUT of each pred of the pdom, if the distance is "large".
> > > > >
> > > > > Did you try if it works w/o copying AVAIL_OUT? Because AVAIL_OUT is very large (it's actually quadratic in size of the CFG * # values); in particular we're inserting in RPO and update AVAIL_OUT only up to the current block (from dominators), so the PDOM block should have the original AVAIL_OUT (but from the last iteration - we do iterate insertion).
> > > > >
> > > > > Note I'm still not happy with pulling off this kind of heuristic. What the suggestion means is that for
> > > > >
> > > > >   if (x)
> > > > >     y = *p;
> > > > >   z = *p;
> > > > >
> > > > > we'll end up with
> > > > >
> > > > >   if (x)
> > > > >     y = *p;
> > > > >   else
> > > > >     z = *p;
> > > > >
> > > > > instead of
> > > > >
> > > > >   tem = *p;
> > > > >   if (x)
> > > > >     y = tem;
> > > > >   else
> > > > >     z = tem;
> > > > >
> > > > > that is, we get the runtime benefit but not the code-size one (hoisting mostly helps code size, plus allows if-conversion as a followup in some cases). Now, if we iterate (like if we'd have a second hoisting pass) then the above would still cause hoisting - so the heuristic isn't sustainable. Again, something like "distance" isn't really available.
> > > > Hi Richard,
> > > > It doesn't work without copying AVAIL_OUT.
> > > > I tried a small example with the attached patch:
> > > >
> > > >   int foo(int cond, int x, int y)
> > > >   {
> > > >     int t;
> > > >     void f(int);
> > > >
> > > >     if (cond)
> > > >       t = x + y;
> > > >     else
> > > >       t = x - y;
> > > >
> > > >     f (t);
> > > >     int t2 = (x + y) * 10;
> > > >     return t2;
> > > >   }
> > > >
> > > > By intersecting availout_in_some with AVAIL_OUT of the preds of the pdom, it does not hoist in the first pass, but then PRE inserts x + y in the "else" block, and we eventually hoist before if (cond). Similarly for the e_c3d hoistings in calculix.
> > > >
> > > > IIUC, we want hoisting to be:
> > > >   (ANTIC_IN (block) intersect AVAIL_OUT (preds of pdom)) - AVAIL_OUT (block)
> > > > to ensure that hoisted expressions lie along all paths from block to post-dom? If copying the AVAIL_OUT sets is too large, could we keep another set that precomputes the intersection of AVAIL_OUT of the pdom preds for each block, and then use this info during hoisting?
> > > >
> > > > For computing "distance", I implemented a simple DFS walk from the block to the post-dom that gives up if the depth crosses a threshold before reaching the post-dom. I am not sure, though, how expensive that can get.
> > >
> > > As written it is quadratic in the CFG size.
> > >
> > > You can optimize away the
> > >
> > > +  FOR_EACH_EDGE (e, ei, pdom_bb->preds)
> > > +    bitmap_and_into (&availout_in_some, &AVAIL_OUT (e->src)->values);
> > >
> > > loop if the intersection of availout_in_some and ANTIC_IN (pdom) is empty.
> > >
> > > As said, I don't think this is the way to go - trying to avoid code hoisting isn't what we'd want to do - your quoted assembly instead points to a loop with a non-empty latch, which is usually caused by PRE and avoided with -O3 because it also harms vectorization.
> > But disabling PRE (which removes the non-empty latch) only results in a marginal performance improvement, while disabling hoisting of the 'w' array, with the non-empty latch, seems to regain most of the performance. AFAIU, that was happening because after disabling hoisting of 'w' there wasn't a cache miss (as seen with perf -e L1-dcache-load-misses) on the load instruction inside the inner loop.
>
> But that doesn't make much sense then. If code generation isn't an issue, I don't see how the hoisted loads should cause an L1 dcache load miss for data that is accessed in the respective loop as well (though not hoisted from it, since at -O2 it is not sufficiently unrolled).
Hi Richard,
I am very sorry to respond late; I was away from work for some personal commitments and couldn't respond earlier.
Yes, I agree this doesn't seem to make much sense, but I am consistently seeing L1 dcache load misses, which go away after disabling hoisting of 'w'. I am not sure, though, why this happens. Also, the load instruction is the only one that shows a significant performance difference across several runs. Or maybe I am interpreting the results incorrectly. Do you have suggestions for any benchmarking experiment I could try?
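One experiment I am considering is an entirely synthetic microbenchmark that mimics the shape of the code - something like the sketch below, where the array sizes and trip counts are arbitrary placeholders. Comparing perf stat -e L1-dcache-load-misses between the -DHOIST=0 and -DHOIST=1 builds should show whether merely keeping the nine values live across the nest can perturb L1 behaviour:

  /* Synthetic sketch, not from calculix: with -DHOIST=1 the nine w[]
     values are loaded ("hoisted") before the loop nest and kept live
     across it; with -DHOIST=0 they are loaded only after the nest.
     Sizes and trip counts are placeholders.  */
  #include <stdio.h>

  double w[9];
  double anisox[81];   /* stand-in for the anisox accesses in the nest */

  int main (void)
  {
    double sum = 0.0;
    for (long rep = 0; rep < 50000000; rep++)
      {
        asm volatile ("" : : : "memory");  /* force the loads each rep */
  #if HOIST
        double h[9];
        for (int i = 0; i < 9; i++)        /* loads hoisted above the nest */
          h[i] = w[i];
  #endif
        for (int i = 0; i < 81; i++)       /* stand-in for the 6-deep nest */
          sum += anisox[i] * 0.5;
  #if HOIST
        for (int i = 0; i < 9; i++)
          sum += h[i];
  #else
        for (int i = 0; i < 9; i++)        /* same loads, after the nest */
          sum += w[i];
  #endif
      }
    printf ("%g\n", sum);
    return 0;
  }

Whether this actually reproduces the eviction is of course the question.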
> > For the pdom heuristic, I guess we cannot copy AVAIL_OUT sets per node, since that's quadratic in terms of CFG size.
> > Would it make sense to break the interaction between PRE and hoisting only for the case when inserting into the preds of the pdom? I tried doing that in the attached patch, where insert runs in two phases:
> > (a) PRE and hoisting, where hoisting marks blocks for PRE to skip.
> > (b) A second phase, which only runs PRE, on all blocks.
> > This (expectedly) regresses ssa-hoist-3.c.
> >
> > If the heuristic isn't acceptable, I suppose encoding the distance of an expr within the ANTIC sets during compute_antic would be the right way to fix this? So ANTIC_IN (block) would contain the anticipated expressions and, for each antic expr, the "distance" from the furthest block it's computed in? Could you please elaborate a bit on how we could go about encoding distance during compute_antic?
>
> But the distance in this case is just one CFG node ... we have
>
>   if (mattyp.eq.1)
>      ... use of w but not with constant indices
>   else if (mattyp.eq.2)
>      .. inlined orthonl with constant index w() accesses, single BB
>   else
>      ... use of w but not with constant indices - the actual relevant
>          loop of calculix
>   endif
>   ... constant index w() accesses, single BB
>
> so the CFG distance is one node - unless you want to compute the maximum distance? Btw, I only see 9 loads hoisted.
>
> I'm not sure how relevant -O2 -flto SPEC performance is for a FP benchmark.
>
> And indeed this case is exactly one where hoisting is superior to PRE, which would happily insert the 9 loads into the two variable-access predecessors to get rid of the redundancy wrt the mattyp.eq.1 path.
>
> In .optimized I see
>
>   pretmp_5573 = w[0];
>   pretmp_5574 = w[3];
>   pretmp_5575 = w[6];
>   pretmp_5576 = w[1];
>   pretmp_5577 = w[4];
>   pretmp_5578 = w[7];
>   pretmp_5579 = w[2];
>   pretmp_5580 = w[5];
>   pretmp_5581 = w[8];
>   if (mattyp.157_742 == 1)
>
> I do remember talks/patches about ordering such sequences of loads to make them prefetch-happier. Are the loads actually emitted in order for arm, thus w[0]...w[8], rather than as seen above with some random permutes in between? On x86 they are emitted in random order (they are also spilled immediately).
On aarch64, they are emitted in random order as well.
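For what it's worth, a reduced shape like the one below (synthetic, not the calculix source) can be used to eyeball the emission order: the nine w[] loads are redundant on both arms, so code hoisting should move them in front of the branch, and the -O2 -S output then shows what order they come out in.

  /* Synthetic reduction, not the calculix source: every w[] load is
     computed on both arms, so GIMPLE code hoisting (-fcode-hoisting,
     enabled at -O2) should place the nine loads before the branch;
     inspect the gcc -O2 -S output for their emission order.  */
  double w[9];

  double
  check_order (int cond, double x)
  {
    double a;
    if (cond)
      a = w[0] + w[1] + w[2] + w[3] + w[4]
          + w[5] + w[6] + w[7] + w[8] + x;
    else
      a = (w[0] - w[1]) * (w[2] - w[3]) * (w[4] - w[5])
          * (w[6] - w[7]) * w[8];
    return a * 2.0;
  }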
Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > Note the difficulty to capture "distance" is that the distance is simply not available at this point - it is the anticipated values from the successors that do _not_ compute the value itself that are the issue. To capture "distance" you'd need to somehow "age" anticipated values when propagating them upwards during compute_antic (where it would then also apply to PRE in general).
> > > > > > > > Yes, indeed. As a hack, would it make sense to avoid inserting an expression in the block if it's ANTIC in the post-dom block, as a trade-off between extending live ranges and hoisting, when the "distance" between the block and the post-dom is "too far"? In general, as you point out, we'd need to compute distance info for successor blocks during compute_antic, but special-casing the post-dom should be easy enough during do_hoist_insertion, and hoisting an expr that is ANTIC in the post-dom could potentially be "long range" if the region is large. It's still a coarse heuristic, though. I tried it in the attached patch.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > >
> > > > > > > > > As with all other heuristics, the only place one could do hackish attempts with at least some reasoning is the elimination phase, where we make use of the (hoist) insertions - of course for hoisting we already know we'll get the "close" use in one of the successors, so I fear even there it will be impossible to do something sensible.
> > > > > > > > >
> > > > > > > > > Richard.
> > > > > > > > > > Thanks,
> > > > > > > > > > Prathamesh
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Prathamesh
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Prathamesh
> > > > > > > > > > > > > Alexander