On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni > <prathamesh.kulka...@linaro.org> wrote: > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni > > > <prathamesh.kulka...@linaro.org> wrote: > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amona...@ispras.ru> > > > > wrote: > > > > > > > > > > > I obtained perf stat results for following benchmark runs: > > > > > > > > > > > > -O2: > > > > > > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs > > > > > > utilized > > > > > > 3758 context-switches # > > > > > > 0.000 K/sec > > > > > > 40 cpu-migrations # > > > > > > 0.000 K/sec > > > > > > 40847 page-faults # > > > > > > 0.005 K/sec > > > > > > 7856782413676 cycles # 1.000 > > > > > > GHz > > > > > > 6034510093417 instructions # 0.77 > > > > > > insn per cycle > > > > > > 363937274287 branches # 46.321 > > > > > > M/sec > > > > > > 48557110132 branch-misses # 13.34% of > > > > > > all branches > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute > > > > > should be > > > > > enough for this kind of code) > > > > > > > > > > > -O2 with orthonl inlined: > > > > > > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs > > > > > > utilized > > > > > > 4285 context-switches # > > > > > > 0.001 K/sec > > > > > > 28 cpu-migrations # > > > > > > 0.000 K/sec > > > > > > 40843 page-faults # > > > > > > 0.005 K/sec > > > > > > 8319591038295 cycles # 1.000 > > > > > > GHz > > > > > > 6276338800377 instructions # 0.75 > > > > > > insn per cycle > > > > > > 467400726106 branches # 56.180 > > > > > > M/sec > > > > > > 45986364011 branch-misses # 9.84% of > > > > > > all branches > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, > > > > > probably implying > > > > > that extra instructions are appearing in this loop nest, but not in > > > > > the innermost > > > > > loop. As a reminder for others, the innermost loop has only 3 > > > > > iterations. > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra > > > > > > branches): > > > > > > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized > > > > > > 2266 context-switches # 0.000 > > > > > > K/sec > > > > > > 32 cpu-migrations # 0.000 > > > > > > K/sec > > > > > > 40846 page-faults # 0.005 > > > > > > K/sec > > > > > > 8207292032467 cycles # 1.000 GHz > > > > > > 6035724436440 instructions # 0.74 insn > > > > > > per cycle > > > > > > 364415440156 branches # 44.401 M/sec > > > > > > 53138327276 branch-misses # 14.58% of all > > > > > > branches > > > > > > > > > > This seems to match baseline in terms of instruction count, but > > > > > without PRE > > > > > the loop nest may be carrying some dependencies over memory. I would > > > > > simply > > > > > check the assembly for the entire 6-level loop nest in question, I > > > > > hope it's > > > > > not very complicated (though Fortran array addressing...). > > > > > > > > > > > -O2 with orthonl inlined and hoisting disabled: > > > > > > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs > > > > > > utilized > > > > > > 3139 context-switches # > > > > > > 0.000 K/sec > > > > > > 20 cpu-migrations # > > > > > > 0.000 K/sec > > > > > > 40846 page-faults # > > > > > > 0.005 K/sec > > > > > > 7797221351467 cycles # 1.000 > > > > > > GHz > > > > > > 6187348757324 instructions # 0.79 > > > > > > insn per cycle > > > > > > 461840800061 branches # 59.231 > > > > > > M/sec > > > > > > 26920311761 branch-misses # 5.83% of > > > > > > all branches > > > > > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in > > > > > cycle count. > > > > > I don't think the former fully covers the latter (there's also a 90e9 > > > > > reduction > > > > > in insn count). > > > > > > > > > > Given that the inner loop iterates only 3 times, my main suggestion > > > > > is to > > > > > consider how the profile for the entire loop nest looks like (it's 6 > > > > > loops deep, > > > > > each iterating only 3 times). > > > > > > > > > > > Perf profiles for > > > > > > -O2 -fno-code-hoisting and inlined orthonl: > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data > > > > > > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248] > > > > > > 216348301800│ add w0, w0, #0x1 > > > > > > 985098 | add x2, x2, #0x18 > > > > > > 216215999206│ add x1, x1, #0x48 > > > > > > 215630376504│ fmul d1, d5, d1 > > > > > > 863829148015│ fmul d1, d1, d6 > > > > > > 864228353526│ fmul d0, d1, d0 > > > > > > 864568163014│ fmadd d2, d0, d16, d2 > > > > > > │ cmp w0, #0x4 > > > > > > 216125427594│ ↓ b.eq 1f34 > > > > > > 15010377│ ldur d0, [x2, #-8] > > > > > > 143753737468│ ↑ b 1f04 > > > > > > > > > > > > -O2 with inlined orthonl: > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data > > > > > > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248] > > > > > > 144055883055│ add w0, w0, #0x1 > > > > > > 72262104254│ add x2, x2, #0x18 > > > > > > 143991169721│ add x1, x1, #0x48 > > > > > > 288648917780│ fmul d15, d17, d15 > > > > > > 864665644756│ fmul d15, d15, d18 > > > > > > 863868426387│ fmul d14, d15, d14 > > > > > > 865228159813│ fmadd d16, d14, d31, d16 > > > > > > 245967│ cmp w0, #0x4 > > > > > > 215396760545│ ↓ b.eq 1f28 > > > > > > 704732365│ ldur d14, [x2, #-8] > > > > > > 143775979620│ ↑ b 1ef8 > > > > > > > > > > This indicates that the loop only covers about 46-48% of overall time. > > > > > > > > > > High count on the initial ldur instruction could be explained if the > > > > > loop > > > > > is not entered by "fallthru" from the preceding block, or if its > > > > > backedge > > > > > is mispredicted. Sampling mispredictions should be possible with perf > > > > > record, > > > > > and you may be able to check if loop entry is fallthrough by > > > > > inspecting > > > > > assembly. > > > > > > > > > > It may also be possible to check if code alignment matters, by > > > > > compiling with > > > > > -falign-loops=32. > > > > Hi, > > > > Thanks a lot for the detailed feedback, and I am sorry for late > > > > response. > > > > > > > > The hoisting region is: > > > > if(mattyp.eq.1) then > > > > 4 loops > > > > elseif(mattyp.eq.2) then > > > > { > > > > orthonl inlined into basic block; > > > > loads w[0] .. w[8] > > > > } > > > > else > > > > 6 loops // load anisox > > > > > > > > followed by basic block: > > > > > > > > senergy= > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1)) > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2) > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight > > > > s(ii1,jj1)=s(ii1,jj1)+senergy > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy > > > > > > > > Hoisting hoists loads w[0] .. w[8] from orthonl and senergy block, > > > > right in block 181, which is: > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193> > > > > > > > > which is then further hoisted to block 173: > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181> > > > > > > > > From block 181, we have two paths towards senergy block (bb 194): > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block) > > > > AND > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194 > > > > which has a path length of around 18 blocks. > > > > (bb 194 post-dominates bb 181 and bb 173). > > > > > > > > Disabling only load hoisting within blocks 173 and 181 > > > > (simply avoid inserting pre_expr if pre_expr->kind == REFERENCE), > > > > avoid hoisting of 'w' array and brings back most of performance. Which > > > > verifies that it is hoisting of the > > > > 'w' array (w[0] ... w[8]), which is causing the slowdown ? > > > > > > > > I obtained perf profiles for full hoisting, and disabled hoisting of > > > > 'w' array for the 6 loops, and the most drastic difference was > > > > for ldur instruction: > > > > > > > > With full hoisting: > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248] > > > > > > > > Without full hoisting: > > > > 3441224 │1edc: ldur d1, [x1, #-248] > > > > > > > > (The loop entry seems to be fall thru in both cases. I have attached > > > > profiles for both cases). > > > > > > > > IIUC, the instruction seems to be loading the first element from anisox > > > > array, > > > > which makes me wonder if the issue was with data-cache miss for slower > > > > version. > > > > I ran perf script on perf data for L1-dcache-load-misses with period = > > > > 1million, > > > > and it reported two cache misses on the ldur instruction in full > > > > hoisting case, > > > > while it reported zero for the disabled load hoisting case. > > > > So I wonder if the slowdown happens because hoisting of 'w' array > > > > possibly results > > > > in eviction of anisox thus causing a cache miss inside the inner loop > > > > and making load slower ? > > > > > > > > Hoisting also seems to improve the number of overall cache misses tho. > > > > For disabled hoisting of 'w' array case, there were a total of 463 > > > > cache misses, while with full hoisting there were 357 cache misses > > > > (with period = 1 million). > > > > Does that happen because hoisting probably reduces cache misses along > > > > the orthonl path (bb 173 - > bb 181 -> bb 182 -> bb 194) ? > > > Hi, > > > In general I feel for this or PR80155 case, the issues come with long > > > range hoistings, inside a large CFG, since we don't have an accurate > > > way to model target resources (register pressure in PR80155 case / or > > > possibly cache pressure in this case?) at tree level and we end up > > > with register spill or cache miss inside loops, which may offset the > > > benefit of hoisting. As previously discussed the right way to go is a > > > live range splitting pass, at GIMPLE -> RTL border which can also help > > > with other code-movement optimizations (or if the source had variables > > > with long live ranges). > > > > > > I was wondering tho as a cheap workaround, would it make sense to > > > check if we are hoisting across a "large" region of nested loops, and > > > avoid in that case since hoisting may exert resource pressure inside > > > loop region ? (Especially, in the cases where hoisted expressions were > > > not originally AVAIL in any of the loop blocks, and the loop region > > > doesn't benefit from hoisting). > > > > > > For instance: > > > FOR_EACH_EDGE (e, ei, block) > > > { > > > /* Avoid hoisting across more than 3 nested loops */ > > > if (e->dest is a loop pre-header or loop header > > > && nesting depth of loop is > 3) > > > return false; > > > } > > > > > > I think this would work for resolving the calculix issue because it > > > hoists across one region of 6 loops and another of 4 loops (didn' test > > > yet). > > > It's not bulletproof in that it will miss detecting cases where loop > > > header (or pre-header) isn't a successor of candidate block (checking > > > for > > > that might get expensive tho?). I will test it on gcc suite and SPEC > > > for any regressions. > > > Does this sound like a reasonable heuristic ? > > Hi, > > The attached patch implements the above heuristic. > > Bootstrapped + tested on x86_64-linux-gnu with no regressions. > > And it brings back most of performance for calculix on par with O2 > > (without inlining orthonl). > > I verified that with patch there is no cache miss happening on load > > insn inside loop > > (with perf report -e L1-dcache-load-misses/period=1000000/) > > > > I am in the process of benchmarking the patch on aarch64 for SPEC for > > speed and will report numbers > > in couple of days. (If required, we could parametrize number of nested > > loops, hardcoded (arbitrarily to) 3 in this patch, > > and set it in backend to not affect other targets). > > I don't think this patch captures the case in a sensible way - it will simply > never hoist computations out of loop header blocks with depth > 3 which > is certainly not what you want. Also the pre-header check is odd - we're > looking for computations in successors of BLOCK but clearly a computation > in a pre-header is not at the same loop level as one in the header itself. Well, my intent was to check if we are hoisting across a region, which has multiple nested loops, and in that case, avoid hoisting expressions that do not belong to any loop blocks, because that may increase resource pressure inside loops. For instance, in the calculix issue we hoist 'w' array from post-dom and neither loop region has any uses of 'w'. I agree checking just for loop level is too coarse. The check with pre-header was essentially the same to see if we are hoisting across a loop region, not necessarily from within the loops. > > Note the difficulty to capture "distance" is that the distance is simply not > available at this point - it is the anticipated values from the successors > that do _not_ compute the value itself that are the issue. To capture > "distance" you'd need to somehow "age" anticipated value when > propagating them upwards during compute_antic (where it then > would also apply to PRE in general). Yes, indeed. As a hack, would it make sense to avoid inserting an expression in the block, if it's ANTIC in post-dom block as a trade-off between extending live range and hoisting if the "distance" between block and post-dom is "too far" ? In general, as you point out, we'd need to compute, distance info for successors block during compute_antic, but special casing for post-dom should be easy enough during do_hoist_insertion, and hoisting an expr that is ANTIC in post-dom could be potentially "long range", if the region is large. It's still a coarse heuristic tho. I tried it in the attached patch.
Thanks, Prathamesh > > As with all other heuristics the only place one could do hackish attempts > with at least some reasoning is the elimination phase where > we make use of the (hoist) insertions - of course for hoisting we already > know we'll get the "close" use in one of the successors so I fear even > there it will be impossible to do something sensible. > > Richard. > > > Thanks, > > Prathamesh > > > > > > Thanks, > > > Prathamesh > > > > > > > > > > > > Thanks, > > > Prathamesh > > > > > > > > Thanks, > > > > Prathamesh > > > > > > > > > > Alexander
gnu-659-pdom-3.diff
Description: Binary data