On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote: >> Hi, >> This is updated patch with review comments resolved. Some explanation >> embedded below. >> >>> + >>> + iloop->nb_iterations = nb_iterations; >>> >>> use std::swap? Also I think if you can keep nb_iterations you >>> can certainly keep the upper bounds. You're probably >>> afraid of the ->stmt references in the nb_iter_bound entries? >>> >>> Anyway, either scrap everything or try to keep everything. >> Yeah, not only the stmts, but also the control_iv information because the >> SCEV >> information may be corrupted during code transformation. >> Now I discarded all the information. > > Note that given you interchange the loops but not the CFG or the loop > structures > you might want to swap loop->num and flags like ->force_vectorize. That is, > essentially change the ->header/latch association (and other CFG related stuff > like recorded exits). > > It might also be we want to / need to disable interchange for, say, > ->force_vectorize > inner loops or ->unroll != 0? Or we need to clear them, maybe > optionally diagnosing > that fact. > > At least we need to think about what it means to preserve loop > structure (semantically, > loop->num should maintain association to the same source-level loop > throughout the > compilation) for transforms like interchange. > >>> >>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >>> + { >>> + if (re->type != DOUBLE_RTYPE) >>> + gcc_unreachable (); >>> + >>> + use_operand_p use_p; >>> + imm_use_iterator iterator; >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >>> + if (TREE_CODE (re->init) == SSA_NAME) >>> + { >>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >>> + } >>> >>> can you add a comment what you are doing here? >>> >>> Note that other loop opts simply scrap all debug stmts ... >> As mentioned above, updated patch doesn't try hard to maintain debug use >> info any more. >> >>> >>> +static void >>> +compute_access_stride (struct loop *loop_nest, struct loop *loop, >>> + data_reference_p dr) >>> +{ >>> ... >>> + tree ref = DR_REF (dr); >>> + tree scev_base = build_fold_addr_expr (ref); >>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); >>> + >>> + do { >>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >>> + if (chrec_contains_undetermined (scev_fn) >>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) >>> + break; >>> ... >>> + strides->safe_push (scev_step); >>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); >>> + >>> >>> I _think_ you want to do >>> >>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming >>> DR_STMT (dr) is in loop >>> scev_fn = instantiate_parameters (nest, scev_fn); >>> if (chrec_contains_undetermined (scev_fn)) >>> return; // false? >>> >>> and analyze the result which should be of the form >>> >>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >>> >>> if canonical. I think estimate_val_by_simplify_replace isn't needed >>> if you do that >>> (it also looks odd to replace all vairables in step by niter...). >> I replied on this in previous message, instantiate_parameters doesn't always >> give canonical form result as expected. The loop here could be seen as a >> local instantiate process, right? > > Kind of. I'll see if I can reproduce the difference with any of your > intercahnge > testcases - any hint which one to look at? For added tests, I think there will be no difference between the two. I noticed the difference for pointer cases like: for (i...) for (j...) for (k...) p[i*n*n + j*n+ k] =...
> >> Also estimate_val_by_simplify_replace is needed for pointers, where strides >> are computed from niters of loops which could be non compilation time >> constant. >> But yes, it's an odd fixup after I failed to do anything better. > > But you are for example computing _1 - _2 to zero, right? Because both _1 > and _2 are not constant and thus you replace it with the same (symbolical) > constant 'niter'. > > I think that asks for garbage-in-garbage-out ... > > Which testcase is this important for so I can have a look? So far this is only for the above pointer case. Actually I don't think it's that important, and thought about skip it. So we don't have to do estimate_val_by_simplify_replace. Thanks, bin > >>> >>> I think keeping the chrec in the above form is also more suitable for what >>> the caller does so the outermost loop is simply >>> >>> loop = loop_nest; >>> loop-over-all-dr-chrecs >>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >>> loop = CHREC_LOOP (chrec); >>> >>> given the outermost loop is the outer evolution. If you sort the >>> stride vecs from inner >>> to outer you don't need prune_access_strides_not_in_loop. >> Hmm, So stripping outer loops prefer inner to outer sort of strides, but >> cost computation >> during interchange prefers outer to inner sort because loop_nest in >> tree-data-ref is sorted >> in this way. Seems a single prune_* function is better than fiddling with >> cost computation. > > Not sure how to interpret your answer... I'll see to have a more > detailed suggestion > after playing with the code a bit. > >>> >>> +/* Count and return the number of loops in LOOP_NEST. */ >>> + >>> +unsigned int >>> +num_loops_in_loop_nest (struct loop *loop_nest) >>> +{ >>> + unsigned num_loops; >>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner) >>> + ; >>> + return num_loops; >>> >>> loop_depth (innermost) - loop_depth (nest)? >> Done. >> >>> >>> +static bool >>> +should_interchange_loops (unsigned i_idx, unsigned o_idx, >>> + vec<data_reference_p> datarefs, >>> + bool innermost_loops_p, bool dump_info_p = true) >>> +{ >>> >>> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs >>> and figure a permutation sequence to arrive there? That is for the local >>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] >>> when we should interchange? >>> >>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops >>> where a DR is invariant you simply do not have an evolution in that loop. >>> You seem to simply add strides in the inner and outer loops for each DR, >>> can you explain how that works? Also I guess strides bigger than the >>> various cache-line size parameters should be treated 'equal'? That is, >>> if we don't get any DR to a step that results in L1 cache hits because >>> two DRs share a cache line the interchange is pointless? >> So given below loop: >> >> for (int i = 0; i < M; i++) >> for (int j = 0; j < M; j++) >> for (int k = 0; k < M; k++) >> a[k][0][i] = b[k][0][i] >> >> We check if memory reference is invariant wrto a loop only if it has zero >> strides within >> current loop nest. In this example, there is no invariant given address >> changes in the >> innermost loop. > > But they simply wouldn't take part in the sorting? That is, invariant > refs in a loop > shouldn't prevent it becoming more inner or more outer, no? > >> For strides bigger than cache-line size, it's also possible the interchange >> is wanted, as >> in below example: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < M; k++) //loop 3 >> a[j][i][k] = b[j][i][k] >> >> Strides for loop 1/2 are very like to be big, but after interchange, we will >> have stream >> access of both arrays. >> >> More advanced heuristics may be possible, but so far the estimation is quite >> good by >> checking all interchanges I looked into. >> >>> >>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ >>> + >>> +static inline void >>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> >>> datarefs) >>> +{ >>> + struct data_reference *dr; >>> + >>> + for (unsigned i = 0; datarefs.iterate (i, &dr);) >>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >>> + i++; >>> + else >>> + { >>> + datarefs.ordered_remove (i); >>> >>> that's expensive. It's better to keep moving DRs we want to keep >>> when walking the array. That is, add a j you increment only when >>> we keep a DR, moving *i to *j. >> Done. >> >>> >>> + if (dr->aux) >>> + { >>> + DR_ACCESS_STRIDE (dr)->release (); >>> + free (dr->aux); >>> + } >>> + free_data_ref (dr); >>> + } >>> +} >>> >>> + >>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop); >>> + >>> >>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest >>> wrt the nest you can figure if it has any evolution in sth else than the >>> loop (evolution_function_is_univariate_p). That is, this is not a problem >>> until you arrive at analyzing DR strides, right? At which point you >>> can check for the appropriate form? >> Hmm, not really. The niter relation may not appear in SCEV of reference >> addr. >> For example, below loop: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < i; k++) //loop 3 >> a[k][0][i] = b[k][0][i] >> >> There is no information in data reference about i/j loops. >> Anyway, I refactored the code and put this check in >> proper_loop_form_for_interchange. >> Simpler I think. >> >>> >>> + if (find_data_references_in_loop (start_loop, datarefs) == >>> chrec_dont_know >>> + /* Check if there is no data reference. */ >>> + || datarefs->length () == 0 >>> + /* Check if there are too many of data references. */ >>> + || ((int) datarefs->length () >>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >>> + /* Check if there is any data reference in loop latch. We can't >>> handle >>> + loops which loop header and data references have different >>> execution >>> + times. */ >>> + || dataref_niters_diff_to_loop_header (*datarefs) >>> >>> this suggests to do your own find_data_references_in_loop so you can early >>> out. >> I refactored the code a bit. Now this check is in >> proper_loop_form_for_interchange, >> but I do customized my own data references finder. It's needed to strip >> outer loops >> once a difficult reference is found. >> >>> >>> Overall the flow through the pass is a bit hard to follow given there are >>> IMHO too many functions. >> Yeah, I removed quite number of small functions and refactor the code a lot. >> Hope this >> version is more straightforward. >>> >>> +unsigned int >>> +pass_linterchange::execute (function *fun) >>> +{ >>> + if (number_of_loops (fun) <= 2) >>> + return 0; >>> + >>> + bool changed_p = false;; >>> + struct loop *loop; >>> + vec<loop_p> loop_nest; >>> + vec<data_reference_p> datarefs; >>> + vec<ddr_p> ddrs; >>> + >>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) >>> + { >>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs); >>> + changed_p |= loop_interchange.interchange (); >>> + } >>> >>> you leak datarefs/ddrs? >> It was release in destructor, but I refactored it anyway. I will push the >> code to branch >> gcc.gnu.org/svn/gcc/branches/gimple-linterchange. >> >> Thanks again for the comment of you two. > > Digging into the code now... > > Richard. > >> Thanks, >> bin >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * Makefile.in (gimple-loop-interchange.o): New object file. >> * common.opt (floop-interchange): Reuse the option from graphite. >> * doc/invoke.texi (-floop-interchange): Ditto. New document. >> * gimple-loop-interchange.cc: New file. >> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. >> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >> * passes.def (pass_linterchange): New pass. >> * timevar.def (TV_LINTERCHANGE): New time var. >> * tree-pass.h (make_pass_linterchange): New declaration. >> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external >> interchange. Record IV before/after increment in new parameters. >> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >> >> gcc/testsuite >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.