On Thu, Nov 30, 2017 at 4:09 PM, Richard Biener <richard.guent...@gmail.com> wrote: > On Thu, Nov 30, 2017 at 3:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener >> <richard.guent...@gmail.com> wrote: >>> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>> Hi, >>>> This is updated patch with review comments resolved. Some explanation >>>> embedded below. >>>> >>>>> + >>>>> + iloop->nb_iterations = nb_iterations; >>>>> >>>>> use std::swap? Also I think if you can keep nb_iterations you >>>>> can certainly keep the upper bounds. You're probably >>>>> afraid of the ->stmt references in the nb_iter_bound entries? >>>>> >>>>> Anyway, either scrap everything or try to keep everything. >>>> Yeah, not only the stmts, but also the control_iv information because the >>>> SCEV >>>> information may be corrupted during code transformation. >>>> Now I discarded all the information. >>> >>> Note that given you interchange the loops but not the CFG or the loop >>> structures >>> you might want to swap loop->num and flags like ->force_vectorize. That is, >>> essentially change the ->header/latch association (and other CFG related >>> stuff >>> like recorded exits). >>> >>> It might also be we want to / need to disable interchange for, say, >>> ->force_vectorize >>> inner loops or ->unroll != 0? Or we need to clear them, maybe >>> optionally diagnosing >>> that fact. >>> >>> At least we need to think about what it means to preserve loop >>> structure (semantically, >>> loop->num should maintain association to the same source-level loop >>> throughout the >>> compilation) for transforms like interchange. >>> >>>>> >>>>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >>>>> + { >>>>> + if (re->type != DOUBLE_RTYPE) >>>>> + gcc_unreachable (); >>>>> + >>>>> + use_operand_p use_p; >>>>> + imm_use_iterator iterator; >>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >>>>> + if (TREE_CODE (re->init) == SSA_NAME) >>>>> + { >>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >>>>> + } >>>>> >>>>> can you add a comment what you are doing here? >>>>> >>>>> Note that other loop opts simply scrap all debug stmts ... >>>> As mentioned above, updated patch doesn't try hard to maintain debug use >>>> info any more. >>>> >>>>> >>>>> +static void >>>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop, >>>>> + data_reference_p dr) >>>>> +{ >>>>> ... >>>>> + tree ref = DR_REF (dr); >>>>> + tree scev_base = build_fold_addr_expr (ref); >>>>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >>>>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >>>>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); >>>>> + >>>>> + do { >>>>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >>>>> + if (chrec_contains_undetermined (scev_fn) >>>>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) >>>>> + break; >>>>> ... >>>>> + strides->safe_push (scev_step); >>>>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); >>>>> + >>>>> >>>>> I _think_ you want to do >>>>> >>>>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming >>>>> DR_STMT (dr) is in loop >>>>> scev_fn = instantiate_parameters (nest, scev_fn); >>>>> if (chrec_contains_undetermined (scev_fn)) >>>>> return; // false? >>>>> >>>>> and analyze the result which should be of the form >>>>> >>>>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >>>>> >>>>> if canonical. I think estimate_val_by_simplify_replace isn't needed >>>>> if you do that >>>>> (it also looks odd to replace all vairables in step by niter...). >>>> I replied on this in previous message, instantiate_parameters doesn't >>>> always >>>> give canonical form result as expected. The loop here could be seen as a >>>> local instantiate process, right? >>> >>> Kind of. I'll see if I can reproduce the difference with any of your >>> intercahnge >>> testcases - any hint which one to look at? >> For added tests, I think there will be no difference between the two. >> I noticed the difference for >> pointer cases like: >> for (i...) >> for (j...) >> for (k...) >> p[i*n*n + j*n+ k] =... >> >>> >>>> Also estimate_val_by_simplify_replace is needed for pointers, where strides >>>> are computed from niters of loops which could be non compilation time >>>> constant. >>>> But yes, it's an odd fixup after I failed to do anything better. >>> >>> But you are for example computing _1 - _2 to zero, right? Because both _1 >>> and _2 are not constant and thus you replace it with the same (symbolical) >>> constant 'niter'. >>> >>> I think that asks for garbage-in-garbage-out ... >>> >>> Which testcase is this important for so I can have a look? >> So far this is only for the above pointer case. Actually I don't >> think it's that important, and thought about skip it. >> So we don't have to do estimate_val_by_simplify_replace. > > Ok, I'd like to "dumb" the pass down to the level we can solve the > bwave case (which I realize is already one of the more complicated ones). > > Just because it's already late for GCC 8.
For reference I'll commit the following tomorrow, will play with adding a testcase for bwaves and doing the multiple_of_p thing we talked about. Richard. > Richard. > >> Thanks, >> bin >>> >>>>> >>>>> I think keeping the chrec in the above form is also more suitable for what >>>>> the caller does so the outermost loop is simply >>>>> >>>>> loop = loop_nest; >>>>> loop-over-all-dr-chrecs >>>>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >>>>> loop = CHREC_LOOP (chrec); >>>>> >>>>> given the outermost loop is the outer evolution. If you sort the >>>>> stride vecs from inner >>>>> to outer you don't need prune_access_strides_not_in_loop. >>>> Hmm, So stripping outer loops prefer inner to outer sort of strides, but >>>> cost computation >>>> during interchange prefers outer to inner sort because loop_nest in >>>> tree-data-ref is sorted >>>> in this way. Seems a single prune_* function is better than fiddling with >>>> cost computation. >>> >>> Not sure how to interpret your answer... I'll see to have a more >>> detailed suggestion >>> after playing with the code a bit. >>> >>>>> >>>>> +/* Count and return the number of loops in LOOP_NEST. */ >>>>> + >>>>> +unsigned int >>>>> +num_loops_in_loop_nest (struct loop *loop_nest) >>>>> +{ >>>>> + unsigned num_loops; >>>>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = >>>>> loop_nest->inner) >>>>> + ; >>>>> + return num_loops; >>>>> >>>>> loop_depth (innermost) - loop_depth (nest)? >>>> Done. >>>> >>>>> >>>>> +static bool >>>>> +should_interchange_loops (unsigned i_idx, unsigned o_idx, >>>>> + vec<data_reference_p> datarefs, >>>>> + bool innermost_loops_p, bool dump_info_p = true) >>>>> +{ >>>>> >>>>> isn't all we need associating the above CHREC to sort after the >>>>> CHREC_RIGHTs >>>>> and figure a permutation sequence to arrive there? That is for the local >>>>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] >>>>> when we should interchange? >>>>> >>>>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For >>>>> loops >>>>> where a DR is invariant you simply do not have an evolution in that loop. >>>>> You seem to simply add strides in the inner and outer loops for each DR, >>>>> can you explain how that works? Also I guess strides bigger than the >>>>> various cache-line size parameters should be treated 'equal'? That is, >>>>> if we don't get any DR to a step that results in L1 cache hits because >>>>> two DRs share a cache line the interchange is pointless? >>>> So given below loop: >>>> >>>> for (int i = 0; i < M; i++) >>>> for (int j = 0; j < M; j++) >>>> for (int k = 0; k < M; k++) >>>> a[k][0][i] = b[k][0][i] >>>> >>>> We check if memory reference is invariant wrto a loop only if it has zero >>>> strides within >>>> current loop nest. In this example, there is no invariant given address >>>> changes in the >>>> innermost loop. >>> >>> But they simply wouldn't take part in the sorting? That is, invariant >>> refs in a loop >>> shouldn't prevent it becoming more inner or more outer, no? >>> >>>> For strides bigger than cache-line size, it's also possible the >>>> interchange is wanted, as >>>> in below example: >>>> >>>> for (int i = 0; i < M; i++) //loop 1 >>>> for (int j = 0; j < M; j++) //loop 2 >>>> for (int k = 0; k < M; k++) //loop 3 >>>> a[j][i][k] = b[j][i][k] >>>> >>>> Strides for loop 1/2 are very like to be big, but after interchange, we >>>> will have stream >>>> access of both arrays. >>>> >>>> More advanced heuristics may be possible, but so far the estimation is >>>> quite good by >>>> checking all interchanges I looked into. >>>> >>>>> >>>>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ >>>>> + >>>>> +static inline void >>>>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> >>>>> datarefs) >>>>> +{ >>>>> + struct data_reference *dr; >>>>> + >>>>> + for (unsigned i = 0; datarefs.iterate (i, &dr);) >>>>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >>>>> + i++; >>>>> + else >>>>> + { >>>>> + datarefs.ordered_remove (i); >>>>> >>>>> that's expensive. It's better to keep moving DRs we want to keep >>>>> when walking the array. That is, add a j you increment only when >>>>> we keep a DR, moving *i to *j. >>>> Done. >>>> >>>>> >>>>> + if (dr->aux) >>>>> + { >>>>> + DR_ACCESS_STRIDE (dr)->release (); >>>>> + free (dr->aux); >>>>> + } >>>>> + free_data_ref (dr); >>>>> + } >>>>> +} >>>>> >>>>> + >>>>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, >>>>> start_loop); >>>>> + >>>>> >>>>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest >>>>> wrt the nest you can figure if it has any evolution in sth else than the >>>>> loop (evolution_function_is_univariate_p). That is, this is not a >>>>> problem >>>>> until you arrive at analyzing DR strides, right? At which point you >>>>> can check for the appropriate form? >>>> Hmm, not really. The niter relation may not appear in SCEV of reference >>>> addr. >>>> For example, below loop: >>>> >>>> for (int i = 0; i < M; i++) //loop 1 >>>> for (int j = 0; j < M; j++) //loop 2 >>>> for (int k = 0; k < i; k++) //loop 3 >>>> a[k][0][i] = b[k][0][i] >>>> >>>> There is no information in data reference about i/j loops. >>>> Anyway, I refactored the code and put this check in >>>> proper_loop_form_for_interchange. >>>> Simpler I think. >>>> >>>>> >>>>> + if (find_data_references_in_loop (start_loop, datarefs) == >>>>> chrec_dont_know >>>>> + /* Check if there is no data reference. */ >>>>> + || datarefs->length () == 0 >>>>> + /* Check if there are too many of data references. */ >>>>> + || ((int) datarefs->length () >>>>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >>>>> + /* Check if there is any data reference in loop latch. We can't >>>>> handle >>>>> + loops which loop header and data references have different >>>>> execution >>>>> + times. */ >>>>> + || dataref_niters_diff_to_loop_header (*datarefs) >>>>> >>>>> this suggests to do your own find_data_references_in_loop so you can early >>>>> out. >>>> I refactored the code a bit. Now this check is in >>>> proper_loop_form_for_interchange, >>>> but I do customized my own data references finder. It's needed to strip >>>> outer loops >>>> once a difficult reference is found. >>>> >>>>> >>>>> Overall the flow through the pass is a bit hard to follow given there are >>>>> IMHO too many functions. >>>> Yeah, I removed quite number of small functions and refactor the code a >>>> lot. Hope this >>>> version is more straightforward. >>>>> >>>>> +unsigned int >>>>> +pass_linterchange::execute (function *fun) >>>>> +{ >>>>> + if (number_of_loops (fun) <= 2) >>>>> + return 0; >>>>> + >>>>> + bool changed_p = false;; >>>>> + struct loop *loop; >>>>> + vec<loop_p> loop_nest; >>>>> + vec<data_reference_p> datarefs; >>>>> + vec<ddr_p> ddrs; >>>>> + >>>>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >>>>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) >>>>> + { >>>>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, >>>>> ddrs); >>>>> + changed_p |= loop_interchange.interchange (); >>>>> + } >>>>> >>>>> you leak datarefs/ddrs? >>>> It was release in destructor, but I refactored it anyway. I will push the >>>> code to branch >>>> gcc.gnu.org/svn/gcc/branches/gimple-linterchange. >>>> >>>> Thanks again for the comment of you two. >>> >>> Digging into the code now... >>> >>> Richard. >>> >>>> Thanks, >>>> bin >>>> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >>>> >>>> * Makefile.in (gimple-loop-interchange.o): New object file. >>>> * common.opt (floop-interchange): Reuse the option from graphite. >>>> * doc/invoke.texi (-floop-interchange): Ditto. New document. >>>> * gimple-loop-interchange.cc: New file. >>>> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. >>>> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >>>> * passes.def (pass_linterchange): New pass. >>>> * timevar.def (TV_LINTERCHANGE): New time var. >>>> * tree-pass.h (make_pass_linterchange): New declaration. >>>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external >>>> interchange. Record IV before/after increment in new parameters. >>>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >>>> >>>> gcc/testsuite >>>> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >>>> >>>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >>>> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
interchange
Description: Binary data