Re: [PATCH GCC][V2]A simple implementation of loop interchange

Richard Biener Thu, 30 Nov 2017 07:52:26 -0800

On Thu, Nov 30, 2017 at 4:09 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Nov 30, 2017 at 3:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This is updated patch with review comments resolved.  Some explanation 
>>>> embedded below.
>>>>
>>>>> +
>>>>> +  iloop->nb_iterations = nb_iterations;
>>>>>
>>>>> use std::swap?  Also I think if you can keep nb_iterations you
>>>>> can certainly keep the upper bounds.  You're probably
>>>>> afraid of the ->stmt references in the nb_iter_bound entries?
>>>>>
>>>>> Anyway, either scrap everything or try to keep everything.
>>>> Yeah, not only the stmts, but also the control_iv information because the 
>>>> SCEV
>>>> information may be corrupted during code transformation.
>>>> Now I discarded all the information.
>>>
>>> Note that given you interchange the loops but not the CFG or the loop 
>>> structures
>>> you might want to swap loop->num and flags like ->force_vectorize.  That is,
>>> essentially change the ->header/latch association (and other CFG related 
>>> stuff
>>> like recorded exits).
>>>
>>> It might also be we want to / need to disable interchange for, say,
>>> ->force_vectorize
>>> inner loops or ->unroll != 0?  Or we need to clear them, maybe
>>> optionally diagnosing
>>> that fact.
>>>
>>> At least we need to think about what it means to preserve loop
>>> structure (semantically,
>>> loop->num should maintain association to the same source-level loop
>>> throughout the
>>> compilation) for transforms like interchange.
>>>
>>>>>
>>>>> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
>>>>> +    {
>>>>> +      if (re->type != DOUBLE_RTYPE)
>>>>> +       gcc_unreachable ();
>>>>> +
>>>>> +      use_operand_p use_p;
>>>>> +      imm_use_iterator iterator;
>>>>> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
>>>>> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
>>>>> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
>>>>> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
>>>>> +      if (TREE_CODE (re->init) == SSA_NAME)
>>>>> +       {
>>>>> +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
>>>>> +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
>>>>> +       }
>>>>>
>>>>> can you add a comment what you are doing here?
>>>>>
>>>>> Note that other loop opts simply scrap all debug stmts ...
>>>> As mentioned above, updated patch doesn't try hard to maintain debug use 
>>>> info any more.
>>>>
>>>>>
>>>>> +static void
>>>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
>>>>> +                      data_reference_p dr)
>>>>> +{
>>>>> ...
>>>>> +  tree ref = DR_REF (dr);
>>>>> +  tree scev_base = build_fold_addr_expr (ref);
>>>>> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
>>>>> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
>>>>> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
>>>>> +
>>>>> +  do {
>>>>> +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
>>>>> +    if (chrec_contains_undetermined (scev_fn)
>>>>> +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
>>>>> +      break;
>>>>> ...
>>>>> +    strides->safe_push (scev_step);
>>>>> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
>>>>> +
>>>>>
>>>>> I _think_ you want to do
>>>>>
>>>>>    scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
>>>>> DR_STMT (dr) is in loop
>>>>>    scev_fn = instantiate_parameters (nest, scev_fn);
>>>>>    if (chrec_contains_undetermined (scev_fn))
>>>>>      return; // false?
>>>>>
>>>>> and analyze the result which should be of the form
>>>>>
>>>>>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>>>>>
>>>>> if canonical.  I think estimate_val_by_simplify_replace isn't needed
>>>>> if you do that
>>>>> (it also looks odd to replace all vairables in step by niter...).
>>>> I replied on this in previous message, instantiate_parameters doesn't 
>>>> always
>>>> give canonical form result as expected.  The loop here could be seen as a
>>>> local instantiate process, right?
>>>
>>> Kind of.  I'll see if I can reproduce the difference with any of your
>>> intercahnge
>>> testcases - any hint which one to look at?
>> For added tests, I think there will be no difference between the two.
>> I noticed the difference for
>> pointer cases like:
>> for (i...)
>>   for (j...)
>>     for (k...)
>>        p[i*n*n + j*n+ k] =...
>>
>>>
>>>> Also estimate_val_by_simplify_replace is needed for pointers, where strides
>>>> are computed from niters of loops which could be non compilation time 
>>>> constant.
>>>> But yes, it's an odd fixup after I failed to do anything better.
>>>
>>> But you are for example computing _1 - _2 to zero, right?  Because both _1
>>> and _2 are not constant and thus you replace it with the same (symbolical)
>>> constant 'niter'.
>>>
>>> I think that asks for garbage-in-garbage-out ...
>>>
>>> Which testcase is this important for so I can have a look?
>> So far this is only for the above pointer case.  Actually I don't
>> think it's that important, and thought about skip it.
>> So we don't have to do estimate_val_by_simplify_replace.
>
> Ok, I'd like to "dumb" the pass down to the level we can solve the
> bwave case (which I realize is already one of the more complicated ones).
>
> Just because it's already late for GCC 8.


For reference I'll commit the following tomorrow, will play with adding
a testcase for bwaves and doing the multiple_of_p thing we talked about.

Richard.

> Richard.
>
>> Thanks,
>> bin
>>>
>>>>>
>>>>> I think keeping the chrec in the above form is also more suitable for what
>>>>> the caller does so the outermost loop is simply
>>>>>
>>>>>   loop = loop_nest;
>>>>>   loop-over-all-dr-chrecs
>>>>>     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>>>>>       loop = CHREC_LOOP (chrec);
>>>>>
>>>>> given the outermost loop is the outer evolution.  If you sort the
>>>>> stride vecs from inner
>>>>> to outer you don't need prune_access_strides_not_in_loop.
>>>> Hmm, So stripping outer loops prefer inner to outer sort of strides, but 
>>>> cost computation
>>>> during interchange prefers outer to inner sort because loop_nest in 
>>>> tree-data-ref is sorted
>>>> in this way.  Seems a single prune_* function is better than fiddling with 
>>>> cost computation.
>>>
>>> Not sure how to interpret your answer...  I'll see to have a more
>>> detailed suggestion
>>> after playing with the code a bit.
>>>
>>>>>
>>>>> +/* Count and return the number of loops in LOOP_NEST.  */
>>>>> +
>>>>> +unsigned int
>>>>> +num_loops_in_loop_nest (struct loop *loop_nest)
>>>>> +{
>>>>> +  unsigned num_loops;
>>>>> +  for (num_loops = 0; loop_nest; num_loops++, loop_nest = 
>>>>> loop_nest->inner)
>>>>> +    ;
>>>>> +  return num_loops;
>>>>>
>>>>> loop_depth (innermost) - loop_depth (nest)?
>>>> Done.
>>>>
>>>>>
>>>>> +static bool
>>>>> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
>>>>> +                         vec<data_reference_p> datarefs,
>>>>> +                         bool innermost_loops_p, bool dump_info_p = true)
>>>>> +{
>>>>>
>>>>> isn't all we need associating the above CHREC to sort after the 
>>>>> CHREC_RIGHTs
>>>>> and figure a permutation sequence to arrive there?  That is for the local
>>>>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
>>>>> when we should interchange?
>>>>>
>>>>> That subloop_stride_p and tracking invariant DRs looks a bit odd.  For 
>>>>> loops
>>>>> where a DR is invariant you simply do not have an evolution in that loop.
>>>>> You seem to simply add strides in the inner and outer loops for each DR,
>>>>> can you explain how that works?  Also I guess strides bigger than the
>>>>> various cache-line size parameters should be treated 'equal'?  That is,
>>>>> if we don't get any DR to a step that results in L1 cache hits because
>>>>> two DRs share a cache line the interchange is pointless?
>>>> So given below loop:
>>>>
>>>>   for (int i = 0; i < M; i++)
>>>>     for (int j = 0; j < M; j++)
>>>>       for (int k = 0; k < M; k++)
>>>>         a[k][0][i] = b[k][0][i]
>>>>
>>>> We check if memory reference is invariant wrto a loop only if it has zero 
>>>> strides within
>>>> current loop nest.  In this example, there is no invariant given address 
>>>> changes in the
>>>> innermost loop.
>>>
>>> But they simply wouldn't take part in the sorting?  That is, invariant
>>> refs in a loop
>>> shouldn't prevent it becoming more inner or more outer, no?
>>>
>>>> For strides bigger than cache-line size, it's also possible the 
>>>> interchange is wanted, as
>>>> in below example:
>>>>
>>>>   for (int i = 0; i < M; i++)          //loop 1
>>>>     for (int j = 0; j < M; j++)        //loop 2
>>>>       for (int k = 0; k < M; k++)    //loop 3
>>>>         a[j][i][k] = b[j][i][k]
>>>>
>>>> Strides for loop 1/2 are very like to be big, but after interchange, we 
>>>> will have stream
>>>> access of both arrays.
>>>>
>>>> More advanced heuristics may be possible, but so far the estimation is 
>>>> quite good by
>>>> checking all interchanges I looked into.
>>>>
>>>>>
>>>>> +/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
>>>>> +
>>>>> +static inline void
>>>>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> 
>>>>> datarefs)
>>>>> +{
>>>>> +  struct data_reference *dr;
>>>>> +
>>>>> +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
>>>>> +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
>>>>> +      i++;
>>>>> +    else
>>>>> +      {
>>>>> +       datarefs.ordered_remove (i);
>>>>>
>>>>> that's expensive.  It's better to keep moving DRs we want to keep
>>>>> when walking the array.  That is, add a j you increment only when
>>>>> we keep a DR, moving *i to *j.
>>>> Done.
>>>>
>>>>>
>>>>> +       if (dr->aux)
>>>>> +         {
>>>>> +           DR_ACCESS_STRIDE (dr)->release ();
>>>>> +           free (dr->aux);
>>>>> +         }
>>>>> +       free_data_ref (dr);
>>>>> +      }
>>>>> +}
>>>>>
>>>>> +
>>>>> +  start_loop = prune_non_rectangle_loop_nest (innermost_loop, 
>>>>> start_loop);
>>>>> +
>>>>>
>>>>> Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
>>>>> wrt the nest you can figure if it has any evolution in sth else than the
>>>>> loop (evolution_function_is_univariate_p).   That is, this is not a 
>>>>> problem
>>>>> until you arrive at analyzing DR strides, right?  At which point you
>>>>> can check for the appropriate form?
>>>> Hmm, not really.  The niter relation may not appear in SCEV of reference 
>>>> addr.
>>>> For example, below loop:
>>>>
>>>>   for (int i = 0; i < M; i++)          //loop 1
>>>>     for (int j = 0; j < M; j++)        //loop 2
>>>>       for (int k = 0; k < i; k++)    //loop 3
>>>>         a[k][0][i] = b[k][0][i]
>>>>
>>>> There is no information in data reference about i/j loops.
>>>> Anyway, I refactored the code and put this check in 
>>>> proper_loop_form_for_interchange.
>>>> Simpler I think.
>>>>
>>>>>
>>>>> +  if (find_data_references_in_loop (start_loop, datarefs) == 
>>>>> chrec_dont_know
>>>>> +      /* Check if there is no data reference.  */
>>>>> +      || datarefs->length () == 0
>>>>> +      /* Check if there are too many of data references.  */
>>>>> +      || ((int) datarefs->length ()
>>>>> +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
>>>>> +      /* Check if there is any data reference in loop latch.  We can't 
>>>>> handle
>>>>> +        loops which loop header and data references have different 
>>>>> execution
>>>>> +        times.  */
>>>>> +      || dataref_niters_diff_to_loop_header (*datarefs)
>>>>>
>>>>> this suggests to do your own find_data_references_in_loop so you can early
>>>>> out.
>>>> I refactored the code a bit.  Now this check is in 
>>>> proper_loop_form_for_interchange,
>>>> but I do customized my own data references finder.  It's needed to strip 
>>>> outer loops
>>>> once a difficult reference is found.
>>>>
>>>>>
>>>>> Overall the flow through the pass is a bit hard to follow given there are
>>>>> IMHO too many functions.
>>>> Yeah, I removed quite number of small functions and refactor the code a 
>>>> lot.  Hope this
>>>> version is more straightforward.
>>>>>
>>>>> +unsigned int
>>>>> +pass_linterchange::execute (function *fun)
>>>>> +{
>>>>> +  if (number_of_loops (fun) <= 2)
>>>>> +    return 0;
>>>>> +
>>>>> +  bool changed_p = false;;
>>>>> +  struct loop *loop;
>>>>> +  vec<loop_p> loop_nest;
>>>>> +  vec<data_reference_p> datarefs;
>>>>> +  vec<ddr_p> ddrs;
>>>>> +
>>>>> +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>>>>> +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>>>>> +      {
>>>>> +       tree_loop_interchange loop_interchange (loop_nest, datarefs, 
>>>>> ddrs);
>>>>> +       changed_p |= loop_interchange.interchange ();
>>>>> +      }
>>>>>
>>>>> you leak datarefs/ddrs?
>>>> It was release in destructor, but I refactored it anyway.  I will push the 
>>>> code to branch
>>>> gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
>>>>
>>>> Thanks again for the comment of you two.
>>>
>>> Digging into the code now...
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-11-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>         * Makefile.in (gimple-loop-interchange.o): New object file.
>>>>         * common.opt (floop-interchange): Reuse the option from graphite.
>>>>         * doc/invoke.texi (-floop-interchange): Ditto.  New document.
>>>>         * gimple-loop-interchange.cc: New file.
>>>>         * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>>>>         (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>>>>         * passes.def (pass_linterchange): New pass.
>>>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
>>>>         interchange.  Record IV before/after increment in new parameters.
>>>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>>
>>>> gcc/testsuite
>>>> 2017-11-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-11.c: New test.

interchange
Description: Binary data

Re: [PATCH GCC][V2]A simple implementation of loop interchange

Reply via email to