On Tue, Nov 28, 2017 at 4:00 PM, David Malcolm <dmalc...@redhat.com> wrote: > On Tue, 2017-11-28 at 15:26 +0000, Bin Cheng wrote: >> Hi, >> This is updated patch with review comments resolved. Some >> explanation embedded below. >> >> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gma >> il.com> wrote: >> > On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com> >> > wrote: >> > > On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> >> > > wrote: >> > > > Hello, >> > > > >> > > > On Fri, 22 Sep 2017, Bin.Cheng wrote: >> > > > >> > > > > This is updated patch for loop interchange with review >> > > > > suggestions >> > > > > resolved. Changes are: >> > > > > 1) It does more light weight checks like rectangle loop >> > > > > nest check >> > > > > earlier than before. >> > > > > 2) It checks profitability of interchange before data >> > > > > dependence computation. >> > > > > 3) It calls find_data_references_in_loop only once for a >> > > > > loop nest now. >> > > > > 4) Data dependence is open-computed so that we can skip >> > > > > instantly at >> > > > > unknown dependence. >> > > > > 5) It improves code generation in mapping induction >> > > > > variables for >> > > > > loop nest, as well as >> > > > > adding a simple dead code elimination pass. >> > > > > 6) It changes magic constants into parameters. >> > > > >> > > > So I have a couple comments/questions. Something stylistic: >> > > >> > > Hi Michael, >> > > Thanks for reviewing. >> > > >> > > > >> > > > > +class loop_cand >> > > > > +{ >> > > > > +public: >> > > > > ... >> > > > > + friend class tree_loop_interchange; >> > > > > +private: >> > > > >> > > > Just make this all public (and hence a struct, not class). >> > > > No need for friends in file local classes. >> > > >> > > Done. >> > > >> > > > >> > > > > +single_use_in_loop (tree var, struct loop *loop) >> > > > > ... >> > > > > + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) >> > > > > + { >> > > > > + stmt = USE_STMT (use_p); >> > > > > ... >> > > > > + basic_block bb = gimple_bb (stmt); >> > > > > + gcc_assert (bb != NULL); >> > > > >> > > > This pattern reoccurs often in your patch: you check for a bb >> > > > associated >> > > > for a USE_STMT. Uses of SSA names always occur in basic >> > > > blocks, no need >> > > > for checking. >> > > >> > > Done. >> > > >> > > > >> > > > Then, something about your handling of simple reductions: >> > > > >> > > > > +void >> > > > > +loop_cand::classify_simple_reduction (reduction_p re) >> > > > > +{ >> > > > > ... >> > > > > + /* Require memory references in producer and consumer are >> > > > > the same so >> > > > > + that we can undo reduction during interchange. */ >> > > > > + if (re->init_ref && !operand_equal_p (re->init_ref, re- >> > > > > >fini_ref, 0)) >> > > > > + return; >> > > > >> > > > Where is it checked that the undoing transformation is legal >> > > > also >> > > > from a data dep point of view? Think code like this: >> > > > >> > > > sum = X[i]; >> > > > for (j ...) >> > > > sum += X[j]; >> > > > X[i] = sum; >> > > > >> > > > Moving the store into the inner loop isn't always correct and I >> > > > don't seem >> > > > to find where the above situation is rejected. >> > > >> > > Yeah. for the old patch, it's possible to have such loop wrongly >> > > interchanged; >> > > in practice, it's hard to create an example. The pass will give >> > > up >> > > when computing >> > > data dep between references in inner/outer loops. In this >> > > updated >> > > patch, it's fixed >> > > by giving up if there is any dependence between references of >> > > inner/outer loops. >> > > >> > > > >> > > > Maybe I'm confused because I also don't see where you even can >> > > > get into >> > > > the above situation (though I do see testcases about >> > > > this). The thing is, >> > > > for an 2d loop nest to contain something like the above >> > > > reduction it can't >> > > > be perfect: >> > > > >> > > > for (j) { >> > > > int sum = X[j]; // 1 >> > > > for (i) >> > > > sum += Y[j][i]; >> > > > X[j] = sum; // 2 >> > > > } >> > > > >> > > > But you do check for perfectness in >> > > > proper_loop_form_for_interchange and >> > > > prepare_perfect_loop_nest, so either you can't get into the >> > > > situation or >> > > > the checking can't be complete, or you define the above to be >> > > > perfect >> > > > nevertheless (probably because the load and store are in outer >> > > > loop >> > > > header/exit blocks?). The latter would mean that you accept >> > > > also other >> > > > code in header/footer of loops from a pure CFG perspective, so >> > > > where is it >> > > > checked that that other code (which aren't simple reductions) >> > > > isn't >> > > > harmful to the transformation? >> > > >> > > Yes, I used the name perfect loop nest, but the pass can handle >> > > special form >> > > imperfect loop nest for the simple reduction. I added comments >> > > describing >> > > this before function prepare_perfect_loop_nest. >> > > >> > > > >> > > > Then, the data dependence part of the new pass: >> > > > >> > > > > +bool >> > > > > +tree_loop_interchange::valid_data_dependences (unsigned >> > > > > inner, unsigned outer) >> > > > > +{ >> > > > > + struct data_dependence_relation *ddr; >> > > > > + >> > > > > + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >> > > > > + { >> > > > > + /* Skip no-dependence case. */ >> > > > > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >> > > > > + continue; >> > > > > + >> > > > > + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >> > > > > + { >> > > > > + lambda_vector dist_vect = DDR_DIST_VECT (ddr, j); >> > > > > + unsigned level = dependence_level (dist_vect, >> > > > > loop_nest.length ()); >> > > > > + >> > > > > + /* If there is no carried dependence. */ >> > > > > + if (level == 0) >> > > > > + continue; >> > > > > + >> > > > > + level --; >> > > > > + /* Skip case which has '>' as the leftmost >> > > > > direction. */ >> > > > > + if (!lambda_vector_lexico_pos (dist_vect, level)) >> > > > > + return false; >> > > > >> > > > Shouldn't happen as dist vectors are forced positive via >> > > > DDR_REVERSED. >> > > >> > > Done. >> > > >> > > > >> > > > > + /* If dependence is carried by outer loop of the two >> > > > > loops for >> > > > > + interchange. */ >> > > > > + if (level < outer) >> > > > > + continue; >> > > > > + >> > > > > + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >> > > > > + /* If directions at both inner/outer levels are the >> > > > > same. */ >> > > > > + if (dir_vect[inner] == dir_vect[outer]) >> > > > > + continue; >> > > > > + >> > > > > + /* Be conservative, skip case if either direction at >> > > > > inner/outer >> > > > > + levels is not '=' or '<'. */ >> > > > > + if (dir_vect[inner] != dir_equal >> > > > > + && dir_vect[inner] != dir_positive >> > > > > + && dir_vect[inner] != dir_independent >> > > > > + && dir_vect[inner] != dir_positive_or_equal) >> > > > > + return false; >> > > > > + >> > > > > + if (dir_vect[outer] != dir_equal >> > > > > + && dir_vect[outer] != dir_positive >> > > > > + && dir_vect[outer] != dir_independent >> > > > > + && dir_vect[outer] != dir_positive_or_equal) >> > > > > + return false; >> > > > >> > > > Checking dir vectors doesn't make much sense in GCC: the >> > > > elements are only >> > > > ever set to dir_positive, dir_negative or dir_equal, exactly >> > > > when distance >> > > > is >> > > > > 0, < 0 or == 0. So checking dist vector is enough. (though >> > > > sameness of >> > > > direction checks sameness of sign with zero). Incidentally: >> > > >> > > Done. >> > > >> > > > >> > > > > +tree_loop_interchange::update_data_deps (unsigned inner, >> > > > > unsigned outer) >> > > > > +{ >> > > > > + struct data_dependence_relation *ddr; >> > > > > + >> > > > > + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i) >> > > > > + { >> > > > > + /* Skip no-dependence case. */ >> > > > > + if (DDR_ARE_DEPENDENT (ddr) == chrec_known) >> > > > > + continue; >> > > > > + >> > > > > + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j) >> > > > > + { >> > > > > + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j); >> > > > > + std::swap (dir_vect[inner], dir_vect[outer]); >> > > > > + } >> > > > > + } >> > > > > +} >> > > > >> > > > Here you swap only the direction but not the distance vector, >> > > > which can't >> > > > be right. I suggest only using (and updating) the distance >> > > > vector. >> > > >> > > Yeah, fixed. >> > > >> > > > >> > > > And then your usage and update of DR_ACCESS_FNs: there's quite >> > > > some >> > > > complexity connected with that and I'm not sure how worthwhile >> > > > it is. >> > > > You're basically using the ACCESS_FNs to determine >> > > > profitability (and not >> > > > for validity, and that's good). But e.g. for pointer based >> > > > accesses like >> > > > in fortran with explicit address arithmetic the relation >> > > > between access-fn >> > > > step and stride and actual access stride isn't that easy (e.g. >> > > > in your >> > > > should_interchange_loops function iloop_stride and oloop_stride >> > > > will >> > > > always be one for pointer based accesses). >> > > > >> > > > Conceptually what you should check is how the access address >> > > > for each data >> > > > ref revolves for each loop, so why not doing this >> > > > explicitely? What I >> > > > mean is: calculate a (complicated) chrec for the DR addresses >> > > > for the >> > > > whole nest at the beginning. It should be in the form like >> > > > (assume "+" >> > > > always): >> > > > >> > > > {{{init, s1}_l1, s2}_l2, s3}_l3 >> > > > >> > > > (i.e. all steps should be invariants/constants, and only one >> > > > non-chrec >> > > > init value). Addresses which aren't in this form you're >> > > > already ignoring >> > > > right now, so you could continue doing that. (Or better said, >> > > > all >> > > > non-constant steps you regard as being AVG_DIM_SIZE, which you >> > > > still can >> > > > continue doing). >> > > > >> > > > Now, with the above form you can form expressions for the >> > > > difference >> > > > between addresses per iteration for each loop (i.e. the address >> > > > stride per >> > > > loop); store these. Then, when interchanging loops you need to >> > > > merely >> > > > swap these expressions like you have to with the distance >> > > > vector, instead >> > > > of fiddling inside the DR_ACCESS_FNs themself. Much code would >> > > > go away. >> > > >> > > Yeah. Did similar thing in loop nest distribution pass. See >> > > compute_access_range >> > > in tree-loop-distribution.c. Actually, I would do the same here >> > > if I >> > > had implemented >> > > this pass after loop nest distribution patches. Done in this >> > > updated patch. >> > > >> > > > >> > > > Testcases: given that we had to remove our old separate >> > > > interchange pass >> > > > because it miscompiled stuff all over I'm missing some >> > > > testcases where >> > > > interchange should _not_ happen for validity reasons, like my >> > > > above >> > > > example with an reduction that can't be moved inside. Perhaps >> > > > you can >> > > > think of some more. >> > > >> > > As mentioned above, it's hard to create test that fail exactly >> > > for this reason. >> > > I added one that data dependence prevents us from interchanging >> > > the loop. >> > > >> > > > >> > > > I hope this is of some help to you :) >> > > >> > > Thanks again, it's very helpful. >> > > >> > > I also fixed several bugs of previous implementation, mostly >> > > about debug info >> > > statements and simple reductions. As for test, I enabled this >> > > pass by default, >> > > bootstrap and regtest GCC, I also build/run specs. There must be >> > > some other >> > > latent bugs in it, but guess we have to exercise it by enabling >> > > it at >> > > some point. >> > > >> > > So any comments? >> > >> > bool >> > -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently) >> > +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool >> > insert_dbg) >> > { >> > >> > that you need this suggests you do stmt removal in wrong order (you >> > need to >> > do reverse dom order). >> >> As below code in handling debug uses, this updated patch gives up on >> more cases >> by scrapping debug uses now. Hopefully this isn't a problem, >> debugging experience >> for interchange loops is bad already? >> > >> > +/* Maximum number of statements in loop nest for loop >> > interchange. */ >> > + >> > +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, >> > + "loop-interchange-max-num-stmts", >> > + "The maximum number of stmts in loop nest for loop >> > interchange.", >> > + 64, 0, 0) >> > >> > is that to limit dependence computation? In this case you should >> > probably >> > limit the number of data references instead? >> >> Hmm, I kept this one and it is to limit the size of loops for >> interchange. >> >> > >> > +ftree-loop-interchange >> > +Common Report Var(flag_tree_loop_interchange) Optimization >> > +Enable loop interchange on trees. >> > + >> > >> > please re-use -floop-interchange instead and change the GRAPHITE >> > tests >> > to use -floop-nest-optimize. You can do that as pre-approved thing >> > now. >> >> Done. I will send an independent patch adjusting GRAPHITE tests. >> >> > >> > Please enable the pass by default at O3 via opts.c. >> >> I will do it in a separated patch because many vectorization tests >> are vulnerable >> to interchange. I checked these tests, interchange is good, we need >> to disable >> explicitly. >> >> > >> > diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop- >> > interchange.cc >> > >> > gimple-loop-interchange.cc please. >> > >> > new file mode 100644 >> > index 0000000..abffbf6 >> > --- /dev/null >> > +++ b/gcc/tree-ssa-loop-interchange.cc >> > @@ -0,0 +1,2274 @@ >> > +/* Loop invariant motion. >> > + Copyright (C) 2017 Free Software Foundation, Inc. >> > >> > Loop invariant motion? ... ;) >> > >> > Please add a "Contributed by ..." to have an easy way to figure >> > people to blame. >> > >> > +}*induction_p; >> > + >> > >> > space after '*' >> > >> > +}*reduction_p; >> > + >> > >> > likewise. >> >> All done. >> >> > >> > +/* Return true if PHI is unsupported in loop interchange, i.e, PHI >> > contains >> > + ssa var appearing in any abnormal phi node. */ >> > + >> > +static inline bool >> > +unsupported_phi_node (gphi *phi) >> > +{ >> > + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi))) >> > + return true; >> > + >> > + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i) >> > + { >> > + tree arg = PHI_ARG_DEF (phi, i); >> > + if (TREE_CODE (arg) == SSA_NAME >> > + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg)) >> > + return true; >> > + } >> > + >> > + return false; >> > >> > I believe the above isn't necessary given you rule out abnormal >> > edges >> > into the loop. >> > Did you have a testcase that broke? A minor thing I guess if it is >> > just for extra >> > safety... >> >> Extra safety I guess. I now remove this and haven't run into >> compilation issues so far. >> >> > >> > +/* Return true if all stmts in BB can be supported by loop >> > interchange, >> > + otherwise return false. ILOOP is not NULL if this loop_cand is >> > the >> > + outer loop in loop nest. */ >> > + >> > +bool >> > +loop_cand::unsupported_operation (basic_block bb, loop_cand >> > *iloop) >> > +{ >> > >> > docs and return value suggest this be named supported_operation >> >> Done. >> >> > >> > + /* Or it's invariant memory reference and only used by inner >> > loop. */ >> > + if (gimple_assign_single_p (stmt) >> > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >> > + && TREE_CODE (lhs) == SSA_NAME >> > + && single_use_in_loop (lhs, iloop->loop)) >> > + continue; >> > >> > comment suggests multiple uses in loop would be ok? >> >> Comment changed. >> >> > >> > + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE >> > + || lhs != re->init) >> > + return; >> > + >> > + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE >> > + || !REFERENCE_CLASS_P (rhs)) >> > + return; >> > >> > lhs and rhs are never NULL. Please initialize them outside of the >> > if. >> > You want to disallow DECL_P rhs here? >> >> Done. >> >> > >> > Can you add an overall function comment what this function >> > does? It seems >> > to detect a reduction as produced by loop store-motion? Thus it >> > tries to >> > get at enough information to perform the reverse transform? >> >> Yes. Comment added. >> >> > >> > During review I have a hard time distinguishing between locals and >> > members >> > so can you please prefix all member variables with m_ as according >> > to our >> > code guidelines? I guess what adds to the confusion is the >> > loop_cand argument >> > that sometimes is present for loop_cand member functions... >> > (I personally prefer to prefix all member accesses with this-> but >> > that's harder >> > to enforce) >> >> Done by using m_* stuff. >> >> > >> > +static void >> > +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, >> > gimple *consumer) >> > +{ >> > >> > this extracts stmts starting from consumer but rules out other >> > consumers (recursively). >> > Is that intended? I wonder if you can avoid this dance... >> >> Okay, so in case the reduction is initialized from constant value, we >> need to generate new >> load memory reference when undoing it. The new memory reference may >> depends on idx >> variables like in ARRAY_REF. This function tries to find all >> depended stmts for inserting it >> I didn't look into how GRAPHITE tracks the idx variable and >> regenerate it when necessary, >> maybe we can do the same in the future. >> >> > >> > + /* Transform simple reduction of below form: >> > + >> > + init = 0; >> > + loop: >> > + var = phi<init, next> >> > + next = var op ... >> > + reduc_sum = phi<next> >> > + MEM_REF[...] = reduc_sum >> > + >> > + into: >> > + >> > >> > which one is 'consumer'? I wonder if you can simply leave all the >> > dead code in place >> > just emitting the load-update-store stmts and removing the >> > MEM_REF[...] = reduc_sum >> > above? >> >> Done. I simplified the code generation for the pass and uses >> prerequisite simple dce >> interface to remove dead code. Hope it's clearer now. >> >> > >> > +/* Eliminate dead code after loop interchange. */ >> > + >> > +void >> > +loop_cand::eliminate_dead_code (void) >> > +{ >> > >> > PRE tracks "possible" dead defs and has a worklist algorithm in >> > remove_dead_inserted_code. >> > I wonder if you can do sth similar? That is, I wonder if doing a >> > sweep from last to first >> > stmt wouldn't be better here? >> > >> > + /* Given copy propagation is done during interchange, we >> > can >> > + simply check zero uses of var and eliminate it. */ >> > + if (is_gimple_assign (stmt) >> > + && !gimple_vuse (stmt) >> > >> > you probably meant to check gimple_vdef? >> > >> > + && !gimple_has_volatile_ops (stmt) >> > + && !gimple_has_side_effects (stmt) >> > >> > the former is redundant >> > >> > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE >> > + && TREE_CODE (lhs) == SSA_NAME >> > + && has_zero_uses (lhs)) >> > >> > if you use gimple_get_lhs () you can also handle calls. >> > >> > That said, this seems to be a very poor DCE, why is it necessary at >> > all? >> >> Local DCE removed by reusing new interface simple_dce_from_worklist. >> But the DCE is necessary giving dead code lasts to vectorizer. At >> least we >> can save compilation time. >> >> > >> > +/* Interchange niters info of ILOOP and OLOOP while reset any >> > other niters >> > + estimates information for now. */ >> > + >> > +static inline void >> > +interchange_nb_iterations (struct loop *iloop, struct loop *oloop) >> > +{ >> > + tree nb_iterations = oloop->nb_iterations; >> > + >> > + oloop->any_upper_bound = false; >> > + oloop->any_likely_upper_bound = false; >> > + free_numbers_of_iterations_estimates (oloop); >> > + >> > + oloop->nb_iterations = iloop->nb_iterations; >> > + >> > + iloop->any_upper_bound = false; >> > + iloop->any_likely_upper_bound = false; >> > + free_numbers_of_iterations_estimates (iloop); >> > + >> > + iloop->nb_iterations = nb_iterations; >> > >> > use std::swap? Also I think if you can keep nb_iterations you >> > can certainly keep the upper bounds. You're probably >> > afraid of the ->stmt references in the nb_iter_bound entries? >> > >> > Anyway, either scrap everything or try to keep everything. >> >> Yeah, not only the stmts, but also the control_iv information because >> the SCEV >> information may be corrupted during code transformation. >> Now I discarded all the information. >> >> > >> > + for (i = 0; oloop.reductions.iterate (i, &re); ++i) >> > + { >> > + if (re->type != DOUBLE_RTYPE) >> > + gcc_unreachable (); >> > + >> > + use_operand_p use_p; >> > + imm_use_iterator iterator; >> > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) >> > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); >> > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) >> > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); >> > + if (TREE_CODE (re->init) == SSA_NAME) >> > + { >> > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) >> > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); >> > + } >> > >> > can you add a comment what you are doing here? >> > >> > Note that other loop opts simply scrap all debug stmts ... >> >> As mentioned above, updated patch doesn't try hard to maintain debug >> use info any more. >> >> > >> > +static void >> > +compute_access_stride (struct loop *loop_nest, struct loop *loop, >> > + data_reference_p dr) >> > +{ >> > ... >> > + tree ref = DR_REF (dr); >> > + tree scev_base = build_fold_addr_expr (ref); >> > + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); >> > + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); >> > + access_size = fold_build2 (MULT_EXPR, sizetype, niters, >> > access_size); >> > + >> > + do { >> > + tree scev_fn = analyze_scalar_evolution (loop, scev_base); >> > + if (chrec_contains_undetermined (scev_fn) >> > + || chrec_contains_symbols_defined_in_loop (scev_fn, loop- >> > >num)) >> > + break; >> > ... >> > + strides->safe_push (scev_step); >> > + } while (loop != loop_nest && (loop = loop_outer (loop)) != >> > NULL); >> > + >> > >> > I _think_ you want to do >> > >> > scev_fn = analyze_scalar_evolution (loop, scev_base); // >> > assuming >> > DR_STMT (dr) is in loop >> > scev_fn = instantiate_parameters (nest, scev_fn); >> > if (chrec_contains_undetermined (scev_fn)) >> > return; // false? >> > >> > and analyze the result which should be of the form >> > >> > { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... >> > >> > if canonical. I think estimate_val_by_simplify_replace isn't >> > needed >> > if you do that >> > (it also looks odd to replace all vairables in step by niter...). >> >> I replied on this in previous message, instantiate_parameters doesn't >> always >> give canonical form result as expected. The loop here could be seen >> as a >> local instantiate process, right? >> Also estimate_val_by_simplify_replace is needed for pointers, where >> strides >> are computed from niters of loops which could be non compilation time >> constant. >> But yes, it's an odd fixup after I failed to do anything better. >> >> > >> > I think keeping the chrec in the above form is also more suitable >> > for what >> > the caller does so the outermost loop is simply >> > >> > loop = loop_nest; >> > loop-over-all-dr-chrecs >> > if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) >> > loop = CHREC_LOOP (chrec); >> > >> > given the outermost loop is the outer evolution. If you sort the >> > stride vecs from inner >> > to outer you don't need prune_access_strides_not_in_loop. >> >> Hmm, So stripping outer loops prefer inner to outer sort of strides, >> but cost computation >> during interchange prefers outer to inner sort because loop_nest in >> tree-data-ref is sorted >> in this way. Seems a single prune_* function is better than fiddling >> with cost computation. >> >> > >> > +/* Count and return the number of loops in LOOP_NEST. */ >> > + >> > +unsigned int >> > +num_loops_in_loop_nest (struct loop *loop_nest) >> > +{ >> > + unsigned num_loops; >> > + for (num_loops = 0; loop_nest; num_loops++, loop_nest = >> > loop_nest->inner) >> > + ; >> > + return num_loops; >> > >> > loop_depth (innermost) - loop_depth (nest)? >> >> Done. >> >> > >> > +static bool >> > +should_interchange_loops (unsigned i_idx, unsigned o_idx, >> > + vec<data_reference_p> datarefs, >> > + bool innermost_loops_p, bool dump_info_p >> > = true) >> > +{ >> > >> > isn't all we need associating the above CHREC to sort after the >> > CHREC_RIGHTs >> > and figure a permutation sequence to arrive there? That is for the >> > local >> > decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT >> > [o_idx] >> > when we should interchange? >> > >> > That subloop_stride_p and tracking invariant DRs looks a bit >> > odd. For loops >> > where a DR is invariant you simply do not have an evolution in that >> > loop. >> > You seem to simply add strides in the inner and outer loops for >> > each DR, >> > can you explain how that works? Also I guess strides bigger than >> > the >> > various cache-line size parameters should be treated 'equal'? That >> > is, >> > if we don't get any DR to a step that results in L1 cache hits >> > because >> > two DRs share a cache line the interchange is pointless? >> >> So given below loop: >> >> for (int i = 0; i < M; i++) >> for (int j = 0; j < M; j++) >> for (int k = 0; k < M; k++) >> a[k][0][i] = b[k][0][i] >> >> We check if memory reference is invariant wrto a loop only if it has >> zero strides within >> current loop nest. In this example, there is no invariant given >> address changes in the >> innermost loop. >> For strides bigger than cache-line size, it's also possible the >> interchange is wanted, as >> in below example: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < M; k++) //loop 3 >> a[j][i][k] = b[j][i][k] >> >> Strides for loop 1/2 are very like to be big, but after interchange, >> we will have stream >> access of both arrays. >> >> More advanced heuristics may be possible, but so far the estimation >> is quite good by >> checking all interchanges I looked into. >> >> > >> > +/* Prune DATAREFS by removing any data reference not inside of >> > LOOP. */ >> > + >> > +static inline void >> > +prune_datarefs_not_in_loop (struct loop *loop, >> > vec<data_reference_p> datarefs) >> > +{ >> > + struct data_reference *dr; >> > + >> > + for (unsigned i = 0; datarefs.iterate (i, &dr);) >> > + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) >> > + i++; >> > + else >> > + { >> > + datarefs.ordered_remove (i); >> > >> > that's expensive. It's better to keep moving DRs we want to keep >> > when walking the array. That is, add a j you increment only when >> > we keep a DR, moving *i to *j. >> >> Done. >> >> > >> > + if (dr->aux) >> > + { >> > + DR_ACCESS_STRIDE (dr)->release (); >> > + free (dr->aux); >> > + } >> > + free_data_ref (dr); >> > + } >> > +} >> > >> > + >> > + start_loop = prune_non_rectangle_loop_nest (innermost_loop, >> > start_loop); >> > + >> > >> > Hmm. If you instantiate the SCEV for the niters for each loop in >> > the nest >> > wrt the nest you can figure if it has any evolution in sth else >> > than the >> > loop (evolution_function_is_univariate_p). That is, this is not a >> > problem >> > until you arrive at analyzing DR strides, right? At which point >> > you >> > can check for the appropriate form? >> >> Hmm, not really. The niter relation may not appear in SCEV of >> reference addr. >> For example, below loop: >> >> for (int i = 0; i < M; i++) //loop 1 >> for (int j = 0; j < M; j++) //loop 2 >> for (int k = 0; k < i; k++) //loop 3 >> a[k][0][i] = b[k][0][i] >> >> There is no information in data reference about i/j loops. >> Anyway, I refactored the code and put this check in >> proper_loop_form_for_interchange. >> Simpler I think. >> >> > >> > + if (find_data_references_in_loop (start_loop, datarefs) == >> > chrec_dont_know >> > + /* Check if there is no data reference. */ >> > + || datarefs->length () == 0 >> > + /* Check if there are too many of data references. */ >> > + || ((int) datarefs->length () >> > + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) >> > + /* Check if there is any data reference in loop latch. We >> > can't handle >> > + loops which loop header and data references have different >> > execution >> > + times. */ >> > + || dataref_niters_diff_to_loop_header (*datarefs) >> > >> > this suggests to do your own find_data_references_in_loop so you >> > can early >> > out. >> >> I refactored the code a bit. Now this check is in >> proper_loop_form_for_interchange, >> but I do customized my own data references finder. It's needed to >> strip outer loops >> once a difficult reference is found. >> >> > >> > Overall the flow through the pass is a bit hard to follow given >> > there are >> > IMHO too many functions. >> >> Yeah, I removed quite number of small functions and refactor the code >> a lot. Hope this >> version is more straightforward. >> > >> > +unsigned int >> > +pass_linterchange::execute (function *fun) >> > +{ >> > + if (number_of_loops (fun) <= 2) >> > + return 0; >> > + >> > + bool changed_p = false;; >> > + struct loop *loop; >> > + vec<loop_p> loop_nest; >> > + vec<data_reference_p> datarefs; >> > + vec<ddr_p> ddrs; >> > + >> > + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) >> > + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, >> > &ddrs)) >> > + { >> > + tree_loop_interchange loop_interchange (loop_nest, >> > datarefs, ddrs); >> > + changed_p |= loop_interchange.interchange (); >> > + } >> > >> > you leak datarefs/ddrs? >> >> It was release in destructor, but I refactored it anyway. I will >> push the code to branch >> gcc.gnu.org/svn/gcc/branches/gimple-linterchange. >> >> Thanks again for the comment of you two. >> >> Thanks, >> bin >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * Makefile.in (gimple-loop-interchange.o): New object file. >> * common.opt (floop-interchange): Reuse the option from >> graphite. >> * doc/invoke.texi (-floop-interchange): Ditto. New document. >> * gimple-loop-interchange.cc: New file. >> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New >> parameter. >> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >> * passes.def (pass_linterchange): New pass. >> * timevar.def (TV_LINTERCHANGE): New time var. >> * tree-pass.h (make_pass_linterchange): New declaration. >> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to >> external >> interchange. Record IV before/after increment in new >> parameters. >> * tree-ssa-loop-ivopts.h (create_canonical_iv): New >> declaration. >> >> gcc/testsuite >> 2017-11-27 Bin Cheng <bin.ch...@arm.com> >> >> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-11.c: New test. > > [...] >> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc >> new file mode 100644 >> index 0000000..9a65b28 >> --- /dev/null >> +++ b/gcc/gimple-loop-interchange.cc > > [...] > >> +/* Return the reduction if STMT is one of its lcssa PHI, producer or >> consumer >> + stmt. */ >> + >> +reduction_p >> +loop_cand::find_reduction_by_stmt (gimple *stmt) >> +{ >> + gphi *phi = NULL; >> + reduction_p re; >> + >> + if (is_a <gphi *> (stmt)) >> + phi = as_a <gphi *> (stmt); > > I happened to notice a few places in the patch where you use is_a<> > followed by as_a<>. > > Sorry if you covered this in an earlier round of review. > > I believe the code could be slightly simplified by using is-a.h's > dyn_cast<> here, giving something like: > > gphi *phi = dyn_cast <gphi *> (stmt); > > [...] > >> +/* Analyze reduction variable VAR for inner loop of the loop nest to be >> + interchanged. Return true if analysis succeeds. */ >> + >> +bool >> +loop_cand::analyze_iloop_reduction_var (tree var) >> +{ > > [...] > >> + >> + /* Or else it's used in PHI itself. */ >> + use_phi = NULL; >> + if (is_a <gphi *> (stmt) >> + && (use_phi = as_a <gphi *> (stmt)) != NULL >> + && use_phi == phi) >> + continue; > > Similarly here; I belive this could be simplified to: > > /* Or else it's used in PHI itself. */ > use_phi = dyn_cast <gphin *> (stmt); > if (use_phi != NULL > && use_phi == phi) > continue; > > and given that (I think) phi is non-NULL, this could be: > > /* Or else it's used in PHI itself. */ > use_phi = dyn_cast <gphi *> (stmt); > if (use_phi == phi) > continue; > > for 3 lines rather than 5 lines. > > [...] > >> +bool >> +loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var) >> +{ > > [...] > >> + /* Outer loop's reduction should only be used to initialize inner loop's >> + simple reduction. */ >> + FOR_EACH_IMM_USE_FAST (use_p, iterator, var) >> + { >> + stmt = USE_STMT (use_p); >> + if (is_gimple_debug (stmt)) >> + continue; >> + >> + if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt))) >> + return false; >> + >> + if (! is_a <gphi *> (stmt) >> + || (use_phi = as_a <gphi *> (stmt)) == NULL >> + || use_phi != inner_re->phi) >> + return false; >> + } > > Another one here. The: > use_phi = as_a <gphi *> (stmt)) == NULL > looks strange to me: the "is_a <gphi *>" tests for stmt->code, so stmt can't > be NULL, and hence use_phi can't be NULL. > > You don't seem to use the "use_phi" after each FOR_EACH_IMM_USE_FAST loop, > so this could in theory be simplified to: > > if (stmt != inner_re->phi) > return false; > > Though maybe I'm missing something; presumably that logic was there > for a reason. > >> + >> + /* Check this reduction is correctly used outside of loop via lcssa phi. >> */ >> + FOR_EACH_IMM_USE_FAST (use_p, iterator, next) >> + { >> + stmt = USE_STMT (use_p); >> + if (is_gimple_debug (stmt)) >> + continue; >> + >> + /* Or else it's used in PHI itself. */ >> + use_phi = NULL; >> + if (is_a <gphi *> (stmt) >> + && (use_phi = as_a <gphi *> (stmt)) != NULL >> + && use_phi == phi) >> + continue; >> + if (lcssa_phi == NULL >> + && use_phi != NULL >> + && gimple_bb (stmt) == e->dest >> + && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next) >> + lcssa_phi = use_phi; >> + else >> + return false; >> + } > > ...and here, which I believe could become: > > use_phi = dyn_cast <gphi *> (stmt); > if (use_phi == phi) > continue; > if (lcssa_phi == NULL > [..etc..] > > [...] > > Hope this is constructive Yes, of course. I will fix these in the next update.
Thanks, bin > Dave