On Fri, 6 Dec 2013, Jakub Jelinek wrote:

> On Fri, Dec 06, 2013 at 01:49:50PM +0100, Richard Biener wrote:
> > Comments inline (scary large this patch for this stage ...)
>
> Thanks.
>
> > > +(define_expand "maskload<mode>"
> > > +  [(set (match_operand:V48_AVX2 0 "register_operand")
> > > +     (unspec:V48_AVX2
> > > +       [(match_operand:<sseintvecmode> 2 "register_operand")
> > > +        (match_operand:V48_AVX2 1 "memory_operand")]
> > > +       UNSPEC_MASKMOV))]
> > > +  "TARGET_AVX")
> > > +
> > > +(define_expand "maskstore<mode>"
> > > +  [(set (match_operand:V48_AVX2 0 "memory_operand")
> > > +     (unspec:V48_AVX2
> > > +       [(match_operand:<sseintvecmode> 2 "register_operand")
> > > +        (match_operand:V48_AVX2 1 "register_operand")
> > > +        (match_dup 0)]
> > > +       UNSPEC_MASKMOV))]
> > > +  "TARGET_AVX")
> > > +
> > >  (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
> > >    [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
> > >       (unspec:AVX256MODE2P
> >
> > x86 maintainers should comment here (ick - unspecs)
>
> Well, the unspecs are preexisting (right now used by intrinsics only), I'm
> just adding expanders that will expand to those instructions.
>
> > > @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
> > >      {
> > >        unsigned i, n;
> > >
> > > -      op0 = gimple_call_lhs_ptr (stmt);
> > > +      ref.is_read = false;
> > > +      if (gimple_call_internal_p (stmt))
> > > +        switch (gimple_call_internal_fn (stmt))
> > > +          {
> > > +          case IFN_MASK_LOAD:
> > > +            ref.is_read = true;
> > > +          case IFN_MASK_STORE:
> > > +            ref.ref = build2 (MEM_REF,
> > > +                              ref.is_read
> > > +                              ? TREE_TYPE (gimple_call_lhs (stmt))
> > > +                              : TREE_TYPE (gimple_call_arg (stmt, 3)),
> > > +                              gimple_call_arg (stmt, 0),
> > > +                              gimple_call_arg (stmt, 1));
> > > +            references->safe_push (ref);
> >
> > This may not be a canonical MEM_REF AFAIK, so you should
> > use fold_build2 here (if the address is &a.b the .b needs folding
>
> Ok, will try that.
>
> > into the offset).  I assume the 2nd arg is always constant and
> > thus doesn't change pointer-type during propagations?
>
> Yes, it is, always created by
> ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> (and for vectorized IFN_MASK_* copied over from the non-vectorized
> IFN_MASK_* call).
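(A minimal sketch of the hunk above with the fold_build2 suggestion applied,
so that an ADDR_EXPR base like &a.b gets folded into the MEM_REF offset;
illustrative only, not necessarily what the updated patch will look like:)

    case IFN_MASK_LOAD:
      ref.is_read = true;
      /* FALLTHRU */
    case IFN_MASK_STORE:
      /* fold_build2 keeps the MEM_REF canonical when the address
         argument is itself an ADDR_EXPR of a component.  */
      ref.ref = fold_build2 (MEM_REF,
                             ref.is_read
                             ? TREE_TYPE (gimple_call_lhs (stmt))
                             : TREE_TYPE (gimple_call_arg (stmt, 3)),
                             gimple_call_arg (stmt, 0),
                             gimple_call_arg (stmt, 1));
      references->safe_push (ref);
      break;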
> > > @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
> > >
> > >    FOR_EACH_VEC_ELT (references, i, ref)
> > >      {
> > > -      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> > > +      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
> > >        gcc_assert (dr != NULL);
> > >        datarefs->safe_push (dr);
> > >      }
> >
> > Interesting that you succeeded in removing the indirection
> > on ref.pos ... I remember trying that twice at least and
> > failing ;)
> >
> > You can install that as cleanup now if you split it out (so hopefully
> > no users creep back that make removing it impossible).
>
> Ok, will do.
>
> > > +  /* Check whether this is a load or store.  */
> > > +  lhs = gimple_assign_lhs (stmt);
> > > +  if (TREE_CODE (lhs) != SSA_NAME)
> > > +    {
> >
> > gimple_store_p ()?
>
> Likely.
>
> > > +      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> > > +        return false;
> > > +      op = maskstore_optab;
> > > +      ref = lhs;
> > > +    }
> > > +  else if (gimple_assign_load_p (stmt))
> > > +    {
> > > +      op = maskload_optab;
> > > +      ref = gimple_assign_rhs1 (stmt);
> > > +    }
> > > +  else
> > > +    return false;
> > > +
> > > +  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
> > > +  if (TREE_CODE (ref) == MEM_REF
> > > +      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> > > +      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> > > +      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> > > +    return false;
> >
> > I think that's overly conservative and not conservative enough.  Just
> > use may_be_nonaddressable_p () (even though the implementation can
> > need some TLC) and make sure to set TREE_ADDRESSABLE when you
> > end up taking its address.
>
> Will try.
>
> > Please factor out the target bits into a predicate in optabs.c
> > so you can reduce the amount of includes here.  You can eventually
> > re-use that from the vectorization parts.
>
> Okay.
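(A rough sketch of the kind of optabs.c predicate meant here; the name and
exact interface are placeholders rather than something the patch already
provides:)

    /* Does the target provide a masked load (IS_LOAD) or masked store
       expander for vector mode MODE?  Sketch only; the real predicate
       may take different arguments.  */
    bool
    can_mask_load_store_p (enum machine_mode mode, bool is_load)
    {
      optab op = is_load ? maskload_optab : maskstore_optab;
      return optab_handler (op, mode) != CODE_FOR_nothing;
    }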
> > > @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
> > >        basic_block bb = ifc_bbs[i];
> > >        gimple_seq stmts;
> > >
> > > -      if (!is_predicated (bb))
> > > +      if (!is_predicated (bb)
> > > +          || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> >
> > isn't that redundant now?
>
> Will try (and read the corresponding threads and IRC about that).
>
> > >        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > -        if ((stmt = gsi_stmt (gsi))
> > > -            && gimple_assign_single_p (stmt)
> > > -            && gimple_vdef (stmt))
> > > +        if ((stmt = gsi_stmt (gsi)) == NULL
> >
> > I don't think gsi_stmt can be NULL
>
> It can if gsi_end_p, but that is apparently in the loop condition.
> It was preexisting code anyway, but will change.
>
> > > +   will be then if-converted, the new copy of the loop will not,
> > > +   and the LOOP_VECTORIZED internal call will be guarding which
> > > +   loop to execute.  The vectorizer pass will fold this
> > > +   internal call into either true or false.  */
> > >
> > >  static bool
> > > +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> > > +{
> >
> > What's the do_outer parameter?
>
> That is whether it should version also the outer loop.
> Though, perhaps if we solely use the loop versioning + IFN_LOOP_VECTORIZED
> for MASK_{LOAD,STORE} and perhaps MASK_CALL, we could avoid that and simply
> live with not vectorizing those in nested loops, perhaps that occurs rarely
> enough and will not be a vectorization regression.  The reason I added it
> was that when versioning was always used for if-conversion, without it we
> regressed quite a few tests.  Perhaps the outer loop versioning is too
> costly for such an unlikely case.  So, do you prefer to drop that, or keep?

If it's easy to rip out, it looks like it can simplify the patch at this
stage, which is good.

> > This needs a comment explaining what code you create.
>
> Ok.
>
> > > -  return changed;
> > > +  if (todo && version_outer_loop)
> > > +    {
> > > +      if (todo & TODO_update_ssa_only_virtuals)
> > > +        {
> > > +          update_ssa (TODO_update_ssa_only_virtuals);
> > > +          todo &= ~TODO_update_ssa_only_virtuals;
> > > +        }
> >
> > Btw I hate that we do update_ssa multiple times per pass per
> > function.  That makes us possibly worse than O(N^2) as update_ssa computes
> > the IDF of the whole function.
> >
> > This is something your patch introduces (it's only rewriting the
> > virtuals, not the incremental SSA update by BB copying).
>
> See above.  If you don't like the 2x loop versioning, it can be dropped.
> That said, every loop versioning has one update_ssa anyway, so this
> isn't making the complexity any worse, other than constant factor.

Yeah.  On my TODO list is still cheaper loop versioning (including getting
rid of the slpeel_* stuff and having only a single SESE copying machinery).

> > > +      maskt = gimple_call_arg (stmt, 2);
> > > +      lhs = gimple_call_lhs (stmt);
> > > +      type = TREE_TYPE (lhs);
> > > +      rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> > > +                    gimple_call_arg (stmt, 1));
> >
> > That's possibly not canonical again, use fold_build2.
>
> Ok, will try.
>
> > > +      gimple g = loop_vectorized_call;
> > > +      tree lhs = gimple_call_lhs (g);
> > > +      gimple_stmt_iterator gsi = gsi_for_stmt (g);
> > > +      gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
> >
> > plain update_call_from_tree should also work here, boolean_true_node
> > is already gimple.
>
> Will test.
>
> > > +      gsi_next (&gsi);
> > > +      if (!gsi_end_p (gsi))
> > > +        {
> > > +          g = gsi_stmt (gsi);
> > > +          if (gimple_code (g) == GIMPLE_COND
> > > +              && gimple_cond_lhs (g) == lhs)
> > > +            {
> > > +              gimple_cond_set_lhs (g, boolean_true_node);
> >
> > or simply replace all immediate uses of 'lhs' by boolean_true_node
> > and remove the loop_vectorized call?
>
> That would work, though e.g. removing the call is something any DCE
> probably handles well and the vectorizer relies on tons of DCE anyway.

Well, but you handle forwarding into the COND_EXPR here anyway (probably so
that cfgcleanup removes it), so simply forward it to all uses.

> > See above.  And factor this out into a function.  Also move this
> > to the cleanup loop below.
>
> Ok.

Thanks.
Richard.
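P.S.: a rough sketch of the "replace all immediate uses" variant suggested
above, using the usual immediate-use iteration; illustrative only, the
actual patch may end up doing this differently:

      /* Forward true to every use of the IFN_LOOP_VECTORIZED result and
         let DCE / cfg-cleanup remove the call and fold the condition.  */
      tree lhs = gimple_call_lhs (loop_vectorized_call);
      gimple use_stmt;
      imm_use_iterator imm_iter;
      use_operand_p use_p;

      FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
        {
          FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
            SET_USE (use_p, boolean_true_node);
          update_stmt (use_stmt);
        }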