On Fri, Dec 06, 2013 at 01:49:50PM +0100, Richard Biener wrote:
> Comments inline (scary large this patch for this stage ...)

Thanks.

> > +(define_expand "maskload<mode>"
> > +  [(set (match_operand:V48_AVX2 0 "register_operand")
> > +        (unspec:V48_AVX2
> > +          [(match_operand:<sseintvecmode> 2 "register_operand")
> > +           (match_operand:V48_AVX2 1 "memory_operand")]
> > +          UNSPEC_MASKMOV))]
> > +  "TARGET_AVX")
> > +
> > +(define_expand "maskstore<mode>"
> > +  [(set (match_operand:V48_AVX2 0 "memory_operand")
> > +        (unspec:V48_AVX2
> > +          [(match_operand:<sseintvecmode> 2 "register_operand")
> > +           (match_operand:V48_AVX2 1 "register_operand")
> > +           (match_dup 0)]
> > +          UNSPEC_MASKMOV))]
> > +  "TARGET_AVX")
> > +
> >  (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
> >    [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
> >        (unspec:AVX256MODE2P
>
> x86 maintainers should comment here (ick - unspecs)

Well, the unspecs are preexisting (right now used by intrinsics only),
I'm just adding expanders that will expand to those instructions.
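Just to illustrate the semantics the expanders provide (this is only an
illustration using the existing intrinsics, not part of the patch):
inactive lanes are never accessed in memory, a masked load reads them as
zero and a masked store leaves them untouched, e.g.:

  #include <immintrin.h>

  /* Illustration only: masked load/store through the existing AVX
     intrinsics, which use the same vmaskmovps patterns.  */
  void
  foo (float *dst, const float *src)
  {
    /* Lanes whose mask element has the MSB set are active
       (here lanes 0, 2, 4 and 6).  */
    __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
    __m256 x = _mm256_maskload_ps (src, mask);  /* inactive lanes read as 0 */
    _mm256_maskstore_ps (dst, mask, x);         /* inactive lanes not written */
  }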
> > @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
> >      {
> >        unsigned i, n;
> >
> > -      op0 = gimple_call_lhs_ptr (stmt);
> > +      ref.is_read = false;
> > +      if (gimple_call_internal_p (stmt))
> > +        switch (gimple_call_internal_fn (stmt))
> > +          {
> > +          case IFN_MASK_LOAD:
> > +            ref.is_read = true;
> > +          case IFN_MASK_STORE:
> > +            ref.ref = build2 (MEM_REF,
> > +                              ref.is_read
> > +                              ? TREE_TYPE (gimple_call_lhs (stmt))
> > +                              : TREE_TYPE (gimple_call_arg (stmt, 3)),
> > +                              gimple_call_arg (stmt, 0),
> > +                              gimple_call_arg (stmt, 1));
> > +            references->safe_push (ref);
>
> This may not be a canonical MEM_REF AFAIK, so you should
> use fold_build2 here (if the address is &a.b the .b needs folding
> into the offset).

Ok, will try that.

> I assume the 2nd arg is always constant and
> thus doesn't change pointer-type during propagations?

Yes, it is, always created by
  ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
(and for vectorized IFN_MASK_* copied over from the non-vectorized
IFN_MASK_* call).
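I.e., if I understand the canonicalization issue right, the hunk above
would simply use fold_build2 instead, something like this untested
sketch:

          case IFN_MASK_LOAD:
            ref.is_read = true;
            /* FALLTHRU */
          case IFN_MASK_STORE:
            /* fold_build2 folds an &a.b address argument into the
               constant offset, so the MEM_REF stays canonical.  */
            ref.ref = fold_build2 (MEM_REF,
                                   ref.is_read
                                   ? TREE_TYPE (gimple_call_lhs (stmt))
                                   : TREE_TYPE (gimple_call_arg (stmt, 3)),
                                   gimple_call_arg (stmt, 0),
                                   gimple_call_arg (stmt, 1));
            references->safe_push (ref);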
> > @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
> >
> >    FOR_EACH_VEC_ELT (references, i, ref)
> >      {
> > -      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> > +      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
> >        gcc_assert (dr != NULL);
> >        datarefs->safe_push (dr);
> >      }
>
> Interesting that you succeeded in removing the indirection
> on ref.pos ... I remember trying that twice at least and
> failing ;)
>
> You can install that as cleanup now if you split it out (so hopefully
> no users creep back that make removing it impossible).

Ok, will do.

> > +  /* Check whether this is a load or store.  */
> > +  lhs = gimple_assign_lhs (stmt);
> > +  if (TREE_CODE (lhs) != SSA_NAME)
> > +    {
>
> gimple_store_p ()?

Likely.

> > +      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> > +        return false;
> > +      op = maskstore_optab;
> > +      ref = lhs;
> > +    }
> > +  else if (gimple_assign_load_p (stmt))
> > +    {
> > +      op = maskload_optab;
> > +      ref = gimple_assign_rhs1 (stmt);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
> > +  if (TREE_CODE (ref) == MEM_REF
> > +      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> > +      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> > +      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> > +    return false;
>
> I think that's overly conservative and not conservative enough.  Just
> use may_be_nonaddressable_p () (even though the implementation can
> need some TLC) and make sure to set TREE_ADDRESSABLE when you
> end up taking its address.

Will try.

> Please factor out the target bits into a predicate in optabs.c
> so you can reduce the amount of includes here.  You can eventually
> re-use that from the vectorization parts.

Okay.

> > @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
> >        basic_block bb = ifc_bbs[i];
> >        gimple_seq stmts;
> >
> > -      if (!is_predicated (bb))
> > +      if (!is_predicated (bb)
> > +          || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>
> isn't that redundant now?

Will try (and read the corresponding threads and IRC about that).

> >        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > -        if ((stmt = gsi_stmt (gsi))
> > -            && gimple_assign_single_p (stmt)
> > -            && gimple_vdef (stmt))
> > +        if ((stmt = gsi_stmt (gsi)) == NULL
>
> I don't think gsi_stmt can be NULL

It can be if gsi_end_p, but that is already checked in the loop
condition.  It was preexisting code anyway, but I will change it.

> > +   will be then if-converted, the new copy of the loop will not,
> > +   and the LOOP_VECTORIZED internal call will be guarding which
> > +   loop to execute.  The vectorizer pass will fold this
> > +   internal call into either true or false.  */
> >
> >  static bool
> > +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> > +{
>
> What's the do_outer parameter?

That is whether it should version also the outer loop.  Though, perhaps
if we solely use the loop versioning + IFN_LOOP_VECTORIZED for
MASK_{LOAD,STORE} and perhaps MASK_CALL, we could avoid that and simply
live with not vectorizing those in nested loops; perhaps that occurs
rarely enough that it will not be a vectorization regression.  The
reason I added it was that when the versioning was used always for
if-conversion, we regressed quite a few tests without it.  Perhaps the
outer loop versioning is too costly for such an unlikely case.  So, do
you prefer to drop that, or keep it?

> This needs a comment explaining what code you create.

Ok.

> > -  return changed;
> > +  if (todo && version_outer_loop)
> > +    {
> > +      if (todo & TODO_update_ssa_only_virtuals)
> > +        {
> > +          update_ssa (TODO_update_ssa_only_virtuals);
> > +          todo &= ~TODO_update_ssa_only_virtuals;
> > +        }
>
> Btw I hate that we do update_ssa multiple times per pass per
> function.  That makes us possibly worse than O(N^2) as update_ssa computes
> the IDF of the whole function.
>
> This is something your patch introduces (it's only rewriting the
> virtuals, not the incremental SSA update by BB copying).

See above.  If you don't like the 2x loop versioning, it can be dropped.
That said, every loop versioning has one update_ssa anyway, so this
isn't making the complexity any worse, other than by a constant factor.

> > +      maskt = gimple_call_arg (stmt, 2);
> > +      lhs = gimple_call_lhs (stmt);
> > +      type = TREE_TYPE (lhs);
> > +      rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> > +                    gimple_call_arg (stmt, 1));
>
> That's possibly not canonical again, use fold_build2.

Ok, will try.

> > +      gimple g = loop_vectorized_call;
> > +      tree lhs = gimple_call_lhs (g);
> > +      gimple_stmt_iterator gsi = gsi_for_stmt (g);
> > +      gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
>
> plain update_call_from_tree should also work here, boolean_true_node
> is already gimple.

Will test.

> > +      gsi_next (&gsi);
> > +      if (!gsi_end_p (gsi))
> > +        {
> > +          g = gsi_stmt (gsi);
> > +          if (gimple_code (g) == GIMPLE_COND
> > +              && gimple_cond_lhs (g) == lhs)
> > +            {
> > +              gimple_cond_set_lhs (g, boolean_true_node);
>
> or simply replace all immediate uses of 'lhs' by boolean_true_node
> and remove the loop_vectorized call?

That would work, though e.g. removing the call is something any DCE
probably handles well, and the vectorizer relies on tons of DCE anyway.
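If you prefer the immediate-use variant, I suppose an untested sketch
would look something along these lines (the call then becomes trivially
dead):

  gimple use_stmt;
  imm_use_iterator iter;
  use_operand_p use_p;
  tree lhs = gimple_call_lhs (loop_vectorized_call);

  /* Make every use of the IFN_LOOP_VECTORIZED result true; the call
     itself is then trivially dead and the next DCE will remove it.  */
  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
    {
      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
        SET_USE (use_p, boolean_true_node);
      update_stmt (use_stmt);
    }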
> See above.  And factor this out into a function.  Also move this
> to the cleanup loop below.

Ok.

        Jakub