Re: V5 [PATCH] Optimize vector constructor

Richard Biener Fri, 08 Mar 2019 03:04:13 -0800

On Fri, Mar 8, 2019 at 9:49 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Thu, Mar 7, 2019 at 9:51 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Wed, Mar 6, 2019 at 8:33 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener wrote:
> > > > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew Pinski wrote:
> > > > > > > > )
> > > > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu <hjl.to...@gmail.com> 
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > For vector init constructor:
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > typedef float __v4sf __attribute__ ((__vector_size__ (16)));
> > > > > > > > >
> > > > > > > > > __v4sf
> > > > > > > > > foo (__v4sf x, float f)
> > > > > > > > > {
> > > > > > > > >   __v4sf y = { f, x[1], x[2], x[3] };
> > > > > > > > >   return y;
> > > > > > > > > }
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > we can optimize vector init constructor with vector copy or 
> > > > > > > > > permute
> > > > > > > > > followed by a single scalar insert:
> > > > >
> > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR here.  The 
> > > > > > easiest way
> > > > > > is to emit a new stmt for _2 = copy ...; and do the set_rhs with the
> > > > > > BIT_INSERT_EXPR.
> > > > >
> > > > > Thanks for BIT_INSERT_EXPR suggestion.  I am testing this patch.
> > > > >
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > We can optimize vector constructor with vector copy or permute 
> > > > > followed
> > > > > by a single scalar insert:
> > > > >
> > > > >   __v4sf y;
> > > > >   __v4sf D.1930;
> > > > >   float _1;
> > > > >   float _2;
> > > > >   float _3;
> > > > >
> > > > >   <bb 2> :
> > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > >   y_6 = {f_5(D), _3, _2, _1};
> > > > >   return y_6;
> > > > >
> > > > > with
> > > > >
> > > > >  __v4sf y;
> > > > >   __v4sf D.1930;
> > > > >   float _1;
> > > > >   float _2;
> > > > >   float _3;
> > > > >   vector(4) float _8;
> > > > >
> > > > >   <bb 2> :
> > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > >   _8 = x_9(D);
> > > > >   y_6 = BIT_INSERT_EXPR <x_9(D), f_5(D), 0 (32 bits)>;
> > > > >   return y_6;
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR tree-optimization/88828
> > > > >         * tree-ssa-forwprop.c (simplify_vector_constructor): Optimize
> > > > >         vector init constructor with vector copy or permute followed
> > > > >         by a single scalar insert.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >         PR tree-optimization/88828
> > > > >         * gcc.target/i386/pr88828-1a.c: New test.
> > > > >         * gcc.target/i386/pr88828-2b.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-2.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-3a.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-3b.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-3c.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-3d.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-4a.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-4b.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-5a.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-5b.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-6a.c: Likewise.
> > > > >         * gcc.target/i386/pr88828-6b.c: Likewise.
> > > >
> > > > Here is the updated patch with run-time tests.
> > >
> > > -      if (TREE_CODE (elt->value) != SSA_NAME)
> > > +      if (TREE_CODE (ce->value) != SSA_NAME)
> > >         return false;
> > >
> > > hmm, so it doesn't allow { 0, v[1], v[2], v[3] }?  I think the single
> > > scalar value can be a constant as well.
> >
> > Fixed.
> >
> > >        if (!def_stmt)
> > > -       return false;
> > > +       {
> > > +         if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value)))
> > >
> > > if (SSA_NAME_IS_DEFAULT_DEF (ce->value))
> > >
> > > +           {
> > >
> > > also you seem to disallow
> > >
> > >   { i + 1, v[1], v[2], v[3] }
> >
> > Fixed by
> >
> >      if (code != BIT_FIELD_REF)
> >         {
> >           /* Only allow one scalar insert.  */
> >           if (nscalars != 0)
> >             return false;
> >
> >           nscalars = 1;
> >           insert = true;
> >           scalar_idx = i;
> >           sel.quick_push (i);
> >           scalar_element = ce->value;
> >           continue;
> >         }
> >
> > > because get_prop_source_stmt will return the definition computing
> > > i + 1 in this case and your code will be skipped?
> > >
> > > I think you can simplify the code by treating scalar_element != NULL
> > > as nscalars == 1 and eliding nscalars.
> >
> > It works only if
> >
> > TYPE_VECTOR_SUBPARTS (type).to_constant ()  == (nscalars + nvectors)
> >
> > We need to check both nscalars and nvectors.  Elide nscalar
> > check doesn't help much here.
> >
> > > -      if (conv_code == ERROR_MARK)
> > > +      if (conv_code == ERROR_MARK && !insert)
> > >         gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig[0],
> > >                                         orig[1], op2);
> > >        else
> > > @@ -2148,10 +2198,25 @@ simplify_vector_constructor (gimple_stmt_iterator 
> > > *gsi)
> > >                                    VEC_PERM_EXPR, orig[0], orig[1], op2);
> > >           orig[0] = gimple_assign_lhs (perm);
> > >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> > > -         gimple_assign_set_rhs_with_ops (gsi, conv_code, orig[0],
> > > +         gimple_assign_set_rhs_with_ops (gsi,
> > > +                                         (conv_code != ERROR_MARK
> > > +                                          ? conv_code
> > > +                                          : NOP_EXPR),
> > > +                                         orig[0],
> > >                                           NULL_TREE, NULL_TREE);
> > >
> > > I believe you should elide the last stmt for conv_code == ERROR_MARK,
> > > that is, why did you need to add the && !insert check in the guarding 
> > > condition
> >
> > When conv_code == ERROR_MARK, we still need
> >
> >        gimple *perm
> >             = gimple_build_assign (make_ssa_name (TREE_TYPE (orig[0])),
> >                                    VEC_PERM_EXPR, orig[0], orig[1], op2);
> >           orig[0] = gimple_assign_lhs (perm);
> >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> >           gimple_assign_set_rhs_with_ops (gsi,  NOP_EXPR,
> >                                           orig[0],
> >                                           NULL_TREE, NULL_TREE);
> >
> > Otherwise, scalar insert won't work.
> >
> > > (this path should already do the correct thing?).  Note that in all
> > > cases it looks
> > > that with conv_code != ERROR_MARK you may end up doing a float->int
> > > or int->float conversion on a value it wasn't done on before which might
> > > raise exceptions?  That is, do we need to make sure we permute a
> > > value we already do convert into the place we're going to insert to?
> >
> > This couldn't happen:
> >
> >       if (type == TREE_TYPE (ref))
> >          {
> >            /* The RHS vector has the same type as LHS.  */
> >            if (rhs_vector == NULL)
> >              rhs_vector = ref;
> >            /* Check if all RHS vector elements come fome the same
> >               vector.  */
> >            if (rhs_vector == ref)
> >              nvectors++;
> >          }
> > ...
> >   if (insert
> >       && (nvectors == 0
> >           || (TYPE_VECTOR_SUBPARTS (type).to_constant ()
> >               != (nscalars + nvectors))))
> >     return false;


I see - that looks like a missed case then?

 { 1., (float)v[1], (float)v[2], (float)v[3] }

with integer vector v?

I'll have a look at the full patch next week (it's GCC 10 material in any case).

Richard.

> > > +  if (insert)
> > > +    {
> > > +      /* Generate a single scalar insert.  */
> > > +      tree var = make_ssa_name (type);
> > > +      tree val = gimple_assign_rhs1 (stmt);
> > > +      gimple *copy = gimple_build_assign (var, val);
> > >
> > > I believe this doesn't properly copy the stmt in case it is a permute.
> > > You can use (note the use of gsi_stmt - gimple_assign_set_rhs_with_ops
> > > can re-allocate the stmt)
> > >
> > >         gimple *copy = gimple_copy (gsi_stmt (*gsi));
> > >         gimple_assign_set_lhs (copy, var);
> >
> > Fixed.
> >
> > > +      gsi_insert_before (gsi, copy, GSI_SAME_STMT);
> > > +      tree bitpos = bitsize_int (scalar_idx * elem_size);
> > > +      gimple_assign_set_rhs_with_ops (gsi, BIT_INSERT_EXPR, var,
> > > +                                     scalar_element, bitpos);
> > > +    }
> > >
> > > Otherwise looks OK to me.
> > >
> > > As separate followup patch it might be interesting to support
> > >
> > >  { 0, a[1], a[2], 3 }
> > >
> > > kinds as well, thus combining a VECTOR_CST (which is
> > > reasonably cheap to create) with another vector.  That should
> > > be maybe done as a first patch given this is just a two-vector
> > > permute which the code already handles apart from not
> > > recognizing the implicit constant vector participating.
> > >
> > > Similar
> > >
> > >  { 0, a[1], b[2], 3 }
> > >
> > > where the combination of a and b is blended with another
> > > constant vector.  I'm not sure if handling an arbitrary number
> > > of scalar elements should be done in a similar way, that is,
> > > implementing
> > >
> > >  { s1, a[1], a[2], s2, s3, b[0], b[1], b[2] }
> > >
> > > as
> > >
> > >   tem = VEC_PERM <a, b, { ... }>
> > >   tem2 = { s1, 0, 0, s2, s3, 0, 0, 0 }
> > >   res = VEC_PERM <tem, tem2, { blend-mask }>
> > >
> > > where constructing tem2 should take at most
> > > N-1 inserts (the first element to insert into tem2
> > > can use a splat or if element zero a zero-extending move).
> > >
> > > Doing this effectively lifts the restriction of only
> > > handling two vectors - we'd incrementally do
> > > two-vector permute plus blend of the rest which has
> > > its constructor re-processed.
> > >
> > > But as said - the code is already a bit awkward so changing
> > > this in multiple reivisions is preferred and the single-element
> > > case is certainly sth to do via a BIT_INSERT_EXPR.
> >
> > Agreed.
> >
> > I am testing this updated patch.
> >
>
> This is the fully tested patch.  I used
>
> +
> +      if (useless_type_conversion_p (type, TREE_TYPE (ref)))
> + {
> +    /* The RHS vector has the same type as LHS.  */
> +    if (rhs_vector == NULL)
> +      rhs_vector = ref;
> +    /* Check if all RHS vector elements come fome the same
> +       vector.  */
> +    if (rhs_vector == ref)
> +      nvectors++;
> + }
> +
>
> to support cast between __v4sf and __m128.
>
> --
> H.J.

Re: V5 [PATCH] Optimize vector constructor

Reply via email to