On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, William J. Schmidt wrote:
> 
> > 
> > 
> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> > > <wschm...@linux.vnet.ibm.com> wrote:
> > > > The fix for PR53331 caused a degradation to 187.facerec on
> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > > > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > > > for trunk?
> > > 
> > > Well, would the real cost not be subparts * scalar_to_vec plus
> > > subparts * vec_perm?
> > > At least vec_perm isn't the cost for building up a vector from N scalar 
> > > elements
> > > either (it might be close enough if subparts == 2).  What's the case
> > > with facerec
> > > here?  Does it have subparts == 2?  
> > 
> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
> > PowerPC, this requires two merge instructions and a permute instruction
> > to get the four 32-bit quantities into the right place in a 128-bit
> > register.  Currently this is modeled as a vec_perm in other parts of the
> > vectorizer per Ira's earlier patches, so I naively changed this to do
> > the same thing.
> 
> I see.
> 
> > The types of vectorizer instructions aren't documented, and I can't
> > infer much from the i386.c cost model, so I need a little education.
> > What semantics are represented by scalar_to_vec?
> 
> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
> the most efficient way to build up such a vector of course (with AVX
> you could do one splat plus N - 1 inserts for example).  The cost
> is of course dependent on the number of vector elements, so a
> simple new enum vect_cost_for_stmt kind does not cover it but
> the target would have to look at the vector type passed and do
> some reasonable guess.

Ah, splat!  Yes, that's lingo I understand.  I see the intent now.

> 
> > On PowerPC, we have this mapping of the floating-point registers and
> > vector float registers where they overlap (the low-order half of each of
> > the first 32 vector float regs corresponds to a scalar float reg).  So
> > in this case we have four scalar loads that place things in the bottom
> > half of four vector registers, two vector merge instructions that
> > collapse the four registers into two vector registers, and a vector
> > permute that gets things in the right order.(*)  I wonder if what we
> > refer to as a merge instruction is similar to scalar_to_vec.
> 
> Looks similar to x86 SSE then.
> 
> > If so, then maybe we need something like
> > 
> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts 
> > / 2;
> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > 
> > But then we'd have to change how vec_perm is modeled elsewhere for
> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
> > me to figure out all the intent of cost model decisions that have been
> > made in the past, using current documentation.
> 
> Heh, usually the intent was to make the changes simple, not to compute
> a proper cost.
> 
> I think we simply need a new scalars_to_vec cost kind.

That works.  Maybe vec_construct gets the point across a little better?
I think we need to use the full builtin_vectorization_cost interface
instead of vect_get_stmt_cost here, so the targets can parameterize on
type.  Then we can just do one cost calculation for vec_construct that
covers the full costs of getting the vector in order after the loads.

> 
> > > I really wanted to pessimize this case
> > > for say AVX and char elements, thus building up a vector from 32 scalars 
> > > which
> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
> > > subparts == 2 case and assume vec_perm would match the cost only in that
> > > case.
> > 
> > (I'm a little confused by this as what you have at the moment is a
> > single scalar_to_vec per copy, which has a cost of 1 on most i386
> > targets (occasionally 2).  The subparts multiplier is only applied to
> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
> 
> Oh, I somehow read the patch as you were removing the multiplication
> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
> to reflect it with N scalar loads plus N splats plus N - 1 permutes
> originally.  You could also model it with N scalar loads plus N
> inserts (but we don't have a vec_insert cost either).  I think adding
> a scalars_to_vec or vec_init or however we want to call it - basically
> what the cost of a vector CONSTRUCTOR would be - is best.
> 
> > (*) There are actually a couple more instructions here to convert 64-bit
> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
> > 64-bit values in scalar float registers and they have to be coerced back
> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
> > at all, which I'll have to look at fixing at some point in some way that
> > isn't too nasty for the other targets.  The cost model for PowerPC seems
> > to need a lot of TLC.
> 
> Maybe the above would be a possible way to do it.  On x86 for example
> a vector of two doubles is extremely cheap for generic SSE2 to construct,
> we can directly load into the low/high part, thus when we use it as
> N scalar loads plus the vector-build the vector-build would be free.

Absolutely.  As long as the vector type is available, this can be built
into vec_construct's logic in the target.  I will feel much more
comfortable having better estimation for the ugly 32-bit case.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > > 
> > > Thanks,
> > > Richard.
> > > 
> > > > Thanks,
> > > > Bill
> > > >
> > > >
> > > > 2012-06-10  Bill Schmidt  <wschm...@linux.ibm.com>
> > > >
> > > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> > > >        for strided loads.
> > > >
> > > >
> > > > Index: gcc/tree-vect-stmts.c
> > > > ===================================================================
> > > > --- gcc/tree-vect-stmts.c       (revision 188341)
> > > > +++ gcc/tree-vect-stmts.c       (working copy)
> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, 
> > > > int
> > > >   /* The loads themselves.  */
> > > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> > > >     {
> > > > -      /* N scalar loads plus gathering them into a vector.
> > > > -         ???  scalar_to_vec isn't the cost for that.  */
> > > > +      /* N scalar loads plus gathering them into a vector.  */
> > > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> > > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
> > > > (stmt_info)));
> > > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > > >     }
> > > >   else
> > > >     vect_get_load_cost (first_dr, ncopies,
> > > >
> > > >
> > > 
> > 
> > 
> 

Reply via email to