Re: [PATCH] Correct cost model for strided loads

Richard Guenther Mon, 11 Jun 2012 07:10:43 -0700

On Mon, 11 Jun 2012, William J. Schmidt wrote:

> 
> 
> On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> > <wschm...@linux.vnet.ibm.com> wrote:
> > > The fix for PR53331 caused a degradation to 187.facerec on
> > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > > for trunk?
> > 
> > Well, would the real cost not be subparts * scalar_to_vec plus
> > subparts * vec_perm?
> > At least vec_perm isn't the cost for building up a vector from N scalar 
> > elements
> > either (it might be close enough if subparts == 2).  What's the case
> > with facerec
> > here?  Does it have subparts == 2?  
> 
> In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
> PowerPC, this requires two merge instructions and a permute instruction
> to get the four 32-bit quantities into the right place in a 128-bit
> register.  Currently this is modeled as a vec_perm in other parts of the
> vectorizer per Ira's earlier patches, so I naively changed this to do
> the same thing.


I see.

> The types of vectorizer instructions aren't documented, and I can't
> infer much from the i386.c cost model, so I need a little education.
> What semantics are represented by scalar_to_vec?

It's a vector splat, thus x -> { x, x, x, ... }.  You can create
{ x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
as VEC_PERM_EXPR, takes two input vectors).  That's by far not
the most efficient way to build up such a vector of course (with AVX
you could do one splat plus N - 1 inserts for example).  The cost
is of course dependent on the number of vector elements, so a
simple new enum vect_cost_for_stmt kind does not cover it but
the target would have to look at the vector type passed and do
some reasonable guess.

> On PowerPC, we have this mapping of the floating-point registers and
> vector float registers where they overlap (the low-order half of each of
> the first 32 vector float regs corresponds to a scalar float reg).  So
> in this case we have four scalar loads that place things in the bottom
> half of four vector registers, two vector merge instructions that
> collapse the four registers into two vector registers, and a vector
> permute that gets things in the right order.(*)  I wonder if what we
> refer to as a merge instruction is similar to scalar_to_vec.

Looks similar to x86 SSE then.

> If so, then maybe we need something like
> 
>      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 
> 2;
>      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> 
> But then we'd have to change how vec_perm is modeled elsewhere for
> PowerPC based on Ira's earlier patches.  As I said, it's difficult for
> me to figure out all the intent of cost model decisions that have been
> made in the past, using current documentation.

Heh, usually the intent was to make the changes simple, not to compute
a proper cost.

I think we simply need a new scalars_to_vec cost kind.

> > I really wanted to pessimize this case
> > for say AVX and char elements, thus building up a vector from 32 scalars 
> > which
> > certainly does not cost a mere vec_perm.  So, maybe special-case the
> > subparts == 2 case and assume vec_perm would match the cost only in that
> > case.
> 
> (I'm a little confused by this as what you have at the moment is a
> single scalar_to_vec per copy, which has a cost of 1 on most i386
> targets (occasionally 2).  The subparts multiplier is only applied to
> the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

Oh, I somehow read the patch as you were removing the multiplication
by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
to reflect it with N scalar loads plus N splats plus N - 1 permutes
originally.  You could also model it with N scalar loads plus N
inserts (but we don't have a vec_insert cost either).  I think adding
a scalars_to_vec or vec_init or however we want to call it - basically
what the cost of a vector CONSTRUCTOR would be - is best.

> (*) There are actually a couple more instructions here to convert 64-bit
> values to 32-bit values, since on PowerPC 32-bit loads are converted to
> 64-bit values in scalar float registers and they have to be coerced back
> to 32-bit.  Very ugly.  The cost model currently doesn't represent this
> at all, which I'll have to look at fixing at some point in some way that
> isn't too nasty for the other targets.  The cost model for PowerPC seems
> to need a lot of TLC.

Maybe the above would be a possible way to do it.  On x86 for example
a vector of two doubles is extremely cheap for generic SSE2 to construct,
we can directly load into the low/high part, thus when we use it as
N scalar loads plus the vector-build the vector-build would be free.

Thanks,
Richard.

> Thanks,
> Bill
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > Thanks,
> > > Bill
> > >
> > >
> > > 2012-06-10  Bill Schmidt  <wschm...@linux.ibm.com>
> > >
> > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> > >        for strided loads.
> > >
> > >
> > > Index: gcc/tree-vect-stmts.c
> > > ===================================================================
> > > --- gcc/tree-vect-stmts.c       (revision 188341)
> > > +++ gcc/tree-vect-stmts.c       (working copy)
> > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> > >   /* The loads themselves.  */
> > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> > >     {
> > > -      /* N scalar loads plus gathering them into a vector.
> > > -         ???  scalar_to_vec isn't the cost for that.  */
> > > +      /* N scalar loads plus gathering them into a vector.  */
> > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
> > > (stmt_info)));
> > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > >     }
> > >   else
> > >     vect_get_load_cost (first_dr, ncopies,
> > >
> > >
> > 
> 
> 

-- 
Richard Guenther <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix ImendÃ¶rffer

Re: [PATCH] Correct cost model for strided loads

Reply via email to