Re: [PATCH] Correct cost model for strided loads

Richard Guenther Mon, 11 Jun 2012 11:37:31 -0700

On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
>
>
> On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
>> On Mon, 11 Jun 2012, William J. Schmidt wrote:
>>
>> >
>> >
>> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
>> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
>> > > <wschm...@linux.vnet.ibm.com> wrote:
>> > > > The fix for PR53331 caused a degradation to 187.facerec on
>> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
>> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
>> > > > Bootstrapped and regtested on that platform with no new regressions.  
>> > > > Ok
>> > > > for trunk?
>> > >
>> > > Well, would the real cost not be subparts * scalar_to_vec plus
>> > > subparts * vec_perm?
>> > > At least vec_perm isn't the cost for building up a vector from N scalar 
>> > > elements
>> > > either (it might be close enough if subparts == 2).  What's the case
>> > > with facerec
>> > > here?  Does it have subparts == 2?
>> >
>> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
>> > PowerPC, this requires two merge instructions and a permute instruction
>> > to get the four 32-bit quantities into the right place in a 128-bit
>> > register.  Currently this is modeled as a vec_perm in other parts of the
>> > vectorizer per Ira's earlier patches, so I naively changed this to do
>> > the same thing.
>>
>> I see.
>>
>> > The types of vectorizer instructions aren't documented, and I can't
>> > infer much from the i386.c cost model, so I need a little education.
>> > What semantics are represented by scalar_to_vec?
>>
>> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
>> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
>> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
>> the most efficient way to build up such a vector of course (with AVX
>> you could do one splat plus N - 1 inserts for example).  The cost
>> is of course dependent on the number of vector elements, so a
>> simple new enum vect_cost_for_stmt kind does not cover it but
>> the target would have to look at the vector type passed and do
>> some reasonable guess.
>
> Ah, splat!  Yes, that's lingo I understand.  I see the intent now.
>
>>
>> > On PowerPC, we have this mapping of the floating-point registers and
>> > vector float registers where they overlap (the low-order half of each of
>> > the first 32 vector float regs corresponds to a scalar float reg).  So
>> > in this case we have four scalar loads that place things in the bottom
>> > half of four vector registers, two vector merge instructions that
>> > collapse the four registers into two vector registers, and a vector
>> > permute that gets things in the right order.(*)  I wonder if what we
>> > refer to as a merge instruction is similar to scalar_to_vec.
>>
>> Looks similar to x86 SSE then.
>>
>> > If so, then maybe we need something like
>> >
>> >      subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>> >      inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
>> >      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * 
>> > subparts / 2;
>> >      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>> >
>> > But then we'd have to change how vec_perm is modeled elsewhere for
>> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
>> > me to figure out all the intent of cost model decisions that have been
>> > made in the past, using current documentation.
>>
>> Heh, usually the intent was to make the changes simple, not to compute
>> a proper cost.
>>
>> I think we simply need a new scalars_to_vec cost kind.
>
> That works.  Maybe vec_construct gets the point across a little better?
> I think we need to use the full builtin_vectorization_cost interface
> instead of vect_get_stmt_cost here, so the targets can parameterize on
> type.  Then we can just do one cost calculation for vec_construct that
> covers the full costs of getting the vector in order after the loads.


Yes, vec_construct sounds good to me.

>>
>> > > I really wanted to pessimize this case
>> > > for say AVX and char elements, thus building up a vector from 32 scalars 
>> > > which
>> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
>> > > subparts == 2 case and assume vec_perm would match the cost only in that
>> > > case.
>> >
>> > (I'm a little confused by this as what you have at the moment is a
>> > single scalar_to_vec per copy, which has a cost of 1 on most i386
>> > targets (occasionally 2).  The subparts multiplier is only applied to
>> > the loads.  So changing this to vec_perm seemed to be a no-op for i386.)
>>
>> Oh, I somehow read the patch as you were removing the multiplication
>> by TYPE_VECTOR_SUBPARTS.  But yes, the cost is way off and I'd wanted
>> to reflect it with N scalar loads plus N splats plus N - 1 permutes
>> originally.  You could also model it with N scalar loads plus N
>> inserts (but we don't have a vec_insert cost either).  I think adding
>> a scalars_to_vec or vec_init or however we want to call it - basically
>> what the cost of a vector CONSTRUCTOR would be - is best.
>>
>> > (*) There are actually a couple more instructions here to convert 64-bit
>> > values to 32-bit values, since on PowerPC 32-bit loads are converted to
>> > 64-bit values in scalar float registers and they have to be coerced back
>> > to 32-bit.  Very ugly.  The cost model currently doesn't represent this
>> > at all, which I'll have to look at fixing at some point in some way that
>> > isn't too nasty for the other targets.  The cost model for PowerPC seems
>> > to need a lot of TLC.
>>
>> Maybe the above would be a possible way to do it.  On x86 for example
>> a vector of two doubles is extremely cheap for generic SSE2 to construct,
>> we can directly load into the low/high part, thus when we use it as
>> N scalar loads plus the vector-build the vector-build would be free.
>
> Absolutely.  As long as the vector type is available, this can be built
> into vec_construct's logic in the target.  I will feel much more
> comfortable having better estimation for the ugly 32-bit case.

Indeed.

Richard.

> Thanks,
> Bill
>
>>
>> Thanks,
>> Richard.
>>
>> > Thanks,
>> > Bill
>> >
>> > >
>> > > Thanks,
>> > > Richard.
>> > >
>> > > > Thanks,
>> > > > Bill
>> > > >
>> > > >
>> > > > 2012-06-10  Bill Schmidt  <wschm...@linux.ibm.com>
>> > > >
>> > > >        * tree-vect-stmts.c (vect_model_load_cost):  Change cost model
>> > > >        for strided loads.
>> > > >
>> > > >
>> > > > Index: gcc/tree-vect-stmts.c
>> > > > ===================================================================
>> > > > --- gcc/tree-vect-stmts.c       (revision 188341)
>> > > > +++ gcc/tree-vect-stmts.c       (working copy)
>> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, 
>> > > > int
>> > > >   /* The loads themselves.  */
>> > > >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
>> > > >     {
>> > > > -      /* N scalar loads plus gathering them into a vector.
>> > > > -         ???  scalar_to_vec isn't the cost for that.  */
>> > > > +      /* N scalar loads plus gathering them into a vector.  */
>> > > >       inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
>> > > >                      * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
>> > > > (stmt_info)));
>> > > > -      inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
>> > > > +      inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
>> > > >     }
>> > > >   else
>> > > >     vect_get_load_cost (first_dr, ncopies,
>> > > >
>> > > >
>> > >
>> >
>> >
>>
>

Re: [PATCH] Correct cost model for strided loads

Reply via email to