On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: > On Mon, 11 Jun 2012, William J. Schmidt wrote: > > > > > > > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > > > <wschm...@linux.vnet.ibm.com> wrote: > > > > The fix for PR53331 caused a degradation to 187.facerec on > > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > > > Bootstrapped and regtested on that platform with no new regressions. Ok > > > > for trunk? > > > > > > Well, would the real cost not be subparts * scalar_to_vec plus > > > subparts * vec_perm? > > > At least vec_perm isn't the cost for building up a vector from N scalar > > > elements > > > either (it might be close enough if subparts == 2). What's the case > > > with facerec > > > here? Does it have subparts == 2? > > > > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On > > PowerPC, this requires two merge instructions and a permute instruction > > to get the four 32-bit quantities into the right place in a 128-bit > > register. Currently this is modeled as a vec_perm in other parts of the > > vectorizer per Ira's earlier patches, so I naively changed this to do > > the same thing. > > I see. > > > The types of vectorizer instructions aren't documented, and I can't > > infer much from the i386.c cost model, so I need a little education. > > What semantics are represented by scalar_to_vec? > > It's a vector splat, thus x -> { x, x, x, ... }. You can create > { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, > as VEC_PERM_EXPR, takes two input vectors). That's by far not > the most efficient way to build up such a vector of course (with AVX > you could do one splat plus N - 1 inserts for example). The cost > is of course dependent on the number of vector elements, so a > simple new enum vect_cost_for_stmt kind does not cover it but > the target would have to look at the vector type passed and do > some reasonable guess.
Ah, splat! Yes, that's lingo I understand. I see the intent now. > > > On PowerPC, we have this mapping of the floating-point registers and > > vector float registers where they overlap (the low-order half of each of > > the first 32 vector float regs corresponds to a scalar float reg). So > > in this case we have four scalar loads that place things in the bottom > > half of four vector registers, two vector merge instructions that > > collapse the four registers into two vector registers, and a vector > > permute that gets things in the right order.(*) I wonder if what we > > refer to as a merge instruction is similar to scalar_to_vec. > > Looks similar to x86 SSE then. > > > If so, then maybe we need something like > > > > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); > > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; > > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts > > / 2; > > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > > But then we'd have to change how vec_perm is modeled elsewhere for > > PowerPC based on Ira's earlier patches. As I said, it's difficult for > > me to figure out all the intent of cost model decisions that have been > > made in the past, using current documentation. > > Heh, usually the intent was to make the changes simple, not to compute > a proper cost. > > I think we simply need a new scalars_to_vec cost kind. That works. Maybe vec_construct gets the point across a little better? I think we need to use the full builtin_vectorization_cost interface instead of vect_get_stmt_cost here, so the targets can parameterize on type. Then we can just do one cost calculation for vec_construct that covers the full costs of getting the vector in order after the loads. > > > > I really wanted to pessimize this case > > > for say AVX and char elements, thus building up a vector from 32 scalars > > > which > > > certainly does not cost a mere vec_perm. So, maybe special-case the > > > subparts == 2 case and assume vec_perm would match the cost only in that > > > case. > > > > (I'm a little confused by this as what you have at the moment is a > > single scalar_to_vec per copy, which has a cost of 1 on most i386 > > targets (occasionally 2). The subparts multiplier is only applied to > > the loads. So changing this to vec_perm seemed to be a no-op for i386.) > > Oh, I somehow read the patch as you were removing the multiplication > by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted > to reflect it with N scalar loads plus N splats plus N - 1 permutes > originally. You could also model it with N scalar loads plus N > inserts (but we don't have a vec_insert cost either). I think adding > a scalars_to_vec or vec_init or however we want to call it - basically > what the cost of a vector CONSTRUCTOR would be - is best. > > > (*) There are actually a couple more instructions here to convert 64-bit > > values to 32-bit values, since on PowerPC 32-bit loads are converted to > > 64-bit values in scalar float registers and they have to be coerced back > > to 32-bit. Very ugly. The cost model currently doesn't represent this > > at all, which I'll have to look at fixing at some point in some way that > > isn't too nasty for the other targets. The cost model for PowerPC seems > > to need a lot of TLC. > > Maybe the above would be a possible way to do it. On x86 for example > a vector of two doubles is extremely cheap for generic SSE2 to construct, > we can directly load into the low/high part, thus when we use it as > N scalar loads plus the vector-build the vector-build would be free. Absolutely. As long as the vector type is available, this can be built into vec_construct's logic in the target. I will feel much more comfortable having better estimation for the ugly 32-bit case. Thanks, Bill > > Thanks, > Richard. > > > Thanks, > > Bill > > > > > > > > Thanks, > > > Richard. > > > > > > > Thanks, > > > > Bill > > > > > > > > > > > > 2012-06-10 Bill Schmidt <wschm...@linux.ibm.com> > > > > > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > > > for strided loads. > > > > > > > > > > > > Index: gcc/tree-vect-stmts.c > > > > =================================================================== > > > > --- gcc/tree-vect-stmts.c (revision 188341) > > > > +++ gcc/tree-vect-stmts.c (working copy) > > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, > > > > int > > > > /* The loads themselves. */ > > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > > > { > > > > - /* N scalar loads plus gathering them into a vector. > > > > - ??? scalar_to_vec isn't the cost for that. */ > > > > + /* N scalar loads plus gathering them into a vector. */ > > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE > > > > (stmt_info))); > > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > > } > > > > else > > > > vect_get_load_cost (first_dr, ncopies, > > > > > > > > > > > > > > > >