On Mon, Jun 11, 2012 at 5:01 PM, William J. Schmidt <wschm...@linux.vnet.ibm.com> wrote: > > > On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote: >> On Mon, 11 Jun 2012, William J. Schmidt wrote: >> >> > >> > >> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: >> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt >> > > <wschm...@linux.vnet.ibm.com> wrote: >> > > > The fix for PR53331 caused a degradation to 187.facerec on >> > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the >> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. >> > > > Bootstrapped and regtested on that platform with no new regressions. >> > > > Ok >> > > > for trunk? >> > > >> > > Well, would the real cost not be subparts * scalar_to_vec plus >> > > subparts * vec_perm? >> > > At least vec_perm isn't the cost for building up a vector from N scalar >> > > elements >> > > either (it might be close enough if subparts == 2). What's the case >> > > with facerec >> > > here? Does it have subparts == 2? >> > >> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On >> > PowerPC, this requires two merge instructions and a permute instruction >> > to get the four 32-bit quantities into the right place in a 128-bit >> > register. Currently this is modeled as a vec_perm in other parts of the >> > vectorizer per Ira's earlier patches, so I naively changed this to do >> > the same thing. >> >> I see. >> >> > The types of vectorizer instructions aren't documented, and I can't >> > infer much from the i386.c cost model, so I need a little education. >> > What semantics are represented by scalar_to_vec? >> >> It's a vector splat, thus x -> { x, x, x, ... }. You can create >> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, >> as VEC_PERM_EXPR, takes two input vectors). That's by far not >> the most efficient way to build up such a vector of course (with AVX >> you could do one splat plus N - 1 inserts for example). The cost >> is of course dependent on the number of vector elements, so a >> simple new enum vect_cost_for_stmt kind does not cover it but >> the target would have to look at the vector type passed and do >> some reasonable guess. > > Ah, splat! Yes, that's lingo I understand. I see the intent now. > >> >> > On PowerPC, we have this mapping of the floating-point registers and >> > vector float registers where they overlap (the low-order half of each of >> > the first 32 vector float regs corresponds to a scalar float reg). So >> > in this case we have four scalar loads that place things in the bottom >> > half of four vector registers, two vector merge instructions that >> > collapse the four registers into two vector registers, and a vector >> > permute that gets things in the right order.(*) I wonder if what we >> > refer to as a merge instruction is similar to scalar_to_vec. >> >> Looks similar to x86 SSE then. >> >> > If so, then maybe we need something like >> > >> > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); >> > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; >> > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * >> > subparts / 2; >> > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >> > >> > But then we'd have to change how vec_perm is modeled elsewhere for >> > PowerPC based on Ira's earlier patches. As I said, it's difficult for >> > me to figure out all the intent of cost model decisions that have been >> > made in the past, using current documentation. >> >> Heh, usually the intent was to make the changes simple, not to compute >> a proper cost. >> >> I think we simply need a new scalars_to_vec cost kind. > > That works. Maybe vec_construct gets the point across a little better? > I think we need to use the full builtin_vectorization_cost interface > instead of vect_get_stmt_cost here, so the targets can parameterize on > type. Then we can just do one cost calculation for vec_construct that > covers the full costs of getting the vector in order after the loads.
Yes, vec_construct sounds good to me. >> >> > > I really wanted to pessimize this case >> > > for say AVX and char elements, thus building up a vector from 32 scalars >> > > which >> > > certainly does not cost a mere vec_perm. So, maybe special-case the >> > > subparts == 2 case and assume vec_perm would match the cost only in that >> > > case. >> > >> > (I'm a little confused by this as what you have at the moment is a >> > single scalar_to_vec per copy, which has a cost of 1 on most i386 >> > targets (occasionally 2). The subparts multiplier is only applied to >> > the loads. So changing this to vec_perm seemed to be a no-op for i386.) >> >> Oh, I somehow read the patch as you were removing the multiplication >> by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted >> to reflect it with N scalar loads plus N splats plus N - 1 permutes >> originally. You could also model it with N scalar loads plus N >> inserts (but we don't have a vec_insert cost either). I think adding >> a scalars_to_vec or vec_init or however we want to call it - basically >> what the cost of a vector CONSTRUCTOR would be - is best. >> >> > (*) There are actually a couple more instructions here to convert 64-bit >> > values to 32-bit values, since on PowerPC 32-bit loads are converted to >> > 64-bit values in scalar float registers and they have to be coerced back >> > to 32-bit. Very ugly. The cost model currently doesn't represent this >> > at all, which I'll have to look at fixing at some point in some way that >> > isn't too nasty for the other targets. The cost model for PowerPC seems >> > to need a lot of TLC. >> >> Maybe the above would be a possible way to do it. On x86 for example >> a vector of two doubles is extremely cheap for generic SSE2 to construct, >> we can directly load into the low/high part, thus when we use it as >> N scalar loads plus the vector-build the vector-build would be free. > > Absolutely. As long as the vector type is available, this can be built > into vec_construct's logic in the target. I will feel much more > comfortable having better estimation for the ugly 32-bit case. Indeed. Richard. > Thanks, > Bill > >> >> Thanks, >> Richard. >> >> > Thanks, >> > Bill >> > >> > > >> > > Thanks, >> > > Richard. >> > > >> > > > Thanks, >> > > > Bill >> > > > >> > > > >> > > > 2012-06-10 Bill Schmidt <wschm...@linux.ibm.com> >> > > > >> > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model >> > > > for strided loads. >> > > > >> > > > >> > > > Index: gcc/tree-vect-stmts.c >> > > > =================================================================== >> > > > --- gcc/tree-vect-stmts.c (revision 188341) >> > > > +++ gcc/tree-vect-stmts.c (working copy) >> > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, >> > > > int >> > > > /* The loads themselves. */ >> > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) >> > > > { >> > > > - /* N scalar loads plus gathering them into a vector. >> > > > - ??? scalar_to_vec isn't the cost for that. */ >> > > > + /* N scalar loads plus gathering them into a vector. */ >> > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies >> > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE >> > > > (stmt_info))); >> > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); >> > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); >> > > > } >> > > > else >> > > > vect_get_load_cost (first_dr, ncopies, >> > > > >> > > > >> > > >> > >> > >> >