On Mon, 11 Jun 2012, William J. Schmidt wrote: > > > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote: > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt > > <wschm...@linux.vnet.ibm.com> wrote: > > > The fix for PR53331 caused a degradation to 187.facerec on > > > powerpc64-unknown-linux-gnu. The following simple patch reverses the > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006. > > > Bootstrapped and regtested on that platform with no new regressions. Ok > > > for trunk? > > > > Well, would the real cost not be subparts * scalar_to_vec plus > > subparts * vec_perm? > > At least vec_perm isn't the cost for building up a vector from N scalar > > elements > > either (it might be close enough if subparts == 2). What's the case > > with facerec > > here? Does it have subparts == 2? > > In this case, subparts == 4 (32-bit floats, 128-bit vec reg). On > PowerPC, this requires two merge instructions and a permute instruction > to get the four 32-bit quantities into the right place in a 128-bit > register. Currently this is modeled as a vec_perm in other parts of the > vectorizer per Ira's earlier patches, so I naively changed this to do > the same thing.
I see. > The types of vectorizer instructions aren't documented, and I can't > infer much from the i386.c cost model, so I need a little education. > What semantics are represented by scalar_to_vec? It's a vector splat, thus x -> { x, x, x, ... }. You can create { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute, as VEC_PERM_EXPR, takes two input vectors). That's by far not the most efficient way to build up such a vector of course (with AVX you could do one splat plus N - 1 inserts for example). The cost is of course dependent on the number of vector elements, so a simple new enum vect_cost_for_stmt kind does not cover it but the target would have to look at the vector type passed and do some reasonable guess. > On PowerPC, we have this mapping of the floating-point registers and > vector float registers where they overlap (the low-order half of each of > the first 32 vector float regs corresponds to a scalar float reg). So > in this case we have four scalar loads that place things in the bottom > half of four vector registers, two vector merge instructions that > collapse the four registers into two vector registers, and a vector > permute that gets things in the right order.(*) I wonder if what we > refer to as a merge instruction is similar to scalar_to_vec. Looks similar to x86 SSE then. > If so, then maybe we need something like > > subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)); > inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts; > inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / > 2; > inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > But then we'd have to change how vec_perm is modeled elsewhere for > PowerPC based on Ira's earlier patches. As I said, it's difficult for > me to figure out all the intent of cost model decisions that have been > made in the past, using current documentation. Heh, usually the intent was to make the changes simple, not to compute a proper cost. I think we simply need a new scalars_to_vec cost kind. > > I really wanted to pessimize this case > > for say AVX and char elements, thus building up a vector from 32 scalars > > which > > certainly does not cost a mere vec_perm. So, maybe special-case the > > subparts == 2 case and assume vec_perm would match the cost only in that > > case. > > (I'm a little confused by this as what you have at the moment is a > single scalar_to_vec per copy, which has a cost of 1 on most i386 > targets (occasionally 2). The subparts multiplier is only applied to > the loads. So changing this to vec_perm seemed to be a no-op for i386.) Oh, I somehow read the patch as you were removing the multiplication by TYPE_VECTOR_SUBPARTS. But yes, the cost is way off and I'd wanted to reflect it with N scalar loads plus N splats plus N - 1 permutes originally. You could also model it with N scalar loads plus N inserts (but we don't have a vec_insert cost either). I think adding a scalars_to_vec or vec_init or however we want to call it - basically what the cost of a vector CONSTRUCTOR would be - is best. > (*) There are actually a couple more instructions here to convert 64-bit > values to 32-bit values, since on PowerPC 32-bit loads are converted to > 64-bit values in scalar float registers and they have to be coerced back > to 32-bit. Very ugly. The cost model currently doesn't represent this > at all, which I'll have to look at fixing at some point in some way that > isn't too nasty for the other targets. The cost model for PowerPC seems > to need a lot of TLC. Maybe the above would be a possible way to do it. On x86 for example a vector of two doubles is extremely cheap for generic SSE2 to construct, we can directly load into the low/high part, thus when we use it as N scalar loads plus the vector-build the vector-build would be free. Thanks, Richard. > Thanks, > Bill > > > > > Thanks, > > Richard. > > > > > Thanks, > > > Bill > > > > > > > > > 2012-06-10 Bill Schmidt <wschm...@linux.ibm.com> > > > > > > * tree-vect-stmts.c (vect_model_load_cost): Change cost model > > > for strided loads. > > > > > > > > > Index: gcc/tree-vect-stmts.c > > > =================================================================== > > > --- gcc/tree-vect-stmts.c (revision 188341) > > > +++ gcc/tree-vect-stmts.c (working copy) > > > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int > > > /* The loads themselves. */ > > > if (STMT_VINFO_STRIDE_LOAD_P (stmt_info)) > > > { > > > - /* N scalar loads plus gathering them into a vector. > > > - ??? scalar_to_vec isn't the cost for that. */ > > > + /* N scalar loads plus gathering them into a vector. */ > > > inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies > > > * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE > > > (stmt_info))); > > > - inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec); > > > + inside_cost += ncopies * vect_get_stmt_cost (vec_perm); > > > } > > > else > > > vect_get_load_cost (first_dr, ncopies, > > > > > > > > > > -- Richard Guenther <rguent...@suse.de> SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer