Hi, PR80695 identifies a case (similar to several others we've seen) where SLP vectorization is too aggressive about vectorizing stores. The problem is that we undervalue the cost of a vec_construct operation. vec_construct is the vectorizer's representation for building a vector from scalar elements. When we construct an integer vector type from its constituent parts, it requires a direct move from two GPRs (one instruction on P9, two direct moves and a merge on P8). The high cost of this is not reflected in the current cost calculation, which only counts the cost of combining the elements using N-1 inserts. This patch provides a higher estimate that is closer to reality. Note that all cost estimation for vectorization is a bit rough, so this should be viewed as a heuristic.
The patch treats all integer vectors separately from the default case. There is already special handling for V4SFmode, so this leaves only V2DFmode in the default case. It was previously established heuristically that a cost factor of 2 was appropriate for V2DFmode, so that is left unchanged here; but since V2DFmode is the only default, we can simplify the calculation to just return 2. Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions. Is this ok for trunk? Thanks, Bill [gcc] 2017-05-11 Bill Schmidt <wschm...@linux.vnet.ibm.com> PR target/80695 * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Account for direct move costs for vec_construct of integer vectors. [gcc/testsuite] 2017-05-11 Bill Schmidt <wschm...@linux.vnet.ibm.com> PR target/80695 * gcc.target/powerpc/pr80695-p8.c: New file. * gcc.target/powerpc/pr80695-p9.c: New file. Index: gcc/config/rs6000/rs6000.c =================================================================== --- gcc/config/rs6000/rs6000.c (revision 247809) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -5849,8 +5849,20 @@ rs6000_builtin_vectorization_cost (enum vect_cost_ if (SCALAR_FLOAT_TYPE_P (elem_type) && TYPE_PRECISION (elem_type) == 32) return 5; + /* On POWER9, integer vector types are built up in GPRs and then + use a direct move (2 cycles). For POWER8 this is even worse, + as we need two direct moves and a merge, and the direct moves + are five cycles. */ + else if (INTEGRAL_TYPE_P (elem_type)) + { + if (TARGET_P9_VECTOR) + return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 2; + else + return TYPE_VECTOR_SUBPARTS (vectype) - 1 + 11; + } else - return max (2, TYPE_VECTOR_SUBPARTS (vectype) - 1); + /* V2DFmode doesn't need a direct move. */ + return 2; default: gcc_unreachable (); Index: gcc/testsuite/gcc.target/powerpc/pr80695-p8.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/pr80695-p8.c (nonexistent) +++ gcc/testsuite/gcc.target/powerpc/pr80695-p8.c (working copy) @@ -0,0 +1,18 @@ +/* { dg-do compile { target { powerpc*-*-* } } } */ +/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-options "-mcpu=power8 -O3 -fdump-tree-slp-details" } */ + +/* PR80695: Verify cost model for vec_construct on POWER8. */ + +long a[10] __attribute__((aligned(16))); + +void foo (long i, long j, long k, long l) +{ + a[6] = i; + a[7] = j; + a[8] = k; + a[9] = l; +} + +/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */ Index: gcc/testsuite/gcc.target/powerpc/pr80695-p9.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/pr80695-p9.c (nonexistent) +++ gcc/testsuite/gcc.target/powerpc/pr80695-p9.c (working copy) @@ -0,0 +1,18 @@ +/* { dg-do compile { target { powerpc*-*-* } } } */ +/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-options "-mcpu=power9 -O3 -fdump-tree-slp-details" } */ + +/* PR80695: Verify cost model for vec_construct on POWER9. */ + +long a[10] __attribute__((aligned(16))); + +void foo (long i, long j, long k, long l) +{ + a[6] = i; + a[7] = j; + a[8] = k; + a[9] = l; +} + +/* { dg-final { scan-tree-dump-times "vectorization is not profitable" 1 "slp2" } } */