> I opened:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952
>
> We shouldn't use costs for moves for costs of RTL expressions.  We can
> experiment with different RTL expression cost formulas, but we need to
> separate costs of RTL expressions from costs for moves first.  What is
> the best way to partition processor_costs to avoid confusion between
> costs of moves vs. costs of RTL expressions?
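Just so we are looking at the same thing, I read the proposal roughly as the following regrouping.  This is only an illustrative sketch: the nested structs and their names do not exist anywhere, and the few field names are just borrowed from the current processor_costs for flavour.

/* Hypothetical regrouping of processor_costs, for illustration only,
   so that callers cannot accidentally use a move cost where an RTL
   expression cost is meant.  */
struct rtl_expr_costs
{
  int sse_op;		/* cost of a simple SSE instruction.  */
  int addss;		/* cost of ADDSS/ADDSD.  */
  /* ... other values consumed by rtx_cost and the vectorizer hooks.  */
};

struct move_costs
{
  int sse_load;		/* cost of SSE loads.  */
  int sse_store;	/* cost of SSE stores.  */
  int sse_to_integer;	/* cost of moving an SSE reg to an integer reg.  */
  /* ... other values consumed by the register allocator and
     memory_move_cost.  */
};

struct processor_costs
{
  struct rtl_expr_costs expr;	/* only for costing RTL expressions.  */
  struct move_costs move;	/* only for costing moves.  */
};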
I am still worried that splitting the cost and experimentally finding a value which works well for SPEC2017 is not a very reliable solution here, since the problematic decisions are not only about store cost but also about other factors.  What benchmarks besides x264 are sensitive to this?

Looking at x264, the problem is really simple: SLP vectorization of 8 integer stores into one AVX256 store, which is not a win on Core.  I wrote a simple microbenchmark that tests SLP-vectorized versus normal stores (attached; an illustrative sketch of its shape is at the end of this mail).  Results on Skylake are:

64bit
  float 2       SLP: 1.54   no-SLP: 1.52   def: 1.55
  char 8        SLP: 3.35   no-SLP: 3.34   def: 3.32
  short 4       SLP: 1.51   no-SLP: 1.51   def: 1.52
  int 2         SLP: 1.22   no-SLP: 1.24   def: 1.25
AVX128
  float 4       SLP: 1.51   no-SLP: 1.81   def: 1.54
  double 2      SLP: 1.51   no-SLP: 1.53   def: 1.55
  char 16       SLP: 6.31   no-SLP: 8.31   def: 6.33
  short 8       SLP: 3.91   no-SLP: 3.33   def: 3.92
  int 4         SLP: 2.12   no-SLP: 1.51   def: 1.56
  long long 2   SLP: 1.50   no-SLP: 1.21   def: 1.26
AVX256
  float 8       SLP: 2.11   no-SLP: 2.70   def: 2.13
  double 4      SLP: 1.83   no-SLP: 1.80   def: 1.82
  char 32       SLP: 12.72  no-SLP: 17.28  def: 12.71
  short 16      SLP: 6.32   no-SLP: 8.77   def: 6.20
  int 8         SLP: 3.93   no-SLP: 3.31   def: 3.33
  long long 4   SLP: 2.13   no-SLP: 1.52   def: 1.51

def is with the cost-model-based decision.  SLP seems a bad idea for:
 - 256-bit long long and int vectors (which I see are cured by your change
   in the cost table)
 - doubles (a little bit)
 - shorts for 128-bit vectors (I guess that would be cured if the 16-bit
   store cost was decreased a bit like you did for int)

For Zen we get:

64bit
  float 2       SLP: 2.22   no-SLP: 2.23   def: 2.23
  char 8        SLP: 4.08   no-SLP: 4.08   def: 4.08
  short 4       SLP: 2.22   no-SLP: 2.23   def: 2.23
  int 2         SLP: 1.86   no-SLP: 1.87   def: 1.86
AVX128
  float 4       SLP: 2.23   no-SLP: 2.60   def: 2.23
  double 2      SLP: 2.23   no-SLP: 2.23   def: 2.23
  char 16       SLP: 4.79   no-SLP: 10.03  def: 4.85
  short 8       SLP: 3.20   no-SLP: 4.08   def: 3.22
  int 4         SLP: 2.23   no-SLP: 2.23   def: 2.23
  long long 2   SLP: 1.86   no-SLP: 1.86   def: 1.87

so SLP is a win in general, and for Bulldozer:

64bit
  float 2       SLP: 2.76   no-SLP: 2.77   def: 2.77
  char 8        SLP: 4.48   no-SLP: 4.49   def: 4.48
  short 4       SLP: 2.84   no-SLP: 2.84   def: 2.83
  int 2         SLP: 2.14   no-SLP: 2.13   def: 2.15
AVX128
  float 4       SLP: 2.59   no-SLP: 3.07   def: 2.59
  double 2      SLP: 2.48   no-SLP: 2.49   def: 2.48
  char 16       SLP: 30.33  no-SLP: 11.72  def: 30.30
  short 8       SLP: 21.04  no-SLP: 4.62   def: 21.06
  int 4         SLP: 4.29   no-SLP: 2.84   def: 4.30
  long long 2   SLP: 3.07   no-SLP: 2.14   def: 2.16

Here SLP is a major loss for integers and we get it all wrong.  This is because SLP for integers implies an inter-unit move, which is bad on this chip.  Looking at the generated code, we seem to get the constructor costs wrong.
SLP for float4 is generated as:

	vunpcklps	%xmm3, %xmm2, %xmm2
	vunpcklps	%xmm1, %xmm0, %xmm0
	vmovlhps	%xmm2, %xmm0, %xmm0
	vmovaps	%xmm0, array(%rip)

while the vectorizer does:

0x3050e50 a0_2(D) 1 times vec_construct costs 16 in prologue
0x3050e50 a0_2(D) 1 times vector_store costs 16 in body
0x3051030 a0_2(D) 1 times scalar_store costs 16 in body
0x3051030 a1_4(D) 1 times scalar_store costs 16 in body
0x3051030 a2_6(D) 1 times scalar_store costs 16 in body
0x3051030 a3_8(D) 1 times scalar_store costs 16 in body
testslp.C:70:1: note: Cost model analysis:
  Vector inside of basic block cost: 16
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

So it thinks that the vectorized sequence will take the same time as one store.  This is the result of:

      case vec_construct:
	{
	  /* N element inserts into SSE vectors.  */
	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
	  if (GET_MODE_BITSIZE (mode) == 256)
	    cost += ix86_vec_cost (mode, ix86_cost->addss);
	  /* One vinserti64x4 and two vinserti128 for combining SSE
	     and AVX256 vectors to AVX512.  */
	  else if (GET_MODE_BITSIZE (mode) == 512)
	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
	  return cost;
	}

So 4 * normal sse_op (latency 1) plus addss (latency 4), 8 cycles overall, while the SSE store should be 4 cycles.  This does not quite match reality.  For the integer version this is even less realistic, since we output 8 int->SSE moves followed by packing code.

The attached patch gets the number of instructions right, but it still won't result in optimal scores in my microbenchmark.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 272507)
+++ config/i386/i386.c	(working copy)
@@ -21130,15 +21132,38 @@ ix86_builtin_vectorization_cost (enum ve
       case vec_construct:
 	{
-	  /* N element inserts into SSE vectors.  */
-	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
-	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
-	  if (GET_MODE_BITSIZE (mode) == 256)
-	    cost += ix86_vec_cost (mode, ix86_cost->addss);
-	  /* One vinserti64x4 and two vinserti128 for combining SSE
-	     and AVX256 vectors to AVX512.  */
-	  else if (GET_MODE_BITSIZE (mode) == 512)
-	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
+	  int cost;
+	  if (fp)
+	    /* vunpcklps or vunpcklpd to move half of the values above
+	       the other half.  */
+	    cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op / 2;
+	  else
+	    /* Scalar values are usually converted from integer unit.
+	       N/2 vmovs and N/2 vpinsrd  */
+	    cost = TYPE_VECTOR_SUBPARTS (vectype)
+		   * COSTS_N_INSNS (ix86_cost->sse_to_integer / 2);
+	  switch (TYPE_VECTOR_SUBPARTS (vectype))
+	    {
+	    case 2:
+	      break;
+	    case 4:
+	      /* movhlps or vinsertf128.  */
+	      cost += ix86_vec_cost (mode, ix86_cost->sse_op);
+	      break;
+	    case 8:
+	      /* 2 vmovlhps + vinsertf128.  */
+	      cost += ix86_vec_cost (mode, 3 * ix86_cost->sse_op);
+	      break;
+	    case 16:
+	      cost += ix86_vec_cost (mode, 7 * ix86_cost->sse_op);
+	      break;
+	    case 32:
+	      cost += ix86_vec_cost (mode, 15 * ix86_cost->sse_op);
+	      break;
+	    case 64:
+	      cost += ix86_vec_cost (mode, 31 * ix86_cost->sse_op);
+	      break;
+	    }
 	  return cost;
 	}
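
For reference, here is an illustrative sketch of the shape of one of the microbenchmark tests ("float 4").  The real testslp.C is only attached, so the names, iteration count and timing harness below are made up; the point is just what "SLP" vs. "no-SLP" measures: a function storing N scalar arguments into a global array, which basic-block SLP can merge into a single vector store (this is exactly the a0..a3 scalar-store vs. vec_construct + vector_store pattern in the dump above).

/* Illustrative only; not the attached testslp.C.  At -O3 the four
   stores in store4 () can be SLP-vectorized into one 128-bit store;
   with -fno-tree-slp-vectorize they stay as four scalar stores.  */

float array[4];

__attribute__ ((noinline))
void
store4 (float a0, float a1, float a2, float a3)
{
  array[0] = a0;
  array[1] = a1;
  array[2] = a2;
  array[3] = a3;
}

int
main (void)
{
  for (long i = 0; i < 1000000000; i++)
    store4 (i, i + 1, i + 2, i + 3);
  return array[0] > 0;
}

Timing a build with -O3 -fno-tree-slp-vectorize against a plain -O3 build gives the "no-SLP" vs. "def" comparison; "SLP" presumably corresponds to forcing the vectorization regardless of the cost model (e.g. with -fvect-cost-model=unlimited).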