> I opened:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952
> 
> We shouldn't use the costs of moves as the costs of RTL expressions.   We can
> experiment with different RTL expression cost formulas.   But we need to separate
> costs of RTL expressions from costs of moves first.   What is the best way
> to partition processor_costs to avoid confusion between costs of moves vs.
> costs of RTL expressions?

I am still worried that splitting the cost and experimentally finding a
value which works well for SPEC2017 is not a very reliable solution here,
since the problematic decision is not only about store cost but also
about other factors.

What benchmarks besides x264 are sensitive to this?

Looking at x264, the problem is really simple: SLP vectorization of 8
integer stores into one AVX256 store, which is not a win on Core.
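For reference, the shape of the problem is just a group of adjacent
scalar int stores that basic-block SLP merges into one 256-bit store.
A minimal sketch of that pattern (hypothetical code, not the actual
x264 source; the array and function names are made up):

int ref[8];

void
set_refs (int a0, int a1, int a2, int a3, int a4, int a5, int a6, int a7)
{
  /* With -O3 -mavx2 the SLP vectorizer can turn these eight scalar
     stores into a single YMM store, after first building the vector
     from the eight GPR values.  */
  ref[0] = a0;
  ref[1] = a1;
  ref[2] = a2;
  ref[3] = a3;
  ref[4] = a4;
  ref[5] = a5;
  ref[6] = a6;
  ref[7] = a7;
}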
I wrote a simple microbenchmark that tests SLP-vectorized versus normal
stores (attached).  Results on Skylake are:

64bit
     float          2 SLP:    1.54
     float          2 no-SLP: 1.52
     float          2 def:    1.55
      char          8 SLP:    3.35
      char          8 no-SLP: 3.34
      char          8 def:    3.32
     short          4 SLP:    1.51
     short          4 no-SLP: 1.51
     short          4 def:    1.52
       int          2 SLP:    1.22
       int          2 no-SLP: 1.24
       int          2 def:    1.25
AVX128
     float          4 SLP:    1.51
     float          4 no-SLP: 1.81
     float          4 def:    1.54
    double          2 SLP:    1.51
    double          2 no-SLP: 1.53
    double          2 def:    1.55
      char         16 SLP:    6.31
      char         16 no-SLP: 8.31
      char         16 def:    6.33
     short          8 SLP:    3.91
     short          8 no-SLP: 3.33
     short          8 def:    3.92
       int          4 SLP:    2.12
       int          4 no-SLP: 1.51
       int          4 def:    1.56
 long long          2 SLP:    1.50
 long long          2 no-SLP: 1.21
 long long          2 def:    1.26

AVX256
     float          8 SLP:    2.11
     float          8 no-SLP: 2.70
     float          8 def:    2.13
    double          4 SLP:    1.83
    double          4 no-SLP: 1.80
    double          4 def:    1.82
      char         32 SLP:    12.72
      char         32 no-SLP: 17.28
      char         32 def:    12.71
     short         16 SLP:    6.32
     short         16 no-SLP: 8.77
     short         16 def:    6.20
       int          8 SLP:    3.93
       int          8 no-SLP: 3.31
       int          8 def:    3.33
 long long          4 SLP:    2.13
 long long          4 no-SLP: 1.52
 long long          4 def:    1.51

def is with the cost-model-based decision.
SLP seems a bad idea for:
 - 256-bit long long and int vectors
   (which I see are cured by your change in the cost table)
 - doubles (a little bit)
 - shorts for 128-bit vectors
   (I guess that would be cured if the 16-bit store cost was
    decreased a bit like you did for int)

For Zen we get:

64bit
     float          2 SLP:    2.22
     float          2 no-SLP: 2.23
     float          2 def:    2.23
      char          8 SLP:    4.08
      char          8 no-SLP: 4.08
      char          8 def:    4.08
     short          4 SLP:    2.22
     short          4 no-SLP: 2.23
     short          4 def:    2.23
       int          2 SLP:    1.86
       int          2 no-SLP: 1.87
       int          2 def:    1.86
AVX128
     float          4 SLP:    2.23
     float          4 no-SLP: 2.60
     float          4 def:    2.23
    double          2 SLP:    2.23
    double          2 no-SLP: 2.23
    double          2 def:    2.23
      char         16 SLP:    4.79
      char         16 no-SLP: 10.03
      char         16 def:    4.85
     short          8 SLP:    3.20
     short          8 no-SLP: 4.08
     short          8 def:    3.22
       int          4 SLP:    2.23
       int          4 no-SLP: 2.23
       int          4 def:    2.23
 long long          2 SLP:    1.86
 long long          2 no-SLP: 1.86
 long long          2 def:    1.87

So SLP is a win in general.
And for Bulldozer:

64bit
     float          2 SLP:    2.76
     float          2 no-SLP: 2.77
     float          2 def:    2.77
      char          8 SLP:    4.48
      char          8 no-SLP: 4.49
      char          8 def:    4.48
     short          4 SLP:    2.84
     short          4 no-SLP: 2.84
     short          4 def:    2.83
       int          2 SLP:    2.14
       int          2 no-SLP: 2.13
       int          2 def:    2.15
AVX128
     float          4 SLP:    2.59
     float          4 no-SLP: 3.07
     float          4 def:    2.59
    double          2 SLP:    2.48
    double          2 no-SLP: 2.49
    double          2 def:    2.48
      char         16 SLP:    30.33
      char         16 no-SLP: 11.72
      char         16 def:    30.30
     short          8 SLP:    21.04
     short          8 no-SLP: 4.62
     short          8 def:    21.06
       int          4 SLP:    4.29
       int          4 no-SLP: 2.84
       int          4 def:    4.30
 long long          2 SLP:    3.07
 long long          2 no-SLP: 2.14
 long long          2 def:    2.16

Here SLP is a major loss for integers and we get it all wrong.
This is because SLP for integers implies inter-unit moves, which are bad
on this chip.

Looking at the generated code, we seem to get constructor costs wrong.

SLP for float4 is generated as:
        vunpcklps       %xmm3, %xmm2, %xmm2
        vunpcklps       %xmm1, %xmm0, %xmm0
        vmovlhps        %xmm2, %xmm0, %xmm0
        vmovaps %xmm0, array(%rip)

While the vectorizer's cost model computes:
0x3050e50 a0_2(D) 1 times vec_construct costs 16 in prologue
0x3050e50 a0_2(D) 1 times vector_store costs 16 in body
0x3051030 a0_2(D) 1 times scalar_store costs 16 in body
0x3051030 a1_4(D) 1 times scalar_store costs 16 in body
0x3051030 a2_6(D) 1 times scalar_store costs 16 in body
0x3051030 a3_8(D) 1 times scalar_store costs 16 in body
testslp.C:70:1: note:  Cost model analysis: 
  Vector inside of basic block cost: 16
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar cost of basic block: 64

So it thinks that the vectorized sequence will take the same time as one
store.  This is the result of:

      case vec_construct:
        {
          /* N element inserts into SSE vectors.  */
          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          if (GET_MODE_BITSIZE (mode) == 256)
            cost += ix86_vec_cost (mode, ix86_cost->addss);
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
          return cost;
        }
So that is 4 * normal sse_op (latency 1) plus addss (latency 4), 8 cycles
overall, while an SSE store should be about 4 cycles.

This does not quite match reality.

For the integer version this is even less realistic, since we output 8
int->SSE moves followed by packing code.
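To illustrate what that sequence looks like, here is a sketch using
intrinsics (the vectorizer of course emits the equivalent RTL directly;
this is only for illustration):

#include <immintrin.h>

/* Building a V8SI from eight GPR values: each scalar needs a GPR->XMM
   move (vmovd/vpinsrd), and the two 128-bit halves are then combined
   with vinserti128.  That is considerably more than "N inserts" at
   sse_op cost each.  */
__m256i
construct_v8si (int a, int b, int c, int d, int e, int f, int g, int h)
{
  return _mm256_set_epi32 (h, g, f, e, d, c, b, a);
}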

The attached patch gets the number of instructions right, but it still
won't result in optimal scores in my microbenchmark.
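In case it helps with reproducing: the cost-model decisions quoted above
can be inspected with something like

  g++ -O3 -march=skylake -fdump-tree-slp-details testslp.C

and looking for the "Cost model analysis" lines in the slp dump (the
exact dump file name depends on the GCC version).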

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 272507)
+++ config/i386/i386.c  (working copy)
@@ -21130,15 +21132,38 @@ ix86_builtin_vectorization_cost (enum ve
 
       case vec_construct:
        {
-         /* N element inserts into SSE vectors.  */
-         int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
-         /* One vinserti128 for combining two SSE vectors for AVX256.  */
-         if (GET_MODE_BITSIZE (mode) == 256)
-           cost += ix86_vec_cost (mode, ix86_cost->addss);
-         /* One vinserti64x4 and two vinserti128 for combining SSE
-            and AVX256 vectors to AVX512.  */
-         else if (GET_MODE_BITSIZE (mode) == 512)
-           cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
+         int cost;
+         if (fp)
+             /* vunpcklps or vunpcklpd to move half of the values above
+                the other half.  */
+           cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op / 2;
+         else
+           /* Scalar values are usually converted from integer unit.
+              N/2 vmovs and N/2 vpinsrd  */
+           cost = TYPE_VECTOR_SUBPARTS (vectype)
+                  * COSTS_N_INSNS (ix86_cost->sse_to_integer / 2);
+         switch (TYPE_VECTOR_SUBPARTS (vectype))
+           {
+           case 2:
+              break;
+           case 4:
+              /* movhlps or vinsertf128.  */
+              cost += ix86_vec_cost (mode, ix86_cost->sse_op);
+              break;
+           case 8:
+              /* 2 vmovlhps + vinsertf128.  */
+              cost += ix86_vec_cost (mode, 3 * ix86_cost->sse_op);
+              break;
+           case 16:
+              cost += ix86_vec_cost (mode, 7 * ix86_cost->sse_op);
+              break;
+           case 32:
+              cost += ix86_vec_cost (mode, 15 * ix86_cost->sse_op);
+              break;
+           case 64:
+              cost += ix86_vec_cost (mode, 31 * ix86_cost->sse_op);
+              break;
+           }
          return cost;
        }
 
