Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel

Jan Hubicka Thu, 19 Oct 2017 00:17:43 -0700

Hi,
this is a proof-of-concept patch that makes the vectorizer cost model use the
costs already used for rtx_cost and register_move_cost, which are readily
available in ix86_costs, instead of its own set of random values.  At least
until we have evidence that the vectorizer costs need to differ, I do not
think we want to complicate CPU tuning by having them twice.
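
As a rough illustration of the idea (not part of the patch), the sketch below
derives a few vectorizer costs from one shared table.  The struct toy_costs,
the enum toy_stmt and the function toy_vectorization_cost are hypothetical
names, and the table values used with them are made up.  COSTS_N_INSNS is
redefined locally the way rtl.h defines it.  The division by 2 is copied from
the patch; the assumption here is that the load/store entries are expressed
relative to a register move cost of 2.

```c
#include <assert.h>

/* Same definition as in GCC's rtl.h: one instruction is 4 cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical, trimmed-down stand-in for the per-CPU cost table.
   Field names follow the patch; the layout is invented.  */
struct toy_costs
{
  int int_load[3];  /* Integer loads; assumed relative to reg-reg move (2).  */
  int sse_load[3];  /* SSE loads; assumed to use the same scale.  */
  int addss;        /* FP add, already in COSTS_N_INSNS units.  */
  int sse_op;       /* Cheap SSE op, already in COSTS_N_INSNS units.  */
};

/* Hypothetical subset of the statement kinds handled by the hook.  */
enum toy_stmt
{
  TOY_SCALAR_STMT,
  TOY_SCALAR_LOAD,
  TOY_VECTOR_STMT,
  TOY_VECTOR_LOAD
};

/* Map a statement kind to a cost taken from the shared table,
   following the expressions used in the patch.  */
static int
toy_vectorization_cost (const struct toy_costs *c, enum toy_stmt kind, int fp)
{
  switch (kind)
    {
    case TOY_SCALAR_STMT:
      return fp ? c->addss : COSTS_N_INSNS (1);
    case TOY_SCALAR_LOAD:
      return COSTS_N_INSNS (fp ? c->sse_load[0] : c->int_load[2]) / 2;
    case TOY_VECTOR_STMT:
      return fp ? c->addss : c->sse_op;
    case TOY_VECTOR_LOAD:
      return COSTS_N_INSNS (c->sse_load[2]) / 2;
    }
  assert (!"unknown statement kind");
  return 0;
}
```

With a made-up table where int_load[2] is 4 and sse_load[2] is 12, a scalar
integer load costs COSTS_N_INSNS (4) / 2 and a vector load costs
COSTS_N_INSNS (12) / 2, so both come from the same numbers the rest of the
backend already uses.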


This is of course quite an intrusive change to what we have because it affects
all x86 targets.  I have finally worked out that the "random" values used by
the AMD targets correspond to the latencies of bdver1.

I have benchmarked them on Zen and also temporarily patched Czerny (Haswell).
It seems to cause no regressions and quite nice improvements:
  - 27.3% for facerec on Zen
  - 7% for mgrid on Haswell
  - maybe 1% for galgel on Haswell
  - 3% for facerec on Haswell
  - maybe 1% for apsi on Haswell
  - there may be a small off-noise improvement for rnflow and a regression
    for fatigue2 on Haswell

So I would say that the outcome is surprisingly good (especially due to the
lack of noteworthy regressions).  I also know that the vectorizer hurts
performance on Zen and Mesa/tonto benchmarks, which is not cured by this patch
alone.

There is testsuite fallout though.

./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++11  scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++14  scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++98  scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)

  Here we now vectorize the loop first, while originally we unrolled it and
  SLP vectorized afterwards.

./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_1.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_2.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_3.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_4.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_5.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_6.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_1.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_2.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_3.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_4.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_5.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_6.c scan-assembler-times vfnmsub[123]+ss 120 (found 64 times)

And friends.  Clearly we do not vectorize all loops; I have not looked into
the details yet.

./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr61403.c scan-assembler blend

Here again we vectorize the loop while originally we did SLP.  I am not sure
why the loop vectorizer does not use blend.

./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79683.c scan-assembler-times padd 1 (found 0 times)

Here we are supposed to vectorize two integer additions, but since the generic
cost model now claims that the latency of a vector add is twice that of an
integer add, we don't.  I think it makes sense.
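
To spell out the arithmetic, here is a minimal sketch of the comparison.  It
looks only at the additions and ignores loads, stores and any setup cost, so
it is a simplification of what the vectorizer really computes.  The helper
toy_slp_profitable is a hypothetical name, and the rule "vectorize only when
strictly cheaper" is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same definition as in GCC's rtl.h: one instruction is 4 cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: compare the cost of N_SCALAR_ADDS scalar additions
   against N_VECTOR_ADDS vector additions.  Only arithmetic is counted;
   loads, stores and prologue/epilogue costs are deliberately left out.  */
static bool
toy_slp_profitable (int n_scalar_adds, int scalar_add_cost,
                    int n_vector_adds, int vector_add_cost)
{
  assert (n_scalar_adds > 0 && n_vector_adds > 0);

  int scalar_cost = n_scalar_adds * scalar_add_cost;
  int vector_cost = n_vector_adds * vector_add_cost;

  /* Assumed rule: vectorize only when the vector form is strictly cheaper.  */
  return vector_cost < scalar_cost;
}
```

With two integer adds at COSTS_N_INSNS (1) each and one vector add at
COSTS_N_INSNS (2), both sides come out equal, so under this simplified rule
the pair is left scalar.  With four scalar adds replaced by one vector add the
vector form wins.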

./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79723.c scan-assembler mov[au]p.[ \t][^,]+, %gs:

Similarly here.

If it seems to make sense, I will clean it up (remove the now unused entries
and scale the conditional costs by COSTS_N_INSNS) and fix the testsuite
fallout.

Honza

Index: i386.c
===================================================================
--- i386.c      (revision 253824)
+++ i386.c      (working copy)
@@ -44015,50 +44015,56 @@ static int
 ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                  tree vectype, int)
 {
+  bool fp = false;
+  if (vectype != NULL)
+    fp = FLOAT_TYPE_P (vectype);
+
   switch (type_of_cost)
     {
       case scalar_stmt:
-        return ix86_cost->scalar_stmt_cost;
+        return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
 
       case scalar_load:
-        return ix86_cost->scalar_load_cost;
+        return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
+                             : ix86_cost->int_load [2]) / 2;
 
       case scalar_store:
-        return ix86_cost->scalar_store_cost;
+        return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
+                             : ix86_cost->int_store [2]) / 2;
 
       case vector_stmt:
-        return ix86_cost->vec_stmt_cost;
+        return fp ? ix86_cost->addss : ix86_cost->sse_op;
 
       case vector_load:
-        return ix86_cost->vec_align_load_cost;
+        return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
 
       case vector_store:
-        return ix86_cost->vec_store_cost;
+        return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
 
       case vec_to_scalar:
-        return ix86_cost->vec_to_scalar_cost;
-
       case scalar_to_vec:
-        return ix86_cost->scalar_to_vec_cost;
+        return ix86_cost->sse_op;
 
       case unaligned_load:
-      case unaligned_store:
       case vector_gather_load:
+        return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
+
+      case unaligned_store:
       case vector_scatter_store:
-        return ix86_cost->vec_unalign_load_cost;
+        return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
 
       case cond_branch_taken:
-        return ix86_cost->cond_taken_branch_cost;
+        return COSTS_N_INSNS (ix86_cost->cond_taken_branch_cost);
 
       case cond_branch_not_taken:
-        return ix86_cost->cond_not_taken_branch_cost;
+        return COSTS_N_INSNS (ix86_cost->cond_not_taken_branch_cost);
 
       case vec_perm:
       case vec_promote_demote:
-        return ix86_cost->vec_stmt_cost;
+        return ix86_cost->sse_op;
 
       case vec_construct:
-       return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
+       return ix86_cost->sse_op * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
 
       default:
         gcc_unreachable ();
