Hi, this is proof of concept patch for vectorizer costs to use costs used for rtx_cost and register_move_cost which are readily available in ix86_costs instead of using its own set of random values. At least until we have proof of evidence that vectroizer costs needs to differ, I do not think we want to complicate CPU tuning by having them twice.
This is of course quite intrusive change to what we have becuase it affects all x86 targets. I have finally worked out that the "random" values used by AMD target corresponds to latencies of bdver1. I have benchmarked them on Zen and also temporarily patches Czerny (Haswel). It seems to cause no regression and quite nice improvements: - 27.3% for facerec on Zen - 7% for mgrid on Haswel - maybe 1% for galgel of Haswell - 3% for facerec on Haswell - maybe 1% aspi on Haswell - there may be small off-noise improvement for rnflow and regression for fatigue2 on Haswell So I would say that outcome is surprisingly good (especially due to lack of noteworthy regressions). I also know that vectorizer hurts performance on Zen and Mesa/tonto benchmarks which is not cured by this patch alone. There is testsuite fallout though. ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++11 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times) ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++14 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times) ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc -std=c++98 scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times) Here we vectorize the loop before first while originally we unrolled and SLP vectorized next ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_1.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_2.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_3.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_4.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_5.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_6.c scan-assembler-times vfmadd[123]+sd 56 (found 32 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_1.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_2.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_3.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_4.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_5.c scan-assembler-times vfmadd[123]+ss 120 (found 64 times) ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_6.c scan-assembler-times vfnmsub[123]+ss 120 (found 64 times) And friends, clearly we do not vectorize all loops, I did not look into details yet ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr61403.c scan-assembler blend Here again we vectorize loop while originally we did SLP. I am not sure why loop vectorizer does not use blend. ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79683.c scan-assembler-times padd 1 (found 0 times) Here we are supposed to vectorize two integer additions, but since generic cost model now claims that latency of vector add is twice of integer add we don't. I think it makes sense. ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79723.c scan-assembler mov[au]p.[ \t][^,]+, %gs: Similarly here. If it seems to make sense, I will clean it up (remove now unused entries and scale conditional costs by COSTS_N_INSNS) and fix the tessuite fallout. Honza Index: i386.c =================================================================== --- i386.c (revision 253824) +++ i386.c (working copy) @@ -44015,50 +44015,56 @@ static int ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost, tree vectype, int) { + bool fp = false; + if (vectype != NULL) + fp = FLOAT_TYPE_P (vectype); + switch (type_of_cost) { case scalar_stmt: - return ix86_cost->scalar_stmt_cost; + return fp ? ix86_cost->addss : COSTS_N_INSNS (1); case scalar_load: - return ix86_cost->scalar_load_cost; + return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0] + : ix86_cost->int_load [2]) / 2; case scalar_store: - return ix86_cost->scalar_store_cost; + return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0] + : ix86_cost->int_store [2]) / 2; case vector_stmt: - return ix86_cost->vec_stmt_cost; + return fp ? ix86_cost->addss : ix86_cost->sse_op; case vector_load: - return ix86_cost->vec_align_load_cost; + return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2; case vector_store: - return ix86_cost->vec_store_cost; + return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2; case vec_to_scalar: - return ix86_cost->vec_to_scalar_cost; - case scalar_to_vec: - return ix86_cost->scalar_to_vec_cost; + return ix86_cost->sse_op; case unaligned_load: - case unaligned_store: case vector_gather_load: + return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2; + + case unaligned_store: case vector_scatter_store: - return ix86_cost->vec_unalign_load_cost; + return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2; case cond_branch_taken: - return ix86_cost->cond_taken_branch_cost; + return COSTS_N_INSNS (ix86_cost->cond_taken_branch_cost); case cond_branch_not_taken: - return ix86_cost->cond_not_taken_branch_cost; + return COSTS_N_INSNS (ix86_cost->cond_not_taken_branch_cost); case vec_perm: case vec_promote_demote: - return ix86_cost->vec_stmt_cost; + return ix86_cost->sse_op; case vec_construct: - return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1); + return ix86_cost->sse_op * (TYPE_VECTOR_SUBPARTS (vectype) - 1); default: gcc_unreachable ();