On Mon, Aug 11, 2025 at 8:57 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Sun, 10 Aug 2025, liuhongt wrote:
>
> > >
> > > The comment doesn't match the bool type.
> > >
> > Fixed.
> >
> > >
> > > is_gimple_assign (stmt_info->stmt)
> > >
> > Changed.
> >
> > > There's also SAD_EXPR?  The vectorizer has lane_reducing_op_p ()
> > > for this that also lists WIDEN_SUM_EXPR.
> > Added SAD_EXPR since x86 supports usad{v16qi, v32qi, v64qi}.
> > WIDEN_SUM_EXPR is not added since x86 doesn't support the optab.
> >
> > > is issue rate a good measure here?  I think for the given
> > > operation it's more like the number of ops that can be
> > > issued in parallel (like 2 for FMA) times the latency
> > > (like 3), thus the number of ops that can be in flight?
> > Yes, it's better to have instruction latency times throughput.
> > The patch adds a new member to processor_costs: reduc_lat_mult_thr,
> > which should be latency times throughput.
> > i.e.
> > For FMA, latency is 4, throughput is 2, so reduc_lat_mult_thr is 8;
> > if there's 1 FMA for reduction then the unroll factor is 8 / 1 = 8.
> >
> > There's also a vect_unroll_limit; the final suggested_unroll_factor is
> > set as MIN (vect_unroll_limit, 8).
> > The vect_unroll_limit is mainly for register pressure, to avoid too many
> > spills.
> > Ideally, all instructions in the vectorized loop should be used to
> > determine unroll_factor with their (latency * throughput) / number,
> > but that would be too much for this patch, and may just be GIGO, so the
> > patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
> > SAD_EXPR.
> >
> > For the latest AMD/Intel processors, reduc_lat_mult_thr is set according
> > to the instruction tables by Agner Fog,
> > i.e.
> >
> > reduc_lat_mult_thr of Zen4 is set as
> > (8, 8, 6)
> >
> > FMA: latency is 4 cycles, throughput is 2.
> > VPDPBUSD: latency is 4 cycles, throughput is 2.
> > VPSADBW: latency is 3 cycles, throughput is 2.
> >
> > reduc_lat_mult_thr of SPR is set as
> > (8, 10, 3)
> >
> > FMA: latency is 4 cycles, throughput is 2.
> > VPDPBUSD: latency is 5 cycles, throughput is 2.
> > VPSADBW: latency is 3 cycles, throughput is 1.
> >
> > > ix86_issue_rate should only be a very rough approximation of
> > > that?  I suppose we should have separate tuning entries
> > > for this, like one for the number of FMAC units and the
> > > FMAC units (best case) latency?  As for SAD_EXPR that
> > > would be an integer op, that probably goes to a different
> > > pipeline unit.
> > >
> > > In general we should have a look at register pressure, I
> > > suppose issue_rate / m_num_reductions ensures we're never
> > > getting close to this in practice.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> This looks reasonable from my side now.  Please give Honza the
> chance to chime in.
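
The rule described above can be written down as a small standalone sketch. This is not the GCC code: reduc_kind, ceil_div, round_up_pow2 and suggested_unroll_factor are hypothetical names chosen for illustration, and the final round-up to a power of two is taken from the `1 << ceil_log2 (unroll_factor)` line in the diff rather than from the description.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical stand-in for the patch's X86_REDUC_* enum.
enum reduc_kind { REDUC_FMA, REDUC_DOT_PROD, REDUC_SAD, REDUC_LAST };

// Ceiling division, in the spirit of GCC's CEIL macro.
static unsigned
ceil_div (unsigned a, unsigned b)
{
  assert (b != 0);
  return (a + b - 1) / b;
}

// Round up to the next power of two, like 1 << ceil_log2 (x).
static unsigned
round_up_pow2 (unsigned x)
{
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// lat_mult_thr[i]: latency times throughput for reduction kind i.
// num_reduc[i]:    how many reductions of kind i the loop body has.
// limit:           unroll cap, meant to keep register pressure down.
static unsigned
suggested_unroll_factor (const std::array<unsigned, REDUC_LAST> &lat_mult_thr,
                         const std::array<unsigned, REDUC_LAST> &num_reduc,
                         unsigned limit)
{
  bool any_reduc = false;
  for (unsigned n : num_reduc)
    any_reduc |= (n != 0);

  // No counted reduction, or unrolling disabled by the limit.
  if (!any_reduc || limit <= 1)
    return 1;

  // Start from the limit and take the minimum over all reduction kinds.
  unsigned factor = limit;
  for (int i = 0; i != REDUC_LAST; i++)
    if (num_reduc[i])
      factor = std::min (factor, ceil_div (lat_mult_thr[i], num_reduc[i]));

  return round_up_pow2 (factor);
}
```

With the Zen4 triple (8, 8, 6) and a single FMA reduction, a limit of 4 gives 4 and a limit of 16 gives 8; two FMA reductions under a limit of 16 give 4. One consequence of the round-up, under the same assumptions: a single SAD reduction on Zen4 with a limit of 16 yields 8 rather than 6.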
Any comments, Honza?
>
> Thanks,
> Richard.
>
> >
> > The patch unrolls the vectorized loop when there are
> > FMA/DOT_PROD_EXPR/SAD_EXPR reductions.  Unrolling breaks the
> > cross-iteration dependence and enables more parallelism (since the
> > vectorizer will also enable partial sums).
> >
> > When there's gather/scatter or scalarization in the loop, don't
> > unroll, since the performance bottleneck is not the reduction.
> >
> > The unroll factor is set according to FMA/DOT_PROD_EXPR/SAD_EXPR as
> > CEIL ((latency * throughput), num_of_reduction)
> > i.e.
> > For FMA, latency is 4, throughput is 2; if there's 1 FMA for reduction
> > then the unroll factor is 2 * 4 / 1 = 8.
> >
> > There's also a vect_unroll_limit; the final suggested_unroll_factor is
> > set as MIN (vect_unroll_limit, 8).
> >
> > The vect_unroll_limit is mainly for register pressure, to avoid too many
> > spills.
> > Ideally, all instructions in the vectorized loop should be used to
> > determine unroll_factor with their (latency * throughput) / number,
> > but that would be too much for this patch, and may just be GIGO, so the
> > patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
> > SAD_EXPR.
> >
> > Note when DOT_PROD_EXPR is not natively supported,
> > m_num_reduc[X86_REDUC_DOT_PROD] += 3 * count, which almost prevents
> > unrolling.
> >
> > There's a performance boost for simple benchmarks with a
> > DOT_PROD_EXPR/FMA chain, and a slight improvement in SPEC2017
> > performance.
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
> >         Add new members m_num_reduc, m_prefer_unroll.
> >         (ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and
> >         m_num_reduc.
> >         (ix86_vector_costs::finish_cost): Determine
> >         m_suggested_unroll_factor with consideration of
> >         reduc_lat_mult_thr, m_num_reduc and
> >         ix86_vect_unroll_limit.
> >         * config/i386/i386.h (enum ix86_reduc_unroll_factor): New
> >         enum.
> >         (processor_costs): Add reduc_lat_mult_thr and
> >         vect_unroll_limit.
> >         * config/i386/x86-tune-costs.h: Initialize
> >         reduc_lat_mult_thr and vect_unroll_limit.
> >         * config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/vect_unroll-1.c: New test.
> >         * gcc.target/i386/vect_unroll-2.c: New test.
> >         * gcc.target/i386/vect_unroll-3.c: New test.
> >         * gcc.target/i386/vect_unroll-4.c: New test.
> >         * gcc.target/i386/vect_unroll-5.c: New test.
> >         * gcc.target/i386/vect_unroll-6.c: New test.
> > ---
> >  gcc/config/i386/i386.cc                       | 165 ++++++++++++++-
> >  gcc/config/i386/i386.h                        |  16 ++
> >  gcc/config/i386/i386.opt                      |   4 +
> >  gcc/config/i386/x86-tune-costs.h              | 192 ++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-1.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-2.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-3.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-4.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-5.c |  13 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-6.c |  12 ++
> >  10 files changed, 447 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 49bd3939eb4..1961c9c7883 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -25762,15 +25762,20 @@ private:
> >    unsigned m_num_sse_needed[3];
> >    /* Number of 256-bit vector permutation.  */
> >    unsigned m_num_avx256_vec_perm[3];
> > +  /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR  */
> > +  unsigned m_num_reduc[X86_REDUC_LAST];
> > +  /* Don't do unroll if m_prefer_unroll is false, default is true. 
*/ > > + bool m_prefer_unroll; > > }; > > > > ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool > > costing_for_scalar) > > : vector_costs (vinfo, costing_for_scalar), > > m_num_gpr_needed (), > > m_num_sse_needed (), > > - m_num_avx256_vec_perm () > > -{ > > -} > > + m_num_avx256_vec_perm (), > > + m_num_reduc (), > > + m_prefer_unroll (true) > > +{} > > > > /* Implement targetm.vectorize.create_costs. */ > > > > @@ -26067,6 +26072,125 @@ ix86_vector_costs::add_stmt_cost (int count, > > vect_cost_for_stmt kind, > > } > > } > > > > + /* Record number of load/store/gather/scatter in vectorized body. */ > > + if (where == vect_body && !m_costing_for_scalar) > > + { > > + switch (kind) > > + { > > + /* Emulated gather/scatter or any scalarization. */ > > + case scalar_load: > > + case scalar_stmt: > > + case scalar_store: > > + case vector_gather_load: > > + case vector_scatter_store: > > + m_prefer_unroll = false; > > + break; > > + > > + case vector_stmt: > > + case vec_to_scalar: > > + /* Count number of reduction FMA and "real" DOT_PROD_EXPR, > > + unroll in the vectorizer will enable partial sum. */ > > + if (stmt_info > > + && vect_is_reduction (stmt_info) > > + && stmt_info->stmt) > > + { > > + /* Handle __builtin_fma. */ > > + if (gimple_call_combined_fn (stmt_info->stmt) == CFN_FMA) > > + { > > + m_num_reduc[X86_REDUC_FMA] += count; > > + break; > > + } > > + > > + if (!is_gimple_assign (stmt_info->stmt)) > > + break; > > + > > + tree_code subcode = gimple_assign_rhs_code (stmt_info->stmt); > > + machine_mode inner_mode = GET_MODE_INNER (mode); > > + tree rhs1, rhs2; > > + bool native_vnni_p = true; > > + gimple* def; > > + machine_mode mode_rhs; > > + switch (subcode) > > + { > > + case PLUS_EXPR: > > + case MINUS_EXPR: > > + if (!fp || !flag_associative_math > > + || flag_fp_contract_mode != FP_CONTRACT_FAST) > > + break; > > + > > + /* FMA condition for different modes. 
*/ > > + if (((inner_mode == DFmode || inner_mode == SFmode) > > + && !TARGET_FMA && !TARGET_AVX512VL) > > + || (inner_mode == HFmode && !TARGET_AVX512FP16) > > + || (inner_mode == BFmode && !TARGET_AVX10_2)) > > + break; > > + > > + /* MULT_EXPR + PLUS_EXPR/MINUS_EXPR is transformed > > + to FMA/FNMA after vectorization. */ > > + rhs1 = gimple_assign_rhs1 (stmt_info->stmt); > > + rhs2 = gimple_assign_rhs2 (stmt_info->stmt); > > + if (subcode == PLUS_EXPR > > + && TREE_CODE (rhs1) == SSA_NAME > > + && (def = SSA_NAME_DEF_STMT (rhs1), true) > > + && is_gimple_assign (def) > > + && gimple_assign_rhs_code (def) == MULT_EXPR) > > + m_num_reduc[X86_REDUC_FMA] += count; > > + else if (TREE_CODE (rhs2) == SSA_NAME > > + && (def = SSA_NAME_DEF_STMT (rhs2), true) > > + && is_gimple_assign (def) > > + && gimple_assign_rhs_code (def) == MULT_EXPR) > > + m_num_reduc[X86_REDUC_FMA] += count; > > + break; > > + > > + /* Vectorizer lane_reducing_op_p supports DOT_PROX_EXPR, > > + WIDEN_SUM_EXPR and SAD_EXPR, x86 backend only supports > > + SAD_EXPR (usad{v16qi,v32qi,v64qi}) and DOT_PROD_EXPR. */ > > + case DOT_PROD_EXPR: > > + rhs1 = gimple_assign_rhs1 (stmt_info->stmt); > > + mode_rhs = TYPE_MODE (TREE_TYPE (rhs1)); > > + if (mode_rhs == QImode) > > + { > > + rhs2 = gimple_assign_rhs2 (stmt_info->stmt); > > + signop signop1_p = TYPE_SIGN (TREE_TYPE (rhs1)); > > + signop signop2_p = TYPE_SIGN (TREE_TYPE (rhs2)); > > + > > + /* vpdpbusd. */ > > + if (signop1_p != signop2_p) > > + native_vnni_p > > + = (GET_MODE_SIZE (mode) == 64 > > + ? TARGET_AVX512VNNI > > + : ((TARGET_AVX512VNNI && TARGET_AVX512VL) > > + || TARGET_AVXVNNI)); > > + else > > + /* vpdpbssd. */ > > + native_vnni_p > > + = (GET_MODE_SIZE (mode) == 64 > > + ? TARGET_AVX10_2 > > + : (TARGET_AVXVNNIINT8 || TARGET_AVX10_2)); > > + } > > + m_num_reduc[X86_REDUC_DOT_PROD] += count; > > + > > + /* Dislike to do unroll and partial sum for > > + emulated DOT_PROD_EXPR. 
*/ > > + if (!native_vnni_p) > > + m_num_reduc[X86_REDUC_DOT_PROD] += 3 * count; > > + break; > > + > > + case SAD_EXPR: > > + m_num_reduc[X86_REDUC_SAD] += count; > > + break; > > + > > + default: > > + break; > > + } > > + } > > + > > + default: > > + break; > > + } > > + } > > + > > + > > combined_fn cfn; > > if ((kind == vector_stmt || kind == scalar_stmt) > > && stmt_info > > @@ -26282,6 +26406,41 @@ ix86_vector_costs::finish_cost (const vector_costs > > *scalar_costs) > > && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > > m_costs[vect_body] = INT_MAX; > > + > > + bool any_reduc_p = false; > > + for (int i = 0; i != X86_REDUC_LAST; i++) > > + if (m_num_reduc[i]) > > + { > > + any_reduc_p = true; > > + break; > > + } > > + > > + if (any_reduc_p > > + /* Not much gain for loop with gather and scatter. */ > > + && m_prefer_unroll > > + && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > > + { > > + unsigned unroll_factor > > + = OPTION_SET_P (ix86_vect_unroll_limit) > > + ? 
ix86_vect_unroll_limit > > + : ix86_cost->vect_unroll_limit; > > + > > + if (unroll_factor > 1) > > + { > > + for (int i = 0 ; i != X86_REDUC_LAST; i++) > > + { > > + if (m_num_reduc[i]) > > + { > > + unsigned tmp = CEIL (ix86_cost->reduc_lat_mult_thr[i], > > + m_num_reduc[i]); > > + unroll_factor = MIN (unroll_factor, tmp); > > + } > > + } > > + > > + m_suggested_unroll_factor = 1 << ceil_log2 (unroll_factor); > > + } > > + } > > + > > } > > > > ix86_vect_estimate_reg_pressure (); > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > index 791f3b9e133..817bf665c40 100644 > > --- a/gcc/config/i386/i386.h > > +++ b/gcc/config/i386/i386.h > > @@ -102,6 +102,15 @@ struct stringop_algs > > #define COSTS_N_BYTES(N) ((N) * 2) > > #endif > > > > + > > +enum ix86_reduc_unroll_factor{ > > + X86_REDUC_FMA, > > + X86_REDUC_DOT_PROD, > > + X86_REDUC_SAD, > > + > > + X86_REDUC_LAST > > +}; > > + > > /* Define the specific costs for a given cpu. NB: hard_register is used > > by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute > > hard register move costs by register allocator. Relative costs of > > @@ -225,6 +234,13 @@ struct processor_costs { > > to number of instructions executed in > > parallel. See also > > ix86_reassociation_width. */ > > + const unsigned reduc_lat_mult_thr[X86_REDUC_LAST]; > > + /* Latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + const unsigned vect_unroll_limit; /* Limit how much the autovectorizer > > + may unroll a loop. */ > > struct stringop_algs *memcpy, *memset; > > const int cond_taken_branch_cost; /* Cost of taken branch for > > vectorizer > > cost model. 
*/ > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt > > index c93c0b1bb38..6bda22f4843 100644 > > --- a/gcc/config/i386/i386.opt > > +++ b/gcc/config/i386/i386.opt > > @@ -1246,6 +1246,10 @@ munroll-only-small-loops > > Target Var(ix86_unroll_only_small_loops) Init(0) Optimization > > Enable conservative small loop unrolling. > > > > +-param=ix86-vect-unroll-limit= > > +Target Joined UInteger Var(ix86_vect_unroll_limit) Init(4) Param > > +Limit how much the autovectorizer may unroll a loop. > > + > > mlam= > > Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) > > Init(lam_none) > > -mlam=[none|u48|u57] Instrument meta data position in user data pointers. > > diff --git a/gcc/config/i386/x86-tune-costs.h > > b/gcc/config/i386/x86-tune-costs.h > > index c8603b982af..1649ea2fe3e 100644 > > --- a/gcc/config/i386/x86-tune-costs.h > > +++ b/gcc/config/i386/x86-tune-costs.h > > @@ -141,6 +141,12 @@ struct processor_costs ix86_size_cost = {/* costs for > > tuning for size */ > > COSTS_N_BYTES (4), /* cost of CVT(T)PS2PI instruction. > > */ > > > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > ix86_size_memcpy, > > ix86_size_memset, > > COSTS_N_BYTES (1), /* cond_taken_branch_cost. */ > > @@ -261,6 +267,12 @@ struct processor_costs i386_cost = { /* 386 > > specific costs */ > > COSTS_N_INSNS (27), /* cost of CVTPI2PS > > instruction. */ > > COSTS_N_INSNS (27), /* cost of CVT(T)PS2PI > > instruction. */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. 
*/ > > i386_memcpy, > > i386_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -382,6 +394,12 @@ struct processor_costs i486_cost = { /* 486 > > specific costs */ > > COSTS_N_INSNS (27), /* cost of CVTPI2PS > > instruction. */ > > COSTS_N_INSNS (27), /* cost of CVT(T)PS2PI > > instruction. */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > i486_memcpy, > > i486_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -501,6 +519,12 @@ struct processor_costs pentium_cost = { > > COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > pentium_memcpy, > > pentium_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -613,6 +637,12 @@ struct processor_costs lakemont_cost = { > > COSTS_N_INSNS (5), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (5), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > pentium_memcpy, > > pentium_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -740,6 +770,12 @@ struct processor_costs pentiumpro_cost = { > > COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. 
*/ > > COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > pentiumpro_memcpy, > > pentiumpro_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -858,6 +894,12 @@ struct processor_costs geode_cost = { > > COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > geode_memcpy, > > geode_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -979,6 +1021,12 @@ struct processor_costs k6_cost = { > > COSTS_N_INSNS (2), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (2), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > k6_memcpy, > > k6_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -1101,6 +1149,12 @@ struct processor_costs athlon_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. 
*/ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > athlon_memcpy, > > athlon_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -1232,6 +1286,12 @@ struct processor_costs k8_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (5), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > k8_memcpy, > > k8_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -1371,6 +1431,12 @@ struct processor_costs amdfam10_cost = { > > COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > amdfam10_memcpy, > > amdfam10_memset, > > COSTS_N_INSNS (2), /* cond_taken_branch_cost. */ > > @@ -1503,6 +1569,12 @@ const struct processor_costs bdver_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 2, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > bdver_memcpy, > > bdver_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -1668,6 +1740,12 @@ struct processor_costs znver1_cost = { > > plus/minus operations per cycle but only one multiply. 
This is > > adjusted > > in ix86_reassociation_width. */ > > 4, 4, 3, 6, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {5, 1, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > znver1_memcpy, > > znver1_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -1836,6 +1914,12 @@ struct processor_costs znver2_cost = { > > plus/minus operations per cycle but only one multiply. This is > > adjusted > > in ix86_reassociation_width. */ > > 4, 4, 3, 6, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {10, 1, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > znver2_memcpy, > > znver2_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -1979,6 +2063,12 @@ struct processor_costs znver3_cost = { > > plus/minus operations per cycle but only one multiply. This is > > adjusted > > in ix86_reassociation_width. */ > > 4, 4, 3, 6, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 1, 6}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > znver2_memcpy, > > znver2_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -2125,6 +2215,12 @@ struct processor_costs znver4_cost = { > > plus/minus operations per cycle but only one multiply. This is > > adjusted > > in ix86_reassociation_width. */ > > 4, 4, 3, 6, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 8, 6}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. 
*/ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > znver2_memcpy, > > znver2_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -2287,6 +2383,12 @@ struct processor_costs znver5_cost = { > > We increase width to 6 for multiplications > > in ix86_reassociation_width. */ > > 6, 6, 4, 6, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 8, 6}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > znver2_memcpy, > > znver2_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -2422,6 +2524,12 @@ struct processor_costs skylake_cost = { > > COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (7), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 4, 2, 2, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 1, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > skylake_memcpy, > > skylake_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -2559,6 +2667,12 @@ struct processor_costs icelake_cost = { > > COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 4, 2, 2, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 10, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > icelake_memcpy, > > icelake_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -2690,6 +2804,12 @@ struct processor_costs alderlake_cost = { > > COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. 
*/ > > COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 4, 3, 3, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 8, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 4, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > alderlake_memcpy, > > alderlake_memset, > > COSTS_N_INSNS (4), /* cond_taken_branch_cost. */ > > @@ -2814,6 +2934,12 @@ const struct processor_costs btver1_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > btver1_memcpy, > > btver1_memset, > > COSTS_N_INSNS (2), /* cond_taken_branch_cost. */ > > @@ -2935,6 +3061,12 @@ const struct processor_costs btver2_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > btver2_memcpy, > > btver2_memset, > > COSTS_N_INSNS (2), /* cond_taken_branch_cost. */ > > @@ -3055,6 +3187,12 @@ struct processor_costs pentium4_cost = { > > COSTS_N_INSNS (12), /* cost of CVTPI2PS > > instruction. */ > > COSTS_N_INSNS (8), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. 
*/ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > pentium4_memcpy, > > pentium4_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -3178,6 +3316,12 @@ struct processor_costs nocona_cost = { > > COSTS_N_INSNS (12), /* cost of CVTPI2PS > > instruction. */ > > COSTS_N_INSNS (8), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 1, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {1, 1, 1}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > nocona_memcpy, > > nocona_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -3299,6 +3443,12 @@ struct processor_costs atom_cost = { > > COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 2, 2, 2, 2, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 8, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 2, /* Limit how much the autovectorizer > > + may unroll a loop. */ > > atom_memcpy, > > atom_memset, > > COSTS_N_INSNS (3), /* cond_taken_branch_cost. */ > > @@ -3420,6 +3570,12 @@ struct processor_costs slm_cost = { > > COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */ > > COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. > > */ > > 1, 2, 1, 1, /* reassoc int, fp, vec_int, > > vec_fp. */ > > + {8, 8, 3}, /* latency times throughput of > > + FMA/DOT_PROD_EXPR/SAD_EXPR, > > + it's used to determine unroll > > + factor in the vectorizer. */ > > + 1, /* Limit how much the autovectorizer > > + may unroll a loop. 
*/
> >    slm_memcpy,
> >    slm_memset,
> >    COSTS_N_INSNS (3),  /* cond_taken_branch_cost.  */
> > @@ -3555,6 +3711,12 @@ struct processor_costs tremont_cost = {
> >    COSTS_N_INSNS (4),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),  /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  4,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    tremont_memcpy,
> >    tremont_memset,
> >    COSTS_N_INSNS (4),  /* cond_taken_branch_cost.  */
> > @@ -3681,6 +3843,12 @@ struct processor_costs lujiazui_cost = {
> >    COSTS_N_INSNS (3),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),  /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  4,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    lujiazui_memcpy,
> >    lujiazui_memset,
> >    COSTS_N_INSNS (4),  /* cond_taken_branch_cost.  */
> > @@ -3805,6 +3973,12 @@ struct processor_costs yongfeng_cost = {
> >    COSTS_N_INSNS (3),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),  /* cost of CVT(T)PS2PI instruction.  */
> >    4, 4, 4, 4,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  1,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    yongfeng_memcpy,
> >    yongfeng_memset,
> >    COSTS_N_INSNS (3),  /* cond_taken_branch_cost.  */
> > @@ -3929,6 +4103,12 @@ struct processor_costs shijidadao_cost = {
> >    COSTS_N_INSNS (3),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),  /* cost of CVT(T)PS2PI instruction.  */
> >    4, 4, 4, 4,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  1,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    shijidadao_memcpy,
> >    shijidadao_memset,
> >    COSTS_N_INSNS (3),  /* cond_taken_branch_cost.  */
> > @@ -4078,6 +4258,12 @@ struct processor_costs generic_cost = {
> >    COSTS_N_INSNS (3),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),  /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 8, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  4,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    generic_memcpy,
> >    generic_memset,
> >    COSTS_N_INSNS (4),  /* cond_taken_branch_cost.  */
> > @@ -4215,6 +4401,12 @@ struct processor_costs core_cost = {
> >    COSTS_N_INSNS (6),  /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (7),  /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 2, 2,  /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},  /* latency times throughput of
> > +                 FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                 it's used to determine unroll
> > +                 factor in the vectorizer.  */
> > +  1,          /* Limit how much the autovectorizer
> > +                 may unroll a loop.  */
> >    core_memcpy,
> >    core_memset,
> >    COSTS_N_INSNS (3),  /* cond_taken_branch_cost.  */
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-1.c b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> > new file mode 100644
> > index 00000000000..2e294d3aea6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-2.c b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> > new file mode 100644
> > index 00000000000..069f7d37ae7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfnmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum -= a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-3.c b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> > new file mode 100644
> > index 00000000000..6860c2ffbd5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavxvnni -O3" } */
> > +/* { dg-final { scan-assembler-times {(?n)vpdpbusd[^\n]*ymm} 4 } } */
> > +
> > +int
> > +foo (unsigned char* a, char* b, int n)
> > +{
> > +  int sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-4.c b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> > new file mode 100644
> > index 00000000000..01d8af67b6e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -O3 -mno-avxvnni" } */
> > +/* { dg-final { scan-assembler-times {(?n)vpmaddwd[^\n]*ymm} 4 } } */
> > +
> > +int
> > +foo (unsigned char* a, char* b, int n)
> > +{
> > +  int sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-5.c b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> > new file mode 100644
> > index 00000000000..c6375b1bc8d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast -mgather" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 1 } } */
> > +
> > +float
> > +foo (float* a, int* b, float* c, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[b[i]] *c[i];
> > +  return sum;
> > +}
> > +
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-6.c b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> > new file mode 100644
> > index 00000000000..b64c2fbde57
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum = __builtin_fma (a[i], b[i], sum);
> > +  return sum;
> > +}
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

--
BR,
Hongtao
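
For reference, a minimal sketch of how the two new processor_costs entries quoted above could combine into a suggested unroll factor, following the rule described in the patch notes (latency times throughput divided by the number of reductions, then capped by the unroll limit). This is not the code from the patch: the struct unroll_tuning, the enum and suggest_unroll_factor are hypothetical names made up for illustration, and the real implementation may round or clamp differently.

```c
/* Hypothetical stand-in for the two new fields; the names mirror the
   patch description but this struct is not GCC's processor_costs.  */
struct unroll_tuning
{
  unsigned reduc_lat_mult_thr[3]; /* FMA, DOT_PROD_EXPR, SAD_EXPR.  */
  unsigned vect_unroll_limit;     /* Cap, mainly for register pressure.  */
};

/* Illustrative indices into reduc_lat_mult_thr (assumed ordering,
   taken from the comment in the quoted diff).  */
enum reduc_kind { REDUC_FMA, REDUC_DOT_PROD, REDUC_SAD };

/* Values copied from the generic_cost and core_cost hunks above.  */
static const struct unroll_tuning generic_tuning = { {8, 8, 3}, 4 };
static const struct unroll_tuning core_tuning = { {8, 1, 3}, 1 };

/* Sketch of MIN (vect_unroll_limit, reduc_lat_mult_thr / num_reductions),
   never returning less than 1 (an assumption of this sketch).  */
static unsigned
suggest_unroll_factor (const struct unroll_tuning *t, enum reduc_kind kind,
                       unsigned num_reductions)
{
  if (num_reductions == 0)
    return 1; /* Nothing to hide latency for.  */

  unsigned in_flight = t->reduc_lat_mult_thr[kind] / num_reductions;
  if (in_flight < 1)
    in_flight = 1;

  return in_flight < t->vect_unroll_limit ? in_flight : t->vect_unroll_limit;
}
```

With the generic values, one FMA reduction gives MIN (4, 8 / 1) = 4, which would be consistent with the count of 4 expected in vect_unroll-1.c if that table is the one in effect for the test; with the core values the limit of 1 disables the extra unrolling.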