>
> The comment doesn't match the bool type.
>
Fixed.
>
> is_gimple_assign (stmt_info->stmt)
>
Changed.
> There's also SAD_EXPR? The vectorizer has lane_reducing_op_p ()
> for this that also lists WIDEN_SUM_EXPR.
Added SAD_EXPR since x86 supports usad{v16qi, v32qi, v64qi}.
WIDEN_SUM_EXPR is not added since x86 doesn't support the optab.
> is issue rate a good measure here? I think for the given
> operation it's more like the number of ops that can be
> issued in parallel (like 2 for FMA) times the latency
> (like 3), thus the number of op that can be in flight?
Yes, it's better to use instruction latency times throughput.
The patch adds a new member to processor_costs: reduc_lat_mult_thr,
which should be latency times throughput.
For example:
For FMA, latency is 4 and throughput is 2, so reduc_lat_mult_thr is 8;
if there's 1 FMA for reduction then the unroll factor is 8 / 1 = 8.
There's also a vect_unroll_limit; the final suggested_unroll_factor is
set as MIN (vect_unroll_limit, 8).
The vect_unroll_limit is mainly for register pressure, to avoid too many
spills.
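
As a rough illustration of the arithmetic above, here is a minimal standalone
sketch, not the code from the patch. The names suggest_unroll, ceil_div,
round_up_pow2 and zen4_lat_mult_thr are hypothetical; the rounding up to a
power of two follows what finish_cost does in the diff below.

```c
#include <assert.h>

/* Hypothetical stand-in for enum ix86_reduc_unroll_factor.  */
enum { REDUC_FMA, REDUC_DOT_PROD, REDUC_SAD, REDUC_LAST };

/* Ceiling division, in the spirit of GCC's CEIL macro.  */
static unsigned
ceil_div (unsigned a, unsigned b)
{
  assert (b != 0);
  return (a + b - 1) / b;
}

/* Smallest power of two that is >= X (returns 1 for X <= 1).  */
static unsigned
round_up_pow2 (unsigned x)
{
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

/* Suggest an unroll factor: start from the limit, then clamp it by
   CEIL (latency * throughput, number of reductions) for every kind of
   reduction that is present.  Returns 1 when there is nothing to do.  */
unsigned
suggest_unroll (const unsigned lat_mult_thr[REDUC_LAST],
                const unsigned num_reduc[REDUC_LAST],
                unsigned unroll_limit)
{
  unsigned factor = unroll_limit;
  int any_reduc = 0;

  for (int i = 0; i != REDUC_LAST; i++)
    if (num_reduc[i])
      {
        unsigned tmp = ceil_div (lat_mult_thr[i], num_reduc[i]);
        if (tmp < factor)
          factor = tmp;
        any_reduc = 1;
      }

  if (!any_reduc || unroll_limit <= 1)
    return 1;
  return round_up_pow2 (factor);
}

/* Zen4 values quoted in this mail: FMA, DOT_PROD_EXPR, SAD_EXPR.  */
const unsigned zen4_lat_mult_thr[REDUC_LAST] = { 8, 8, 6 };
```

With the Zen4 values and a single FMA reduction, a limit of 4 gives 4 and a
limit of 16 gives 8; with three FMA reductions and a limit of 16 the sketch
computes CEIL (8, 3) = 3 and rounds it up to 4.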
Ideally, all instructions in the vectorized loop should be used to
determine unroll_factor with their (latency * throughput) / number,
but that would be too much for this patch, and may just be GIGO, so the
patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
SAD_EXPR.
For the latest AMD/Intel processors, reduc_lat_mult_thr is set according to
Instruction tables by Agner Fog.
For example:
reduc_lat_mult_thr of Zen4 is set to
(8, 8, 6)
FMA: latency is 4 cycles, throughput is 2.
VPDPBUSD: latency is 4 cycles, throughput is 2.
VPSADBW: latency is 3 cycles, throughput is 2.
reduc_lat_mult_thr of SPR is set to
(8, 10, 3)
FMA: latency is 4 cycles, throughput is 2.
VPDPBUSD: latency is 5 cycles, throughput is 2.
VPSADBW: latency is 3 cycles, throughput is 1.
> ix86_issue_rate should only be a very rough approximation of
> that? I suppose we should have separate tuning entries
> for this, like one for the number of FMAC units and the
> FMAC units (best case) latency? As for SAD_EXPR that
> would be an integer op, that probably goes to a different
> pipeline unit.
>
> In general we should have a look at register pressure, I
> suppose issue_rate / m_num_reductions ensures we're never
> getting close to this in practice.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
The patch tries to unroll the vectorized loop when there are
FMA/DOT_PROD_EXPR/SAD_EXPR reductions; this breaks the cross-iteration
dependence and enables more parallelism (since the vectorizer will also
enable partial sums).
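
To make the dependence argument concrete, here is a hand-written scalar
sketch of the same idea, not vectorizer output. The function names are
hypothetical. The second version keeps four independent partial sums, so
four multiply-adds can be in flight at once; note that it changes the
order of the floating-point additions, which is why the compiler only does
this for FP under -ffast-math style flags.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every addition has to wait for the previous one,
   so the loop is bound by the latency of the multiply-add.  */
float
dot_single_acc (const float *a, const float *b, size_t n)
{
  assert (a != NULL && b != NULL);
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

/* Four independent accumulators (partial sums): the four chains do not
   depend on each other and are only combined after the loop.  */
float
dot_four_acc (const float *a, const float *b, size_t n)
{
  assert (a != NULL && b != NULL);
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    {
      s0 += a[i] * b[i];
      s1 += a[i + 1] * b[i + 1];
      s2 += a[i + 2] * b[i + 2];
      s3 += a[i + 3] * b[i + 3];
    }

  /* Combine the partial sums, then handle the remaining elements.  */
  float sum = (s0 + s1) + (s2 + s3);
  for (; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```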
When there's gather/scatter or scalarization in the loop, don't
unroll since the performance bottleneck is not at the reduction.
The unroll factor is set according to FMA/DOT_PROD_EXPR/SAD_EXPR as
CEIL ((latency * throughput), num_of_reduction).
For example:
For FMA, latency is 4 and throughput is 2; if there's 1 FMA for reduction
then the unroll factor is 2 * 4 / 1 = 8.
There's also a vect_unroll_limit; the final suggested_unroll_factor is
set as MIN (vect_unroll_limit, 8).
The vect_unroll_limit is mainly for register pressure, to avoid too many
spills.
Ideally, all instructions in the vectorized loop should be used to
determine unroll_factor with their (latency * throughput) / number,
but that would be too much for this patch, and may just be GIGO, so the
patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
SAD_EXPR.
Note that when DOT_PROD_EXPR is not natively supported,
m_num_reduc[X86_REDUC_DOT_PROD] gets an additional 3 * count, which
almost prevents unrolling.
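
The effect of that extra weight can be sketched as follows. This is an
illustrative standalone snippet, not code from the patch; dot_prod_weight
and unroll_before_rounding are hypothetical names, and the numbers 8 and 4
are the generic_cost values for DOT_PROD_EXPR and vect_unroll_limit in the
diff below.

```c
#include <assert.h>

/* Weight recorded for COUNT DOT_PROD_EXPR reductions: COUNT when the
   target has a native instruction, COUNT + 3 * COUNT when emulated.  */
unsigned
dot_prod_weight (unsigned count, int native_p)
{
  unsigned weight = count;
  if (!native_p)
    weight += 3 * count;
  return weight;
}

/* MIN (limit, CEIL (lat_mult_thr, weight)), before the final rounding
   to a power of two.  */
unsigned
unroll_before_rounding (unsigned lat_mult_thr, unsigned weight,
                        unsigned limit)
{
  assert (weight != 0);
  unsigned tmp = (lat_mult_thr + weight - 1) / weight;
  return tmp < limit ? tmp : limit;
}
```

For one native dot-product reduction this gives MIN (4, 8 / 1) = 4, while
the emulated case has weight 4 and gives MIN (4, 8 / 4) = 2.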
There's a performance boost for simple benchmarks with a DOT_PROD_EXPR/FMA
chain, and a slight improvement in SPEC2017 performance.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
Add new members m_num_reduc, m_prefer_unroll.
(ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and
m_num_reduc.
(ix86_vector_costs::finish_cost): Determine
m_suggested_unroll_factor with consideration of
reduc_lat_mult_thr, m_num_reduc and
ix86_vect_unroll_limit.
* config/i386/i386.h (enum ix86_reduc_unroll_factor): New
enum.
(processor_costs): Add reduc_lat_mult_thr and
vect_unroll_limit.
* config/i386/x86-tune-costs.h: Initialize
reduc_lat_mult_thr and vect_unroll_limit.
* config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect_unroll-1.c: New test.
* gcc.target/i386/vect_unroll-2.c: New test.
* gcc.target/i386/vect_unroll-3.c: New test.
* gcc.target/i386/vect_unroll-4.c: New test.
* gcc.target/i386/vect_unroll-5.c: New test.
* gcc.target/i386/vect_unroll-6.c: New test.
---
gcc/config/i386/i386.cc | 165 ++++++++++++++-
gcc/config/i386/i386.h | 16 ++
gcc/config/i386/i386.opt | 4 +
gcc/config/i386/x86-tune-costs.h | 192 ++++++++++++++++++
gcc/testsuite/gcc.target/i386/vect_unroll-1.c | 12 ++
gcc/testsuite/gcc.target/i386/vect_unroll-2.c | 12 ++
gcc/testsuite/gcc.target/i386/vect_unroll-3.c | 12 ++
gcc/testsuite/gcc.target/i386/vect_unroll-4.c | 12 ++
gcc/testsuite/gcc.target/i386/vect_unroll-5.c | 13 ++
gcc/testsuite/gcc.target/i386/vect_unroll-6.c | 12 ++
10 files changed, 447 insertions(+), 3 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-1.c
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-2.c
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-3.c
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-4.c
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-5.c
create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-6.c
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 49bd3939eb4..1961c9c7883 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25762,15 +25762,20 @@ private:
unsigned m_num_sse_needed[3];
/* Number of 256-bit vector permutation. */
unsigned m_num_avx256_vec_perm[3];
+ /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR.  */
+ unsigned m_num_reduc[X86_REDUC_LAST];
+ /* Don't do unroll if m_prefer_unroll is false, default is true. */
+ bool m_prefer_unroll;
};
ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar)
: vector_costs (vinfo, costing_for_scalar),
m_num_gpr_needed (),
m_num_sse_needed (),
- m_num_avx256_vec_perm ()
-{
-}
+ m_num_avx256_vec_perm (),
+ m_num_reduc (),
+ m_prefer_unroll (true)
+{}
/* Implement targetm.vectorize.create_costs. */
@@ -26067,6 +26072,125 @@ ix86_vector_costs::add_stmt_cost (int count,
vect_cost_for_stmt kind,
}
}
+ /* Record number of load/store/gather/scatter in vectorized body. */
+ if (where == vect_body && !m_costing_for_scalar)
+ {
+ switch (kind)
+ {
+ /* Emulated gather/scatter or any scalarization. */
+ case scalar_load:
+ case scalar_stmt:
+ case scalar_store:
+ case vector_gather_load:
+ case vector_scatter_store:
+ m_prefer_unroll = false;
+ break;
+
+ case vector_stmt:
+ case vec_to_scalar:
+ /* Count number of reduction FMA and "real" DOT_PROD_EXPR,
+ unroll in the vectorizer will enable partial sum. */
+ if (stmt_info
+ && vect_is_reduction (stmt_info)
+ && stmt_info->stmt)
+ {
+ /* Handle __builtin_fma. */
+ if (gimple_call_combined_fn (stmt_info->stmt) == CFN_FMA)
+ {
+ m_num_reduc[X86_REDUC_FMA] += count;
+ break;
+ }
+
+ if (!is_gimple_assign (stmt_info->stmt))
+ break;
+
+ tree_code subcode = gimple_assign_rhs_code (stmt_info->stmt);
+ machine_mode inner_mode = GET_MODE_INNER (mode);
+ tree rhs1, rhs2;
+ bool native_vnni_p = true;
+ gimple* def;
+ machine_mode mode_rhs;
+ switch (subcode)
+ {
+ case PLUS_EXPR:
+ case MINUS_EXPR:
+ if (!fp || !flag_associative_math
+ || flag_fp_contract_mode != FP_CONTRACT_FAST)
+ break;
+
+ /* FMA condition for different modes. */
+ if (((inner_mode == DFmode || inner_mode == SFmode)
+ && !TARGET_FMA && !TARGET_AVX512VL)
+ || (inner_mode == HFmode && !TARGET_AVX512FP16)
+ || (inner_mode == BFmode && !TARGET_AVX10_2))
+ break;
+
+ /* MULT_EXPR + PLUS_EXPR/MINUS_EXPR is transformed
+ to FMA/FNMA after vectorization. */
+ rhs1 = gimple_assign_rhs1 (stmt_info->stmt);
+ rhs2 = gimple_assign_rhs2 (stmt_info->stmt);
+ if (subcode == PLUS_EXPR
+ && TREE_CODE (rhs1) == SSA_NAME
+ && (def = SSA_NAME_DEF_STMT (rhs1), true)
+ && is_gimple_assign (def)
+ && gimple_assign_rhs_code (def) == MULT_EXPR)
+ m_num_reduc[X86_REDUC_FMA] += count;
+ else if (TREE_CODE (rhs2) == SSA_NAME
+ && (def = SSA_NAME_DEF_STMT (rhs2), true)
+ && is_gimple_assign (def)
+ && gimple_assign_rhs_code (def) == MULT_EXPR)
+ m_num_reduc[X86_REDUC_FMA] += count;
+ break;
+
+ /* Vectorizer lane_reducing_op_p supports DOT_PROD_EXPR,
+ WIDEN_SUM_EXPR and SAD_EXPR, x86 backend only supports
+ SAD_EXPR (usad{v16qi,v32qi,v64qi}) and DOT_PROD_EXPR. */
+ case DOT_PROD_EXPR:
+ rhs1 = gimple_assign_rhs1 (stmt_info->stmt);
+ mode_rhs = TYPE_MODE (TREE_TYPE (rhs1));
+ if (mode_rhs == QImode)
+ {
+ rhs2 = gimple_assign_rhs2 (stmt_info->stmt);
+ signop signop1_p = TYPE_SIGN (TREE_TYPE (rhs1));
+ signop signop2_p = TYPE_SIGN (TREE_TYPE (rhs2));
+
+ /* vpdpbusd. */
+ if (signop1_p != signop2_p)
+ native_vnni_p
+ = (GET_MODE_SIZE (mode) == 64
+ ? TARGET_AVX512VNNI
+ : ((TARGET_AVX512VNNI && TARGET_AVX512VL)
+ || TARGET_AVXVNNI));
+ else
+ /* vpdpbssd. */
+ native_vnni_p
+ = (GET_MODE_SIZE (mode) == 64
+ ? TARGET_AVX10_2
+ : (TARGET_AVXVNNIINT8 || TARGET_AVX10_2));
+ }
+ m_num_reduc[X86_REDUC_DOT_PROD] += count;
+
+ /* Dislike to do unroll and partial sum for
+ emulated DOT_PROD_EXPR. */
+ if (!native_vnni_p)
+ m_num_reduc[X86_REDUC_DOT_PROD] += 3 * count;
+ break;
+
+ case SAD_EXPR:
+ m_num_reduc[X86_REDUC_SAD] += count;
+ break;
+
+ default:
+ break;
+ }
+ }
+
+ default:
+ break;
+ }
+ }
+
+
combined_fn cfn;
if ((kind == vector_stmt || kind == scalar_stmt)
&& stmt_info
@@ -26282,6 +26406,41 @@ ix86_vector_costs::finish_cost (const vector_costs
*scalar_costs)
&& (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
m_costs[vect_body] = INT_MAX;
+
+ bool any_reduc_p = false;
+ for (int i = 0; i != X86_REDUC_LAST; i++)
+ if (m_num_reduc[i])
+ {
+ any_reduc_p = true;
+ break;
+ }
+
+ if (any_reduc_p
+ /* Not much gain for loop with gather and scatter. */
+ && m_prefer_unroll
+ && !LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+ {
+ unsigned unroll_factor
+ = OPTION_SET_P (ix86_vect_unroll_limit)
+ ? ix86_vect_unroll_limit
+ : ix86_cost->vect_unroll_limit;
+
+ if (unroll_factor > 1)
+ {
+ for (int i = 0 ; i != X86_REDUC_LAST; i++)
+ {
+ if (m_num_reduc[i])
+ {
+ unsigned tmp = CEIL (ix86_cost->reduc_lat_mult_thr[i],
+ m_num_reduc[i]);
+ unroll_factor = MIN (unroll_factor, tmp);
+ }
+ }
+
+ m_suggested_unroll_factor = 1 << ceil_log2 (unroll_factor);
+ }
+ }
+
}
ix86_vect_estimate_reg_pressure ();
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 791f3b9e133..817bf665c40 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -102,6 +102,15 @@ struct stringop_algs
#define COSTS_N_BYTES(N) ((N) * 2)
#endif
+
+enum ix86_reduc_unroll_factor{
+ X86_REDUC_FMA,
+ X86_REDUC_DOT_PROD,
+ X86_REDUC_SAD,
+
+ X86_REDUC_LAST
+};
+
/* Define the specific costs for a given cpu. NB: hard_register is used
by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
hard register move costs by register allocator. Relative costs of
@@ -225,6 +234,13 @@ struct processor_costs {
to number of instructions executed in
parallel. See also
ix86_reassociation_width. */
+ const unsigned reduc_lat_mult_thr[X86_REDUC_LAST];
+ /* Latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ const unsigned vect_unroll_limit; /* Limit how much the autovectorizer
+ may unroll a loop. */
struct stringop_algs *memcpy, *memset;
const int cond_taken_branch_cost; /* Cost of taken branch for vectorizer
cost model. */
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index c93c0b1bb38..6bda22f4843 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1246,6 +1246,10 @@ munroll-only-small-loops
Target Var(ix86_unroll_only_small_loops) Init(0) Optimization
Enable conservative small loop unrolling.
+-param=ix86-vect-unroll-limit=
+Target Joined UInteger Var(ix86_vect_unroll_limit) Init(4) Param
+Limit how much the autovectorizer may unroll a loop.
+
mlam=
Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) Init(lam_none)
-mlam=[none|u48|u57] Instrument meta data position in user data pointers.
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index c8603b982af..1649ea2fe3e 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -141,6 +141,12 @@ struct processor_costs ix86_size_cost = {/* costs for
tuning for size */
COSTS_N_BYTES (4), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
ix86_size_memcpy,
ix86_size_memset,
COSTS_N_BYTES (1), /* cond_taken_branch_cost. */
@@ -261,6 +267,12 @@ struct processor_costs i386_cost = { /* 386 specific
costs */
COSTS_N_INSNS (27), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (27), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
i386_memcpy,
i386_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -382,6 +394,12 @@ struct processor_costs i486_cost = { /* 486 specific
costs */
COSTS_N_INSNS (27), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (27), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
i486_memcpy,
i486_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -501,6 +519,12 @@ struct processor_costs pentium_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
pentium_memcpy,
pentium_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -613,6 +637,12 @@ struct processor_costs lakemont_cost = {
COSTS_N_INSNS (5), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (5), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
pentium_memcpy,
pentium_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -740,6 +770,12 @@ struct processor_costs pentiumpro_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
pentiumpro_memcpy,
pentiumpro_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -858,6 +894,12 @@ struct processor_costs geode_cost = {
COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
geode_memcpy,
geode_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -979,6 +1021,12 @@ struct processor_costs k6_cost = {
COSTS_N_INSNS (2), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (2), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
k6_memcpy,
k6_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -1101,6 +1149,12 @@ struct processor_costs athlon_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
athlon_memcpy,
athlon_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -1232,6 +1286,12 @@ struct processor_costs k8_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (5), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
k8_memcpy,
k8_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -1371,6 +1431,12 @@ struct processor_costs amdfam10_cost = {
COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
amdfam10_memcpy,
amdfam10_memset,
COSTS_N_INSNS (2), /* cond_taken_branch_cost. */
@@ -1503,6 +1569,12 @@ const struct processor_costs bdver_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 2, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
bdver_memcpy,
bdver_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -1668,6 +1740,12 @@ struct processor_costs znver1_cost = {
plus/minus operations per cycle but only one multiply. This is adjusted
in ix86_reassociation_width. */
4, 4, 3, 6, /* reassoc int, fp, vec_int, vec_fp. */
+ {5, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
znver1_memcpy,
znver1_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -1836,6 +1914,12 @@ struct processor_costs znver2_cost = {
plus/minus operations per cycle but only one multiply. This is adjusted
in ix86_reassociation_width. */
4, 4, 3, 6, /* reassoc int, fp, vec_int, vec_fp. */
+ {10, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
znver2_memcpy,
znver2_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -1979,6 +2063,12 @@ struct processor_costs znver3_cost = {
plus/minus operations per cycle but only one multiply. This is adjusted
in ix86_reassociation_width. */
4, 4, 3, 6, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 6}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
znver2_memcpy,
znver2_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -2125,6 +2215,12 @@ struct processor_costs znver4_cost = {
plus/minus operations per cycle but only one multiply. This is adjusted
in ix86_reassociation_width. */
4, 4, 3, 6, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 6}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
znver2_memcpy,
znver2_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -2287,6 +2383,12 @@ struct processor_costs znver5_cost = {
We increase width to 6 for multiplications
in ix86_reassociation_width. */
6, 6, 4, 6, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 6}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
znver2_memcpy,
znver2_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -2422,6 +2524,12 @@ struct processor_costs skylake_cost = {
COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (7), /* cost of CVT(T)PS2PI instruction. */
1, 4, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
skylake_memcpy,
skylake_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -2559,6 +2667,12 @@ struct processor_costs icelake_cost = {
COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. */
1, 4, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 10, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
icelake_memcpy,
icelake_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -2690,6 +2804,12 @@ struct processor_costs alderlake_cost = {
COSTS_N_INSNS (7), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (6), /* cost of CVT(T)PS2PI instruction. */
1, 4, 3, 3, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
alderlake_memcpy,
alderlake_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -2814,6 +2934,12 @@ const struct processor_costs btver1_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
btver1_memcpy,
btver1_memset,
COSTS_N_INSNS (2), /* cond_taken_branch_cost. */
@@ -2935,6 +3061,12 @@ const struct processor_costs btver2_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
btver2_memcpy,
btver2_memset,
COSTS_N_INSNS (2), /* cond_taken_branch_cost. */
@@ -3055,6 +3187,12 @@ struct processor_costs pentium4_cost = {
COSTS_N_INSNS (12), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (8), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
pentium4_memcpy,
pentium4_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -3178,6 +3316,12 @@ struct processor_costs nocona_cost = {
COSTS_N_INSNS (12), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (8), /* cost of CVT(T)PS2PI instruction. */
1, 1, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {1, 1, 1}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
nocona_memcpy,
nocona_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -3299,6 +3443,12 @@ struct processor_costs atom_cost = {
COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
2, 2, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 2, /* Limit how much the autovectorizer
+ may unroll a loop. */
atom_memcpy,
atom_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -3420,6 +3570,12 @@ struct processor_costs slm_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 2, 1, 1, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
slm_memcpy,
slm_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -3555,6 +3711,12 @@ struct processor_costs tremont_cost = {
COSTS_N_INSNS (4), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (4), /* cost of CVT(T)PS2PI instruction. */
1, 4, 3, 3, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
tremont_memcpy,
tremont_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -3681,6 +3843,12 @@ struct processor_costs lujiazui_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
1, 4, 3, 3, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
lujiazui_memcpy,
lujiazui_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -3805,6 +3973,12 @@ struct processor_costs yongfeng_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
4, 4, 4, 4, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
yongfeng_memcpy,
yongfeng_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -3929,6 +4103,12 @@ struct processor_costs shijidadao_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
4, 4, 4, 4, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
shijidadao_memcpy,
shijidadao_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
@@ -4078,6 +4258,12 @@ struct processor_costs generic_cost = {
COSTS_N_INSNS (3), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (3), /* cost of CVT(T)PS2PI instruction. */
1, 4, 3, 3, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 8, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 4, /* Limit how much the autovectorizer
+ may unroll a loop. */
generic_memcpy,
generic_memset,
COSTS_N_INSNS (4), /* cond_taken_branch_cost. */
@@ -4215,6 +4401,12 @@ struct processor_costs core_cost = {
COSTS_N_INSNS (6), /* cost of CVTPI2PS instruction. */
COSTS_N_INSNS (7), /* cost of CVT(T)PS2PI instruction. */
1, 4, 2, 2, /* reassoc int, fp, vec_int, vec_fp. */
+ {8, 1, 3}, /* latency times throughput of
+ FMA/DOT_PROD_EXPR/SAD_EXPR,
+ it's used to determine unroll
+ factor in the vectorizer. */
+ 1, /* Limit how much the autovectorizer
+ may unroll a loop. */
core_memcpy,
core_memset,
COSTS_N_INSNS (3), /* cond_taken_branch_cost. */
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
new file mode 100644
index 00000000000..2e294d3aea6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v3 -Ofast" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
+
+float
+foo (float* a, float* b, int n)
+{
+ float sum = 0;
+ for (int i = 0; i != n; i++)
+ sum += a[i] * b[i];
+ return sum;
+}
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
new file mode 100644
index 00000000000..069f7d37ae7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v3 -Ofast" } */
+/* { dg-final { scan-assembler-times {(?n)vfnmadd[1-3]*ps[^\n]*ymm} 4 } } */
+
+float
+foo (float* a, float* b, int n)
+{
+ float sum = 0;
+ for (int i = 0; i != n; i++)
+ sum -= a[i] * b[i];
+ return sum;
+}
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
new file mode 100644
index 00000000000..6860c2ffbd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-mavxvnni -O3" } */
+/* { dg-final { scan-assembler-times {(?n)vpdpbusd[^\n]*ymm} 4 } } */
+
+int
+foo (unsigned char* a, char* b, int n)
+{
+ int sum = 0;
+ for (int i = 0; i != n; i++)
+ sum += a[i] * b[i];
+ return sum;
+}
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
new file mode 100644
index 00000000000..01d8af67b6e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v3 -O3 -mno-avxvnni" } */
+/* { dg-final { scan-assembler-times {(?n)vpmaddwd[^\n]*ymm} 4 } } */
+
+int
+foo (unsigned char* a, char* b, int n)
+{
+ int sum = 0;
+ for (int i = 0; i != n; i++)
+ sum += a[i] * b[i];
+ return sum;
+}
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
new file mode 100644
index 00000000000..c6375b1bc8d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v3 -Ofast -mgather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 1 } } */
+
+float
+foo (float* a, int* b, float* c, int n)
+{
+ float sum = 0;
+ for (int i = 0; i != n; i++)
+ sum += a[b[i]] *c[i];
+ return sum;
+}
+
diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
new file mode 100644
index 00000000000..b64c2fbde57
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v3 -Ofast" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
+
+float
+foo (float* a, float* b, int n)
+{
+ float sum = 0;
+ for (int i = 0; i != n; i++)
+ sum = __builtin_fma (a[i], b[i], sum);
+ return sum;
+}
--
2.34.1