x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
reduction is scalarized into per-lane extracts plus a dependent scalar
add chain (vect_expand_fold_left in tree-vect-loop.cc).
vect_model_reduction_cost costs this as ncopies vec_deconstruct ops and
ncopies * nunits scalar_stmt ops. That captures the per-lane work but
not its serialization: the extracts and adds all feed the same scalar
accumulator, whereas the cost model treats the lanes as independent.
The serialization grows with ncopies, since each extra vector copy
lengthens the same scalar chain. Track it in a new accumulator
m_fold_left_latency_penalty: for a loop-body vec_deconstruct or
scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the
missing (ncopies - 1) * count * stmt_cost. vec_deconstruct is recorded
with count == ncopies and scalar_stmt with count == ncopies * nunits, so
ncopies is recovered as count / nunits for the latter. finish_cost folds
the penalty into m_costs[vect_body] once. With ncopies == 1 there is a
single copy and no cross-copy chain, so nothing is added.
Without this the cost model can pick a large VF (e.g. VF=32 driven by a
uint8_t load co-located with float ops) and emit a loop body far larger
than the scalar loop with no real gain.
The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for O2/O3,
codesize of 503.bwaves_r by 7% for O3. Performance impact for SPEC2026
and SPEC2017 is negligible with O2/O3.
Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}.
It regressed
gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump vect
"optimized: epilogue loop vectorized using 32 byte vectors"
gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c scan-tree-dump vect
"optimized: loop vectorized using 64 byte vectors"
gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized:
epilogue loop vectorized using 32 byte vectors"
gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect "optimized: loop
vectorized using 64 byte vectors"
gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized:
epilogue loop vectorized using masked 64 byte vectors"
gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect "optimized: loop
vectorized using 64 byte vectors"
unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect
"optimized: epilogue loop vectorized using 32 byte vectors"
unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect
"optimized: loop vectorized using 64 byte vectors"
unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect
"optimized: epilogue loop vectorized using masked 64 byte vectors"
unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect
"optimized: loop vectorized using 64 byte vectors"
Because they're not vectorized anymore with more penalty added in the patch.
If the patch makes sense, I'll just adjust those testcases as xfail?
Any comments?
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs): Add
m_fold_left_latency_penalty member, initialized to 0.
(ix86_vector_costs::add_stmt_cost): Accumulate
count * stmt_cost * (ncopies - 1) into
m_fold_left_latency_penalty for loop-body vec_deconstruct and
scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 1.
(ix86_vector_costs::finish_cost): Add the accumulated penalty
to m_costs[vect_body].
gcc/testsuite/ChangeLog:
* gcc.target/i386/fold-left-reduc-cost.c: New test.
---
gcc/config/i386/i386.cc | 37 +++++++++++++++++++
.../gcc.target/i386/fold-left-reduc-cost.c | 17 +++++++++
2 files changed, 54 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index e66958db7ac..6126f265cd3 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -26186,6 +26186,8 @@ private:
unsigned m_num_avx512_vec_perm[3];
/* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR */
unsigned m_num_reduc[X86_REDUC_LAST];
+ /* Scalarization latency penalty for loop-body fold-left reductions. */
+ unsigned m_fold_left_latency_penalty;
/* Don't do unroll if m_prefer_unroll is false, default is true. */
bool m_prefer_unroll;
};
@@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs (vec_info* vinfo,
bool costing_for_scalar)
m_num_avx256_vec_perm (),
m_num_avx512_vec_perm (),
m_num_reduc (),
+ m_fold_left_latency_penalty (0),
m_prefer_unroll (true)
{}
@@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int count,
vect_cost_for_stmt kind,
if (stmt_cost == -1)
stmt_cost = ix86_default_vector_cost (kind, mode);
+ /* x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
+ reduction is scalarized into per-lane extracts plus a dependent scalar
+ add chain (vect_expand_fold_left). vect_model_reduction_cost costs
+ this as ncopies vec_deconstruct ops and ncopies * nunits scalar_stmt
+ ops, which captures the per-lane work but not its serialization: the
+ extracts and adds all feed the same scalar accumulator, while the cost
+ model treats the lanes as independent. With ncopies copies the chain
+ is ncopies times longer, so add the missing (ncopies - 1) * count *
+ stmt_cost for both kinds; finish_cost folds it in once. vec_deconstruct
+ is recorded with count == ncopies and scalar_stmt with count == ncopies
+ * nunits, hence the per-kind ncopies below. */
+ if ((kind == vec_deconstruct || kind == scalar_stmt)
+ && where == vect_body
+ && fp
+ && !m_costing_for_scalar
+ && is_a<loop_vec_info> (m_vinfo)
+ && node
+ && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION)
+ {
+ unsigned ncopies = (unsigned) count;
+ if (kind == scalar_stmt)
+ {
+ unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
+ ncopies = (ncopies + nunits - 1) / nunits;
+ }
+ if (ncopies > 1)
+ m_fold_left_latency_penalty
+ += adjust_cost_for_freq (stmt_info, where,
+ count * stmt_cost * (ncopies - 1));
+ }
+
/* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in body. */
if (kind == vec_perm && vectype && count != 0)
{
@@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const vector_costs
*scalar_costs)
}
+ if (m_fold_left_latency_penalty && m_costs[vect_body] != INT_MAX)
+ m_costs[vect_body] += m_fold_left_latency_penalty;
+
ix86_vect_estimate_reg_pressure ();
for (int i = 0; i != 3; i++)
diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
new file mode 100644
index 00000000000..5f26f968429
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" } */
+/* { dg-final { scan-tree-dump "vectorized 0 loops in function" "vect" } } */
+
+/* Loop driven by a uint8_t load co-located with float ops makes the
+ vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 for the
+ FOLD_LEFT FP reduction. The per-copy scalar-add chain serialization
+ should make vectorization unprofitable. */
+
+float
+foo (char *a, char *b, int n)
+{
+ float sum = 0;
+ for (int i = 0; i != n; i++)
+ sum += a[i] * b[i];
+ return sum;
+}
--
2.34.1