Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Richard Biener Tue, 30 Jun 2026 02:03:51 -0700

On Tue, 30 Jun 2026, Hongtao Liu wrote:

> On Tue, Jun 30, 2026 at 4:32 PM Richard Biener <[email protected]> wrote:
> >
> > On Tue, 30 Jun 2026, Hongtao Liu wrote:
> >
> > > On Tue, Jun 30, 2026 at 4:00 PM Richard Biener <[email protected]> wrote:
> > > >
> > > > On Tue, 30 Jun 2026, Hongtao Liu wrote:
> > > >
> > > > > On Tue, Jun 30, 2026 at 3:12 PM Richard Biener <[email protected]> 
> > > > > wrote:
> > > > > >
> > > > > > On Tue, 30 Jun 2026, hongtao.liu wrote:
> > > > > >
> > > > > > > x86 has no in-order FP reduction instruction, so a FOLD_LEFT FP
> > > > > > > reduction is scalarized into per-lane extracts plus a dependent 
> > > > > > > scalar
> > > > > > > add chain (vect_expand_fold_left in tree-vect-loop.cc).
> > > > > > > vect_model_reduction_cost costs this as ncopies vec_deconstruct 
> > > > > > > ops and
> > > > > > > ncopies * nunits scalar_stmt ops.  That captures the per-lane 
> > > > > > > work but
> > > > > > > not its serialization: the extracts and adds all feed the same 
> > > > > > > scalar
> > > > > > > accumulator, whereas the cost model treats the lanes as 
> > > > > > > independent.
> > > > > >
> > > > > > It really also boils down to how we interpret 'cost'.  As we sum
> > > > > > latency numbers you could view it as a pessimistic overall iteration
> > > > > > latency.  We don't have any idea of throughput in the x86 cost 
> > > > > > tables
> > > > > > (maybe via reassoc-width in a very ad-hoc way), nor do we attempt to
> > > > > > track the actual dependence chains.
> > > > > >
> > > > > > > The serialization grows with ncopies, since each extra vector copy
> > > > > > > lengthens the same scalar chain.  Track it in a new accumulator
> > > > > > > m_fold_left_latency_penalty: for a loop-body vec_deconstruct or
> > > > > > > scalar_stmt of an FP FOLD_LEFT_REDUCTION with ncopies > 1, add the
> > > > > > > missing (ncopies - 1) * count * stmt_cost.  vec_deconstruct is 
> > > > > > > recorded
> > > > > > > with count == ncopies and scalar_stmt with count == ncopies * 
> > > > > > > nunits, so
> > > > > > > ncopies is recovered as count / nunits for the latter.  
> > > > > > > finish_cost folds
> > > > > > > the penalty into m_costs[vect_body] once.  With ncopies == 1 
> > > > > > > there is a
> > > > > > > single copy and no cross-copy chain, so nothing is added.
> > > > > > >
> > > > > > > Without this the cost model can pick a large VF (e.g. VF=32 
> > > > > > > driven by a
> > > > > > > uint8_t load co-located with float ops) and emit a loop body far 
> > > > > > > larger
> > > > > > > than the scalar loop with no real gain.
> > > > > > >
> > > > > > > The patch reduces codesize(.text) of 731.astcenc_r by 13%/9% for 
> > > > > > > O2/O3,
> > > > > > > codesize of 503.bwaves_r by 7% for O3. Performance impact for 
> > > > > > > SPEC2026
> > > > > > > and SPEC2017 is negligible with O2/O3.
> > > > > > >
> > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{m32,}.
> > > > > > >
> > > > > > > It regressed
> > > > > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c 
> > > > > > > scan-tree-dump vect "optimized: epilogue loop vectorized using 32 
> > > > > > > byte vectors"
> > > > > > > gcc: gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c 
> > > > > > > scan-tree-dump vect "optimized: loop vectorized using 64 byte 
> > > > > > > vectors"
> > > > > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> > > > > > > "optimized: epilogue loop vectorized using 32 byte vectors"
> > > > > > > gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump vect 
> > > > > > > "optimized: loop vectorized using 64 byte vectors"
> > > > > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> > > > > > > "optimized: epilogue loop vectorized using masked 64 byte vectors"
> > > > > > > gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump vect 
> > > > > > > "optimized: loop vectorized using 64 byte vectors"
> > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump 
> > > > > > > vect "optimized: epilogue loop vectorized using 32 byte vectors"
> > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-6.c scan-tree-dump 
> > > > > > > vect "optimized: loop vectorized using 64 byte vectors"
> > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump 
> > > > > > > vect "optimized: epilogue loop vectorized using masked 64 byte 
> > > > > > > vectors"
> > > > > > > unix/-m32: gcc: gcc.target/i386/vect-epilogues-7.c scan-tree-dump 
> > > > > > > vect "optimized: loop vectorized using 64 byte vectors"
> > > > > > >
> > > > > > > Because they're not vectorized anymore with more penalty added in 
> > > > > > > the patch.
> > > > > > > If the patch makes sense, I'll just adjust those testcases as 
> > > > > > > xfail?
> > > > > >
> > > > > > Rather than XFAILing the indiviual tests need looking at, for 
> > > > > > example
> > > > > > gcc.dg/vect/costmodel/x86_64/costmodel-vect-epil-1.c looks like the
> > > > > > important part was to make sure we are not using a masked epilog.
> > > > > >
> > > > > > > Any comments?
> > > > > >
> > > > > > There is now a new costing hook that could make these kind of
> > > > > > "special cases" easier to track (I hope).  By overloading
> > > > > > ix86_vector_costs::add_slp_cost it should be possible to catch
> > > > > > all pieces of the operation in one go, the deconstruction
> > > > > > and scalar ops.  This might make ::add_stmt_cost easier to
> > > > > > follow.  I did want to experiment with that a bit, but didn't
> > > > > > yet get to that.  Maybe you want to give that a try?
> > > > > Sure.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >       * config/i386/i386.cc (ix86_vector_costs): Add
> > > > > > >       m_fold_left_latency_penalty member, initialized to 0.
> > > > > > >       (ix86_vector_costs::add_stmt_cost): Accumulate
> > > > > > >       count * stmt_cost * (ncopies - 1) into
> > > > > > >       m_fold_left_latency_penalty for loop-body vec_deconstruct 
> > > > > > > and
> > > > > > >       scalar_stmt costs of FP FOLD_LEFT_REDUCTION when ncopies > 
> > > > > > > 1.
> > > > > > >       (ix86_vector_costs::finish_cost): Add the accumulated 
> > > > > > > penalty
> > > > > > >       to m_costs[vect_body].
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >       * gcc.target/i386/fold-left-reduc-cost.c: New test.
> > > > > > > ---
> > > > > > >  gcc/config/i386/i386.cc                       | 37 
> > > > > > > +++++++++++++++++++
> > > > > > >  .../gcc.target/i386/fold-left-reduc-cost.c    | 17 +++++++++
> > > > > > >  2 files changed, 54 insertions(+)
> > > > > > >  create mode 100644 
> > > > > > > gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > > > > >
> > > > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > > > index e66958db7ac..6126f265cd3 100644
> > > > > > > --- a/gcc/config/i386/i386.cc
> > > > > > > +++ b/gcc/config/i386/i386.cc
> > > > > > > @@ -26186,6 +26186,8 @@ private:
> > > > > > >    unsigned m_num_avx512_vec_perm[3];
> > > > > > >    /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR  */
> > > > > > >    unsigned m_num_reduc[X86_REDUC_LAST];
> > > > > > > +  /* Scalarization latency penalty for loop-body fold-left 
> > > > > > > reductions.  */
> > > > > > > +  unsigned m_fold_left_latency_penalty;
> > > > > > >    /* Don't do unroll if m_prefer_unroll is false, default is 
> > > > > > > true.  */
> > > > > > >    bool m_prefer_unroll;
> > > > > > >  };
> > > > > > > @@ -26197,6 +26199,7 @@ ix86_vector_costs::ix86_vector_costs 
> > > > > > > (vec_info* vinfo, bool costing_for_scalar)
> > > > > > >      m_num_avx256_vec_perm (),
> > > > > > >      m_num_avx512_vec_perm (),
> > > > > > >      m_num_reduc (),
> > > > > > > +    m_fold_left_latency_penalty (0),
> > > > > > >      m_prefer_unroll (true)
> > > > > > >  {}
> > > > > > >
> > > > > > > @@ -26826,6 +26829,37 @@ ix86_vector_costs::add_stmt_cost (int 
> > > > > > > count, vect_cost_for_stmt kind,
> > > > > > >    if (stmt_cost == -1)
> > > > > > >      stmt_cost = ix86_default_vector_cost (kind, mode);
> > > > > > >
> > > > > > > +  /* x86 has no in-order FP reduction instruction, so a 
> > > > > > > FOLD_LEFT FP
> > > > > > > +     reduction is scalarized into per-lane extracts plus a 
> > > > > > > dependent scalar
> > > > > > > +     add chain (vect_expand_fold_left).  
> > > > > > > vect_model_reduction_cost costs
> > > > > > > +     this as ncopies vec_deconstruct ops and ncopies * nunits 
> > > > > > > scalar_stmt
> > > > > > > +     ops, which captures the per-lane work but not its 
> > > > > > > serialization: the
> > > > > > > +     extracts and adds all feed the same scalar accumulator, 
> > > > > > > while the cost
> > > > > > > +     model treats the lanes as independent.  With ncopies copies 
> > > > > > > the chain
> > > > > > > +     is ncopies times longer, so add the missing (ncopies - 1) * 
> > > > > > > count *
> > > > > > > +     stmt_cost for both kinds; finish_cost folds it in once.  
> > > > > > > vec_deconstruct
> > > > > > > +     is recorded with count == ncopies and scalar_stmt with 
> > > > > > > count == ncopies
> > > > > > > +     * nunits, hence the per-kind ncopies below.  */
> > > > > > > +  if ((kind == vec_deconstruct || kind == scalar_stmt)
> > > > > > > +      && where == vect_body
> > > > > > > +      && fp
> > > > > > > +      && !m_costing_for_scalar
> > > > > > > +      && is_a<loop_vec_info> (m_vinfo)
> > > > > > > +      && node
> > > > > > > +      && vect_reduc_type (m_vinfo, node) == FOLD_LEFT_REDUCTION)
> > > > > > > +    {
> > > > > > > +      unsigned ncopies = (unsigned) count;
> > > > > > > +      if (kind == scalar_stmt)
> > > > > > > +     {
> > > > > > > +       unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> > > > > > > +       ncopies = (ncopies + nunits - 1) / nunits;
> > > > > > > +     }
> > > > > > > +      if (ncopies > 1)
> > > > > > > +     m_fold_left_latency_penalty
> > > > > > > +       += adjust_cost_for_freq (stmt_info, where,
> > > > > > > +                                count * stmt_cost * (ncopies - 
> > > > > > > 1));
> > > > > >
> > > > > > Why multiply by count?  How did you arrive at the scaling factor?
> > > > > >
> > > > > The original design only added a linear cost penalty to scalar_stmt,
> > > > > but it still can't prevent vectorization for the testcase in the PR.
> > > > > So I add a quadratic penalty cost of *count*(count * (ncopies - 1)) to
> > > > > both scalar_stmt and vec_deconstruct. It's more like a benchmark
> > > > > number.
> > > >
> > > > OK, so I think it's reasonable to simply prefer a low VF if there's
> > > > any in-order reduction.  Or is the intention to disable vectorization
> > > > completely here?  If the former then rather than these magic scaling
> > > > factors we might want to adjust 
> > > > better_main_loop_than_p/better_epilogue_loop_than_p
> > > > to explicitly state this intent, outside of the rest of the costing?
> > >
> > > I think it prefers low VF * ncopies, not VF only.
> > > Basically for ncopies == 1, the patch doesn't add any penalty.
> > > But when ncopies> 1, the penalty is like VF * ncopies, it tries to
> > > prevent *vec_unpack_{lo,hi}_expr +  fold_left_reduc* which increases
> > > code size too much.
> >
> > Hmm, but then this is separate from the in-order reduction.  It
> > doesn't change the number of reduction steps.  What does is a
> > larger VF.  Different ncopies at the same VF (so %xmm vs %ymm vs
> > %zmm at same VF) doesn't change any of the arguments against
> > in-order reductions either.
> Eventually it's VF of the smallest data type(char), or VF * ncopies
> for the real reduction type(float).
> It represents the number of scalar ops in the reduction chain.


VF * SLP_TREE_LANES it would be, the VF is the same for all types
(and usually the smallest type determines it).

What we should prefer is the least number of vector lanes involved
in in-order reductions (but possibly w/o resorting to MMX-sized
emulated vector modes?).  So possibly track that overall number
and use it in better_main_loop_than_p/better_epilogue_loop_than_p
rather than inflating costs?

> >
> > >   vect_patt_24.18_95 = [vec_unpack_lo_expr] vect_patt_23.15_89;
> > >   vect_patt_24.18_96 = [vec_unpack_hi_expr] vect_patt_23.15_89;
> > >   vect_patt_24.18_97 = [vec_unpack_lo_expr] vect_patt_23.15_90;
> > >   vect_patt_24.18_98 = [vec_unpack_hi_expr] vect_patt_23.15_90;
> > >   vect__9.19_99 = (vector(8) float) vect_patt_24.18_95;
> > >   vect__9.19_100 = (vector(8) float) vect_patt_24.18_96;
> > >   vect__9.19_101 = (vector(8) float) vect_patt_24.18_97;
> > >   vect__9.19_102 = (vector(8) float) vect_patt_24.18_98;
> > >   stmp_sum_16.20_103 = BIT_FIELD_REF <vect__9.19_99, 32, 0>;
> > >   stmp_sum_16.20_104 = sum_20 + stmp_sum_16.20_103;
> > >   stmp_sum_16.20_105 = BIT_FIELD_REF <vect__9.19_99, 32, 32>;
> > >   stmp_sum_16.20_106 = stmp_sum_16.20_104 + stmp_sum_16.20_105;
> > >   stmp_sum_16.20_107 = BIT_FIELD_REF <vect__9.19_99, 32, 64>;
> > >   stmp_sum_16.20_108 = stmp_sum_16.20_106 + stmp_sum_16.20_107;
> > >   stmp_sum_16.20_109 = BIT_FIELD_REF <vect__9.19_99, 32, 96>;
> > >   stmp_sum_16.20_110 = stmp_sum_16.20_108 + stmp_sum_16.20_109;
> > >   stmp_sum_16.20_111 = BIT_FIELD_REF <vect__9.19_99, 32, 128>;
> > >   stmp_sum_16.20_112 = stmp_sum_16.20_110 + stmp_sum_16.20_111;
> > >   stmp_sum_16.20_113 = BIT_FIELD_REF <vect__9.19_99, 32, 160>;
> > >   stmp_sum_16.20_114 = stmp_sum_16.20_112 + stmp_sum_16.20_113;
> > >   stmp_sum_16.20_115 = BIT_FIELD_REF <vect__9.19_99, 32, 192>;
> > >   stmp_sum_16.20_116 = stmp_sum_16.20_114 + stmp_sum_16.20_115;
> > >   stmp_sum_16.20_117 = BIT_FIELD_REF <vect__9.19_99, 32, 224>;
> > >   stmp_sum_16.20_118 = stmp_sum_16.20_116 + stmp_sum_16.20_117;
> > >   stmp_sum_16.20_119 = BIT_FIELD_REF <vect__9.19_100, 32, 0>;
> > >   stmp_sum_16.20_120 = stmp_sum_16.20_118 + stmp_sum_16.20_119;
> > >   stmp_sum_16.20_121 = BIT_FIELD_REF <vect__9.19_100, 32, 32>;
> > >   stmp_sum_16.20_122 = stmp_sum_16.20_120 + stmp_sum_16.20_121;
> > >   stmp_sum_16.20_123 = BIT_FIELD_REF <vect__9.19_100, 32, 64>;
> > >   stmp_sum_16.20_124 = stmp_sum_16.20_122 + stmp_sum_16.20_123;
> > >   stmp_sum_16.20_125 = BIT_FIELD_REF <vect__9.19_100, 32, 96>;
> > >   stmp_sum_16.20_126 = stmp_sum_16.20_124 + stmp_sum_16.20_125;
> > >   stmp_sum_16.20_127 = BIT_FIELD_REF <vect__9.19_100, 32, 128>;
> > >   stmp_sum_16.20_128 = stmp_sum_16.20_126 + stmp_sum_16.20_127;
> > >   stmp_sum_16.20_129 = BIT_FIELD_REF <vect__9.19_100, 32, 160>;
> > >   stmp_sum_16.20_130 = stmp_sum_16.20_128 + stmp_sum_16.20_129;
> > >   stmp_sum_16.20_131 = BIT_FIELD_REF <vect__9.19_100, 32, 192>;
> > >   stmp_sum_16.20_132 = stmp_sum_16.20_130 + stmp_sum_16.20_131;
> > >   stmp_sum_16.20_133 = BIT_FIELD_REF <vect__9.19_100, 32, 224>;
> > >   stmp_sum_16.20_134 = stmp_sum_16.20_132 + stmp_sum_16.20_133;
> > >   stmp_sum_16.20_135 = BIT_FIELD_REF <vect__9.19_101, 32, 0>;
> > >   stmp_sum_16.20_136 = stmp_sum_16.20_134 + stmp_sum_16.20_135;
> > >   stmp_sum_16.20_137 = BIT_FIELD_REF <vect__9.19_101, 32, 32>;
> > >   stmp_sum_16.20_138 = stmp_sum_16.20_136 + stmp_sum_16.20_137;
> > >   stmp_sum_16.20_139 = BIT_FIELD_REF <vect__9.19_101, 32, 64>;
> > >   stmp_sum_16.20_140 = stmp_sum_16.20_138 + stmp_sum_16.20_139;
> > >   stmp_sum_16.20_141 = BIT_FIELD_REF <vect__9.19_101, 32, 96>;
> > >   stmp_sum_16.20_142 = stmp_sum_16.20_140 + stmp_sum_16.20_141;
> > >   stmp_sum_16.20_143 = BIT_FIELD_REF <vect__9.19_101, 32, 128>;
> > >   stmp_sum_16.20_144 = stmp_sum_16.20_142 + stmp_sum_16.20_143;
> > >   stmp_sum_16.20_145 = BIT_FIELD_REF <vect__9.19_101, 32, 160>;
> > >   stmp_sum_16.20_146 = stmp_sum_16.20_144 + stmp_sum_16.20_145;
> > >   stmp_sum_16.20_147 = BIT_FIELD_REF <vect__9.19_101, 32, 192>;
> > >   stmp_sum_16.20_148 = stmp_sum_16.20_146 + stmp_sum_16.20_147;
> > >   stmp_sum_16.20_149 = BIT_FIELD_REF <vect__9.19_101, 32, 224>;
> > >   stmp_sum_16.20_150 = stmp_sum_16.20_148 + stmp_sum_16.20_149;
> > >   stmp_sum_16.20_151 = BIT_FIELD_REF <vect__9.19_102, 32, 0>;
> > >   stmp_sum_16.20_152 = stmp_sum_16.20_150 + stmp_sum_16.20_151;
> > >   stmp_sum_16.20_153 = BIT_FIELD_REF <vect__9.19_102, 32, 32>;
> > >   stmp_sum_16.20_154 = stmp_sum_16.20_152 + stmp_sum_16.20_153;
> > >   stmp_sum_16.20_155 = BIT_FIELD_REF <vect__9.19_102, 32, 64>;
> > >   stmp_sum_16.20_156 = stmp_sum_16.20_154 + stmp_sum_16.20_155;
> > >   stmp_sum_16.20_157 = BIT_FIELD_REF <vect__9.19_102, 32, 96>;
> > >   stmp_sum_16.20_158 = stmp_sum_16.20_156 + stmp_sum_16.20_157;
> > >   stmp_sum_16.20_159 = BIT_FIELD_REF <vect__9.19_102, 32, 128>;
> > >   stmp_sum_16.20_160 = stmp_sum_16.20_158 + stmp_sum_16.20_159;
> > >   stmp_sum_16.20_161 = BIT_FIELD_REF <vect__9.19_102, 32, 160>;
> > >   stmp_sum_16.20_162 = stmp_sum_16.20_160 + stmp_sum_16.20_161;
> > >   stmp_sum_16.20_163 = BIT_FIELD_REF <vect__9.19_102, 32, 192>;
> > >   stmp_sum_16.20_164 = stmp_sum_16.20_162 + stmp_sum_16.20_163;
> > >   stmp_sum_16.20_165 = BIT_FIELD_REF <vect__9.19_102, 32, 224>;
> > >   sum_16 = stmp_sum_16.20_164 + stmp_sum_16.20_165;
> > >
> > >
> > >
> > > >
> > > > > > So you are penaltizing the deconstruction and the scalar op the 
> > > > > > same?
> > > > > > The 'cost' for the deconstruction is the overall work, it's latency
> > > > > > is lower when there's infinite throughput and in particular we do
> > > > > > not have to wait for all parts to be ready but can interleave the
> > > > > > scalar operations with the deconstructed lanes arriving, improving
> > > > > > instruction issue and avoiding throughput bottlenecks.
> > > > > >
> > > > > > That said, I know we're doing the above, but we're mixing
> > > > > > latency and number-of-overall-ops (which is what 'cost' is 
> > > > > > accumulating)
> > > > > Yes, I think we already mix latency and throughput for the vectorized 
> > > > > stmt cost.
> > > > > >
> > > > > > > +    }
> > > > > > > +
> > > > > > >    /* BIT_FIELD_REF <vect_**, 64, 0> with count 0 costs 0 in 
> > > > > > > body.  */
> > > > > > >    if (kind == vec_perm && vectype && count != 0)
> > > > > > >      {
> > > > > > > @@ -26960,6 +26994,9 @@ ix86_vector_costs::finish_cost (const 
> > > > > > > vector_costs *scalar_costs)
> > > > > > >
> > > > > > >      }
> > > > > > >
> > > > > > > +  if (m_fold_left_latency_penalty && m_costs[vect_body] != 
> > > > > > > INT_MAX)
> > > > > > > +    m_costs[vect_body] += m_fold_left_latency_penalty;
> > > > > > > +
> > > > > >
> > > > > > So you are just adding that here - why delay that?  This way it 
> > > > > > doesn't
> > > > > > appear in any of the dumps.  An alternative might be to just track
> > > > > > whether there is a fold-left reduction (or how many "SSE bits" of 
> > > > > > it)
> > > > > > and use that as a "tie breaker" in
> > > > > > better_main_loop_than_p/better_epilogue_loop_than_p?
> > > > > >
> > > > > > IMO we might want to (again, did so a few years ago without 
> > > > > > success) try
> > > > > > to build a dependence DAG with uops we create during costing and at
> > > > > > finish_cost time compute overall latency.  When I attempted this I
> > > > > > made this _very_ simple, assuming infinite throughput and likely too
> > > > > > coarse grained "ops" - it was as much GIGO as the current scheme.
> > > > > I can let AI play with it to see if we can have some findings ;)
> > > >
> > > > Heh, wish you luck.  I didn't yet have much of that when trying AI
> > > > for prototyping of ideas ;)
> > > >
> > > > Richard.
> > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > > >    ix86_vect_estimate_reg_pressure ();
> > > > > > >
> > > > > > >    for (int i = 0; i != 3; i++)
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c 
> > > > > > > b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000000..5f26f968429
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/fold-left-reduc-cost.c
> > > > > > > @@ -0,0 +1,17 @@
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-options "-O2 -march=x86-64-v3 -fdump-tree-vect-details" 
> > > > > > > } */
> > > > > > > +/* { dg-final { scan-tree-dump "vectorized 0 loops in function" 
> > > > > > > "vect" } } */
> > > > > > > +
> > > > > > > +/* Loop driven by a uint8_t load co-located with float ops makes 
> > > > > > > the
> > > > > > > +   vectorizer pick a large VF (e.g. VF=32) so that ncopies > 1 
> > > > > > > for the
> > > > > > > +   FOLD_LEFT FP reduction.  The per-copy scalar-add chain 
> > > > > > > serialization
> > > > > > > +   should make vectorization unprofitable.  */
> > > > > > > +
> > > > > > > +float
> > > > > > > +foo (char *a, char *b, int n)
> > > > > > > +{
> > > > > > > +  float sum = 0;
> > > > > > > +  for (int i = 0; i != n; i++)
> > > > > > > +    sum += a[i] * b[i];
> > > > > > > +  return sum;
> > > > > > > +}
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Richard Biener <[email protected]>
> > > > > > SUSE Software Solutions Germany GmbH,
> > > > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> > > > > > Nuernberg)
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> > > > Nuernberg)
> > >
> > >
> > >
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG 
> > Nuernberg)
> 
> 
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH] i386: Penalize fold-left FP reduction scalarization by ncopies

Reply via email to