SAD_EXPR

Hongtao Liu Mon, 18 Aug 2025 01:36:09 -0700

On Mon, Aug 11, 2025 at 8:57 PM Richard Biener <[email protected]> wrote:
>
> On Sun, 10 Aug 2025, liuhongt wrote:
>
> > >
> > > The comment doesn't match the bool type.
> > >
> > Fixed.
> >
> > >
> > > is_gimple_assign (stmt_info->stmt)
> > >
> > Changed.
> >
> > > There's also SAD_EXPR?  The vectorizer has lane_reducing_op_p ()
> > > for this that also lists WIDEN_SUM_EXPR.
> > Added SAD_EXPR since x86 supports usad{v16qi, v32qi, v64qi}.
> > WIDEN_SUM_EXPR is not added since x86 doesn't support the optab.
> >
> > > is issue rate a good measure here?  I think for the given
> > > operation it's more like the number of ops that can be
> > > issued in parallel (like 2 for FMA) times the latency
> > > (like 3), thus the number of op that can be in flight?
> > Yes, it's better to have instruction latency times throughput.
> > The patch adds a new member to processor_cost: reduc_lat_mult_thr
> > which should be latency times throughput.
> > i.e.
> > For FMA, latency is 4 and throughput is 2, so reduc_lat_mult_thr is 8;
> > if there's 1 FMA for reduction then the unroll factor is 8 / 1 = 8.
> >
> > There's also a vect_unroll_limit; the final suggested_unroll_factor is
> > set as MIN (vect_unroll_limit, 8).
> > The vect_unroll_limit is mainly for register pressure, to avoid too many
> > spills.
> > Ideally, all instructions in the vectorized loop should be used to
> > determine unroll_factor with their (latency * throughput) / number,
> > but that would be too much for this patch, and may just be GIGO, so the
> > patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
> > SAD_EXPR.
> >
> > For the latest AMD/Intel processors, reduc_lat_mult_thr is set according to
> > the instruction tables by Agner Fog,
> > i.e.
> >
> > reduc_lat_mult_thr of Zen4 is set as
> > (8, 8, 6)
> >
> > FMA: latency is 4 cycles, throughput is 2.
> > VPDPBUSD: latency is 4 cycles, throughput is 2.
> > VPSADBW: latency is 3 cycles, throughput is 2.
> >
> > reduc_lat_mult_thr of SPR is set as
> > (8, 10, 3)
> >
> > FMA: latency is 4 cycles, throughput is 2.
> > VPDPBUSD: latency is 5 cycles, throughput is 2.
> > VPSADBW: latency is 3 cycles, throughput is 1.
> >
> > > ix86_issue_rate should only be a very rough approximation of
> > > that?  I suppose we should have separate tuning entries
> > > for this, like one for the number of FMAC units and the
> > > FMAC units (best case) latency?  As for SAD_EXPR that
> > > would be an integer op, that probably goes to a different
> > > pipeline unit.
> > >
> > > In general we should have a look at register pressure, I
> > > suppose issue_rate / m_num_reductions ensures we're never
> > > getting close to this in practice.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> This looks reasonable from my side now.  Please give Honza the
> chance to chime in.


Any comments Honza?
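
For anyone following the thread, the unroll-factor rule described in the
quoted text can be modelled as a small standalone program. This is only a
sketch under stated assumptions, not the patch itself: the names ReducKind,
PerKind, ceil_div, round_up_pow2 and model_unroll_factor are hypothetical,
and returning 1 when nothing qualifies is an assumption standing in for
"leave the suggested unroll factor untouched".

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical standalone model of the rule discussed in this thread.
// None of these names come from the GCC sources.
enum ReducKind { REDUC_FMA, REDUC_DOT_PROD, REDUC_SAD, REDUC_LAST };
using PerKind = std::array<unsigned, REDUC_LAST>;

// Ceiling division, playing the role of the CEIL macro in the patch.
static unsigned ceil_div (unsigned a, unsigned b)
{
  return (a + b - 1) / b;
}

// Smallest power of two >= x, i.e. the value of 1 << ceil_log2 (x).
static unsigned round_up_pow2 (unsigned x)
{
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// lat_mult_thr: latency * throughput per reduction kind.
// num_reduc:    number of counted reductions per kind in the loop body.
// unroll_limit: cap intended to keep register pressure in check.
// Assumption: returns 1 (no unrolling) when no reduction was counted or
// when the limit is <= 1.
unsigned model_unroll_factor (const PerKind &lat_mult_thr,
                              const PerKind &num_reduc,
                              unsigned unroll_limit)
{
  bool any_reduc = std::any_of (num_reduc.begin (), num_reduc.end (),
                                [] (unsigned n) { return n != 0; });
  if (!any_reduc || unroll_limit <= 1)
    return 1;

  unsigned factor = unroll_limit;
  for (int i = 0; i != REDUC_LAST; i++)
    if (num_reduc[i])
      factor = std::min (factor, ceil_div (lat_mult_thr[i], num_reduc[i]));

  // The patch rounds the result up to a power of two.
  return round_up_pow2 (factor);
}
```

With the Zen4 triple (8, 8, 6) and a limit of 4, a loop with a single FMA
reduction gives MIN (4, 8) = 4 in this model. With the SPR triple (8, 10, 3)
and a single SAD_EXPR, the minimum is 3, which the final rounding step turns
back into 4.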

>
> Thanks,
> Richard.
>
> >
> > The patch tries to unroll the vectorized loop when there are
> > FMA/DOT_PROD_EXPR/SAD_EXPR reductions; this breaks the cross-iteration
> > dependence and enables more parallelism (since the vectorizer will also
> > enable partial sums).
> >
> > When there's gather/scatter or scalarization in the loop, don't do the
> > unroll since the performance bottleneck is not at the reduction.
> >
> > The unroll factor is set according to FMA/DOT_PROD_EXPR/SAD_EXPR as
> > CEIL ((latency * throughput), num_of_reduction),
> > i.e.
> > for FMA, latency is 4 and throughput is 2; if there's 1 FMA for reduction
> > then the unroll factor is 2 * 4 / 1 = 8.
> >
> > There's also a vect_unroll_limit; the final suggested_unroll_factor is
> > set as MIN (vect_unroll_limit, 8).
> >
> > The vect_unroll_limit is mainly for register pressure, to avoid too many
> > spills.
> > Ideally, all instructions in the vectorized loop should be used to
> > determine unroll_factor with their (latency * throughput) / number,
> > but that would be too much for this patch, and may just be GIGO, so the
> > patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR,
> > SAD_EXPR.
> >
> > Note that when DOT_PROD_EXPR is not natively supported,
> > m_num_reduc += 3 * count, which almost prevents unrolling.
> >
> > There's a performance boost for simple benchmarks with a DOT_PROD_EXPR/FMA
> > chain, and a slight improvement in SPEC2017 performance.
> >
> > gcc/ChangeLog:
> >
> >       * config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
> >       Add new members m_num_reduc and m_prefer_unroll.
> >       (ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and
> >       m_num_reduc.
> >       (ix86_vector_costs::finish_cost): Determine
> >       m_suggested_unroll_factor with consideration of
> >       reduc_lat_mult_thr, m_num_reduc and
> >       ix86_vect_unroll_limit.
> >       * config/i386/i386.h (enum ix86_reduc_unroll_factor): New
> >       enum.
> >       (processor_costs): Add reduc_lat_mult_thr and
> >       vect_unroll_limit.
> >       * config/i386/x86-tune-costs.h: Initialize
> >       reduc_lat_mult_thr and vect_unroll_limit.
> >       * config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/i386/vect_unroll-1.c: New test.
> >       * gcc.target/i386/vect_unroll-2.c: New test.
> >       * gcc.target/i386/vect_unroll-3.c: New test.
> >       * gcc.target/i386/vect_unroll-4.c: New test.
> >       * gcc.target/i386/vect_unroll-5.c: New test.
> >       * gcc.target/i386/vect_unroll-6.c: New test.
> > ---
> >  gcc/config/i386/i386.cc                       | 165 ++++++++++++++-
> >  gcc/config/i386/i386.h                        |  16 ++
> >  gcc/config/i386/i386.opt                      |   4 +
> >  gcc/config/i386/x86-tune-costs.h              | 192 ++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-1.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-2.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-3.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-4.c |  12 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-5.c |  13 ++
> >  gcc/testsuite/gcc.target/i386/vect_unroll-6.c |  12 ++
> >  10 files changed, 447 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 49bd3939eb4..1961c9c7883 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -25762,15 +25762,20 @@ private:
> >    unsigned m_num_sse_needed[3];
> >    /* Number of 256-bit vector permutation.  */
> >    unsigned m_num_avx256_vec_perm[3];
> > +  /* Number of reductions for FMA/DOT_PROD_EXPR/SAD_EXPR  */
> > +  unsigned m_num_reduc[X86_REDUC_LAST];
> > +  /* Don't do unroll if m_prefer_unroll is false, default is true.  */
> > +  bool m_prefer_unroll;
> >  };
> >
> >  ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar)
> >    : vector_costs (vinfo, costing_for_scalar),
> >      m_num_gpr_needed (),
> >      m_num_sse_needed (),
> > -    m_num_avx256_vec_perm ()
> > -{
> > -}
> > +    m_num_avx256_vec_perm (),
> > +    m_num_reduc (),
> > +    m_prefer_unroll (true)
> > +{}
> >
> >  /* Implement targetm.vectorize.create_costs.  */
> >
> > @@ -26067,6 +26072,125 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >       }
> >      }
> >
> > +  /* Record number of load/store/gather/scatter in vectorized body.  */
> > +  if (where == vect_body && !m_costing_for_scalar)
> > +    {
> > +      switch (kind)
> > +     {
> > +       /* Emulated gather/scatter or any scalarization.  */
> > +     case scalar_load:
> > +     case scalar_stmt:
> > +     case scalar_store:
> > +     case vector_gather_load:
> > +     case vector_scatter_store:
> > +       m_prefer_unroll = false;
> > +       break;
> > +
> > +     case vector_stmt:
> > +     case vec_to_scalar:
> > +       /* Count number of reduction FMA and "real" DOT_PROD_EXPR,
> > +          unroll in the vectorizer will enable partial sum.  */
> > +       if (stmt_info
> > +           && vect_is_reduction (stmt_info)
> > +           && stmt_info->stmt)
> > +         {
> > +           /* Handle __builtin_fma.  */
> > +           if (gimple_call_combined_fn (stmt_info->stmt) == CFN_FMA)
> > +             {
> > +               m_num_reduc[X86_REDUC_FMA] += count;
> > +               break;
> > +             }
> > +
> > +           if (!is_gimple_assign (stmt_info->stmt))
> > +             break;
> > +
> > +           tree_code subcode = gimple_assign_rhs_code (stmt_info->stmt);
> > +           machine_mode inner_mode = GET_MODE_INNER (mode);
> > +           tree rhs1, rhs2;
> > +           bool native_vnni_p = true;
> > +           gimple* def;
> > +           machine_mode mode_rhs;
> > +           switch (subcode)
> > +             {
> > +             case PLUS_EXPR:
> > +             case MINUS_EXPR:
> > +               if (!fp || !flag_associative_math
> > +                   || flag_fp_contract_mode != FP_CONTRACT_FAST)
> > +                 break;
> > +
> > +               /* FMA condition for different modes.  */
> > +               if (((inner_mode == DFmode || inner_mode == SFmode)
> > +                    && !TARGET_FMA && !TARGET_AVX512VL)
> > +                   || (inner_mode == HFmode && !TARGET_AVX512FP16)
> > +                   || (inner_mode == BFmode && !TARGET_AVX10_2))
> > +                 break;
> > +
> > +               /* MULT_EXPR + PLUS_EXPR/MINUS_EXPR is transformed
> > +                  to FMA/FNMA after vectorization.  */
> > +               rhs1 = gimple_assign_rhs1 (stmt_info->stmt);
> > +               rhs2 = gimple_assign_rhs2 (stmt_info->stmt);
> > +               if (subcode == PLUS_EXPR
> > +                   && TREE_CODE (rhs1) == SSA_NAME
> > +                   && (def = SSA_NAME_DEF_STMT (rhs1), true)
> > +                   && is_gimple_assign (def)
> > +                   && gimple_assign_rhs_code (def) == MULT_EXPR)
> > +                 m_num_reduc[X86_REDUC_FMA] += count;
> > +               else if (TREE_CODE (rhs2) == SSA_NAME
> > +                        && (def = SSA_NAME_DEF_STMT (rhs2), true)
> > +                        && is_gimple_assign (def)
> > +                        && gimple_assign_rhs_code (def) == MULT_EXPR)
> > +                 m_num_reduc[X86_REDUC_FMA] += count;
> > +               break;
> > +
> > +               /* Vectorizer lane_reducing_op_p supports DOT_PROD_EXPR,
> > +                  WIDEN_SUM_EXPR and SAD_EXPR, x86 backend only supports
> > +                  SAD_EXPR (usad{v16qi,v32qi,v64qi}) and DOT_PROD_EXPR.  */
> > +             case DOT_PROD_EXPR:
> > +               rhs1 = gimple_assign_rhs1 (stmt_info->stmt);
> > +               mode_rhs = TYPE_MODE (TREE_TYPE (rhs1));
> > +               if (mode_rhs == QImode)
> > +                 {
> > +                   rhs2 = gimple_assign_rhs2 (stmt_info->stmt);
> > +                   signop signop1_p = TYPE_SIGN (TREE_TYPE (rhs1));
> > +                   signop signop2_p = TYPE_SIGN (TREE_TYPE (rhs2));
> > +
> > +                   /* vpdpbusd.  */
> > +                   if (signop1_p != signop2_p)
> > +                     native_vnni_p
> > +                       = (GET_MODE_SIZE (mode) == 64
> > +                          ? TARGET_AVX512VNNI
> > +                          : ((TARGET_AVX512VNNI && TARGET_AVX512VL)
> > +                             || TARGET_AVXVNNI));
> > +                   else
> > +                     /* vpdpbssd.  */
> > +                     native_vnni_p
> > +                       = (GET_MODE_SIZE (mode) == 64
> > +                          ? TARGET_AVX10_2
> > +                          : (TARGET_AVXVNNIINT8 || TARGET_AVX10_2));
> > +                 }
> > +               m_num_reduc[X86_REDUC_DOT_PROD] += count;
> > +
> > +               /* Dislike to do unroll and partial sum for
> > +                  emulated DOT_PROD_EXPR.  */
> > +               if (!native_vnni_p)
> > +                 m_num_reduc[X86_REDUC_DOT_PROD] += 3 * count;
> > +               break;
> > +
> > +             case SAD_EXPR:
> > +               m_num_reduc[X86_REDUC_SAD] += count;
> > +               break;
> > +
> > +             default:
> > +               break;
> > +             }
> > +         }
> > +
> > +     default:
> > +       break;
> > +     }
> > +    }
> > +
> > +
> >    combined_fn cfn;
> >    if ((kind == vector_stmt || kind == scalar_stmt)
> >        && stmt_info
> > @@ -26282,6 +26406,41 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> >         && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> >             > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> >       m_costs[vect_body] = INT_MAX;
> > +
> > +      bool any_reduc_p = false;
> > +      for (int i = 0; i != X86_REDUC_LAST; i++)
> > +     if (m_num_reduc[i])
> > +       {
> > +         any_reduc_p = true;
> > +         break;
> > +       }
> > +
> > +      if (any_reduc_p
> > +       /* Not much gain for loop with gather and scatter.  */
> > +       && m_prefer_unroll
> > +       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> > +     {
> > +       unsigned unroll_factor
> > +         = OPTION_SET_P (ix86_vect_unroll_limit)
> > +         ? ix86_vect_unroll_limit
> > +         : ix86_cost->vect_unroll_limit;
> > +
> > +       if (unroll_factor > 1)
> > +         {
> > +           for (int i = 0 ; i != X86_REDUC_LAST; i++)
> > +             {
> > +               if (m_num_reduc[i])
> > +                 {
> > +                   unsigned tmp = CEIL (ix86_cost->reduc_lat_mult_thr[i],
> > +                                        m_num_reduc[i]);
> > +                   unroll_factor = MIN (unroll_factor, tmp);
> > +                 }
> > +             }
> > +
> > +           m_suggested_unroll_factor  = 1 << ceil_log2 (unroll_factor);
> > +         }
> > +     }
> > +
> >      }
> >
> >    ix86_vect_estimate_reg_pressure ();
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 791f3b9e133..817bf665c40 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -102,6 +102,15 @@ struct stringop_algs
> >  #define COSTS_N_BYTES(N) ((N) * 2)
> >  #endif
> >
> > +
> > +enum ix86_reduc_unroll_factor{
> > +  X86_REDUC_FMA,
> > +  X86_REDUC_DOT_PROD,
> > +  X86_REDUC_SAD,
> > +
> > +  X86_REDUC_LAST
> > +};
> > +
> >  /* Define the specific costs for a given cpu.  NB: hard_register is used
> >     by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
> >     hard register move costs by register allocator.  Relative costs of
> > @@ -225,6 +234,13 @@ struct processor_costs {
> >                                  to number of instructions executed in
> >                                  parallel.  See also
> >                                  ix86_reassociation_width.  */
> > +  const unsigned reduc_lat_mult_thr[X86_REDUC_LAST];
> > +                             /* Latency times throughput of
> > +                                FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                it's used to determine unroll
> > +                                factor in the vectorizer.  */
> > +  const unsigned vect_unroll_limit;    /* Limit how much the autovectorizer
> > +                                       may unroll a loop.  */
> >    struct stringop_algs *memcpy, *memset;
> >    const int cond_taken_branch_cost;    /* Cost of taken branch for 
> > vectorizer
> >                                         cost model.  */
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index c93c0b1bb38..6bda22f4843 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -1246,6 +1246,10 @@ munroll-only-small-loops
> >  Target Var(ix86_unroll_only_small_loops) Init(0) Optimization
> >  Enable conservative small loop unrolling.
> >
> > +-param=ix86-vect-unroll-limit=
> > +Target Joined UInteger Var(ix86_vect_unroll_limit) Init(4) Param
> > +Limit how much the autovectorizer may unroll a loop.
> > +
> >  mlam=
> >  Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) 
> > Init(lam_none)
> >  -mlam=[none|u48|u57] Instrument meta data position in user data pointers.
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index c8603b982af..1649ea2fe3e 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -141,6 +141,12 @@ struct processor_costs ix86_size_cost = {/* costs for 
> > tuning for size */
> >    COSTS_N_BYTES (4),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    ix86_size_memcpy,
> >    ix86_size_memset,
> >    COSTS_N_BYTES (1),                 /* cond_taken_branch_cost.  */
> > @@ -261,6 +267,12 @@ struct processor_costs i386_cost = {     /* 386 
> > specific costs */
> >    COSTS_N_INSNS (27),                        /* cost of CVTPI2PS 
> > instruction.  */
> >    COSTS_N_INSNS (27),                        /* cost of CVT(T)PS2PI 
> > instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    i386_memcpy,
> >    i386_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -382,6 +394,12 @@ struct processor_costs i486_cost = {     /* 486 
> > specific costs */
> >    COSTS_N_INSNS (27),                        /* cost of CVTPI2PS 
> > instruction.  */
> >    COSTS_N_INSNS (27),                        /* cost of CVT(T)PS2PI 
> > instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    i486_memcpy,
> >    i486_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -501,6 +519,12 @@ struct processor_costs pentium_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    pentium_memcpy,
> >    pentium_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -613,6 +637,12 @@ struct processor_costs lakemont_cost = {
> >    COSTS_N_INSNS (5),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (5),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    pentium_memcpy,
> >    pentium_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -740,6 +770,12 @@ struct processor_costs pentiumpro_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    pentiumpro_memcpy,
> >    pentiumpro_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -858,6 +894,12 @@ struct processor_costs geode_cost = {
> >    COSTS_N_INSNS (6),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (6),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    geode_memcpy,
> >    geode_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -979,6 +1021,12 @@ struct processor_costs k6_cost = {
> >    COSTS_N_INSNS (2),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (2),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    k6_memcpy,
> >    k6_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -1101,6 +1149,12 @@ struct processor_costs athlon_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (6),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    athlon_memcpy,
> >    athlon_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -1232,6 +1286,12 @@ struct processor_costs k8_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (5),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    k8_memcpy,
> >    k8_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -1371,6 +1431,12 @@ struct processor_costs amdfam10_cost = {
> >    COSTS_N_INSNS (7),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    amdfam10_memcpy,
> >    amdfam10_memset,
> >    COSTS_N_INSNS (2),                 /* cond_taken_branch_cost.  */
> > @@ -1503,6 +1569,12 @@ const struct processor_costs bdver_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  
> > */
> >    1, 2, 1, 1,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    bdver_memcpy,
> >    bdver_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -1668,6 +1740,12 @@ struct processor_costs znver1_cost = {
> >       plus/minus operations per cycle but only one multiply.  This is 
> > adjusted
> >       in ix86_reassociation_width.  */
> >    4, 4, 3, 6,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {5, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    znver1_memcpy,
> >    znver1_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -1836,6 +1914,12 @@ struct processor_costs znver2_cost = {
> >       plus/minus operations per cycle but only one multiply.  This is 
> > adjusted
> >       in ix86_reassociation_width.  */
> >    4, 4, 3, 6,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {10, 1, 3},                                /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    znver2_memcpy,
> >    znver2_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -1979,6 +2063,12 @@ struct processor_costs znver3_cost = {
> >       plus/minus operations per cycle but only one multiply.  This is 
> > adjusted
> >       in ix86_reassociation_width.  */
> >    4, 4, 3, 6,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {8, 1, 6},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    znver2_memcpy,
> >    znver2_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -2125,6 +2215,12 @@ struct processor_costs znver4_cost = {
> >       plus/minus operations per cycle but only one multiply.  This is 
> > adjusted
> >       in ix86_reassociation_width.  */
> >    4, 4, 3, 6,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {8, 8, 6},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    znver2_memcpy,
> >    znver2_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -2287,6 +2383,12 @@ struct processor_costs znver5_cost = {
> >       We increase width to 6 for multiplications
> >       in ix86_reassociation_width.  */
> >    6, 6, 4, 6,                                /* reassoc int, fp, vec_int, 
> > vec_fp.  */
> > +  {8, 8, 6},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    znver2_memcpy,
> >    znver2_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -2422,6 +2524,12 @@ struct processor_costs skylake_cost = {
> >    COSTS_N_INSNS (6),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (7),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 2, 2,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    skylake_memcpy,
> >    skylake_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -2559,6 +2667,12 @@ struct processor_costs icelake_cost = {
> >    COSTS_N_INSNS (7),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (6),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 2, 2,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 10, 3},                                /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    icelake_memcpy,
> >    icelake_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -2690,6 +2804,12 @@ struct processor_costs alderlake_cost = {
> >    COSTS_N_INSNS (7),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (6),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 8, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    alderlake_memcpy,
> >    alderlake_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -2814,6 +2934,12 @@ const struct processor_costs btver1_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    btver1_memcpy,
> >    btver1_memset,
> >    COSTS_N_INSNS (2),                 /* cond_taken_branch_cost.  */
> > @@ -2935,6 +3061,12 @@ const struct processor_costs btver2_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    btver2_memcpy,
> >    btver2_memset,
> >    COSTS_N_INSNS (2),                 /* cond_taken_branch_cost.  */
> > @@ -3055,6 +3187,12 @@ struct processor_costs pentium4_cost = {
> >    COSTS_N_INSNS (12),                        /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (8),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    pentium4_memcpy,
> >    pentium4_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -3178,6 +3316,12 @@ struct processor_costs nocona_cost = {
> >    COSTS_N_INSNS (12),                        /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (8),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 1, 1, 1,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {1, 1, 1},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    nocona_memcpy,
> >    nocona_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -3299,6 +3443,12 @@ struct processor_costs atom_cost = {
> >    COSTS_N_INSNS (6),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  */
> >    2, 2, 2, 2,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 8, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  2,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    atom_memcpy,
> >    atom_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -3420,6 +3570,12 @@ struct processor_costs slm_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 2, 1, 1,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 8, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    slm_memcpy,
> >    slm_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -3555,6 +3711,12 @@ struct processor_costs tremont_cost = {
> >    COSTS_N_INSNS (4),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (4),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    tremont_memcpy,
> >    tremont_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -3681,6 +3843,12 @@ struct processor_costs lujiazui_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    lujiazui_memcpy,
> >    lujiazui_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -3805,6 +3973,12 @@ struct processor_costs yongfeng_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  */
> >    4, 4, 4, 4,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    yongfeng_memcpy,
> >    yongfeng_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -3929,6 +4103,12 @@ struct processor_costs shijidadao_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  */
> >    4, 4, 4, 4,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    shijidadao_memcpy,
> >    shijidadao_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > @@ -4078,6 +4258,12 @@ struct processor_costs generic_cost = {
> >    COSTS_N_INSNS (3),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (3),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 3, 3,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 8, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  4,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    generic_memcpy,
> >    generic_memset,
> >    COSTS_N_INSNS (4),                 /* cond_taken_branch_cost.  */
> > @@ -4215,6 +4401,12 @@ struct processor_costs core_cost = {
> >    COSTS_N_INSNS (6),                 /* cost of CVTPI2PS instruction.  */
> >    COSTS_N_INSNS (7),                 /* cost of CVT(T)PS2PI instruction.  */
> >    1, 4, 2, 2,                                /* reassoc int, fp, vec_int, vec_fp.  */
> > +  {8, 1, 3},                         /* latency times throughput of
> > +                                        FMA/DOT_PROD_EXPR/SAD_EXPR,
> > +                                        it's used to determine unroll
> > +                                        factor in the vectorizer.  */
> > +  1,                                 /* Limit how much the autovectorizer
> > +                                        may unroll a loop.  */
> >    core_memcpy,
> >    core_memset,
> >    COSTS_N_INSNS (3),                 /* cond_taken_branch_cost.  */
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-1.c b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> > new file mode 100644
> > index 00000000000..2e294d3aea6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-1.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-2.c b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> > new file mode 100644
> > index 00000000000..069f7d37ae7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfnmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum -= a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-3.c b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> > new file mode 100644
> > index 00000000000..6860c2ffbd5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-3.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavxvnni -O3" } */
> > +/* { dg-final { scan-assembler-times {(?n)vpdpbusd[^\n]*ymm} 4 } } */
> > +
> > +int
> > +foo (unsigned char* a, char* b, int n)
> > +{
> > +  int sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-4.c b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> > new file mode 100644
> > index 00000000000..01d8af67b6e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-4.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -O3 -mno-avxvnni" } */
> > +/* { dg-final { scan-assembler-times {(?n)vpmaddwd[^\n]*ymm} 4 } } */
> > +
> > +int
> > +foo (unsigned char* a, char* b, int n)
> > +{
> > +  int sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[i] * b[i];
> > +  return sum;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-5.c b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> > new file mode 100644
> > index 00000000000..c6375b1bc8d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-5.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast -mgather" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 1 } } */
> > +
> > +float
> > +foo (float* a, int* b, float* c, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum += a[b[i]] *c[i];
> > +  return sum;
> > +}
> > +
> > diff --git a/gcc/testsuite/gcc.target/i386/vect_unroll-6.c b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> > new file mode 100644
> > index 00000000000..b64c2fbde57
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect_unroll-6.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64-v3 -Ofast" } */
> > +/* { dg-final { scan-assembler-times {(?n)vfmadd[1-3]*ps[^\n]*ymm} 4 } } */
> > +
> > +float
> > +foo (float* a, float* b, int n)
> > +{
> > +  float sum = 0;
> > +  for (int i = 0; i != n; i++)
> > +    sum = __builtin_fma (a[i], b[i], sum);
> > +  return sum;
> > +}
> >
>
> --
> Richard Biener <[email protected]>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
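
For reference, the rule that these per-CPU tables feed can be sketched in a few lines of C. This is only a minimal sketch of the formula described earlier in the thread (latency times throughput, divided by the number of reduction ops, capped by the unroll limit); it is not the code from the patch. The struct, enum, and function names are hypothetical, and only the field names reduc_lat_mult_thr and vect_unroll_limit and the table values come from the hunks above.

```c
/* Hypothetical mirror of the two fields the patch adds to
   processor_costs; only the field names come from the thread.  */
struct unroll_tuning
{
  unsigned reduc_lat_mult_thr[3]; /* FMA, DOT_PROD_EXPR, SAD_EXPR.  */
  unsigned vect_unroll_limit;     /* Cap, mainly for register pressure.  */
};

/* Hypothetical index into reduc_lat_mult_thr, in the order the
   comment in the cost tables lists the operations.  */
enum reduc_kind { REDUC_FMA, REDUC_DOT_PROD, REDUC_SAD };

/* Values copied from the generic_cost and btver1_cost hunks.  */
static const struct unroll_tuning generic_tuning = { {8, 8, 3}, 4 };
static const struct unroll_tuning btver1_tuning = { {1, 1, 1}, 1 };

/* Return MIN (vect_unroll_limit, reduc_lat_mult_thr / n_reduc_ops),
   never less than 1.  Any extra rounding the real patch may apply
   is not modelled here.  */
static unsigned
suggest_unroll_factor (const struct unroll_tuning *t, enum reduc_kind kind,
                       unsigned n_reduc_ops)
{
  if (n_reduc_ops == 0)
    return 1;
  unsigned uf = t->reduc_lat_mult_thr[kind] / n_reduc_ops;
  if (uf > t->vect_unroll_limit)
    uf = t->vect_unroll_limit;
  return uf > 0 ? uf : 1;
}
```

With the generic values, a loop with one FMA reduction gives MIN (4, 8 / 1) = 4, which is at least consistent with the count of 4 that vect_unroll-1.c scans for, while the {1, 1, 1} / 1 entries used for the older CPUs keep the factor at 1.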



-- 
BR,
Hongtao
