> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Wednesday, April 23, 2025 9:46 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>
> Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> 
> On Wed, 23 Apr 2025, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Wednesday, April 23, 2025 9:37 AM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
> > > <richard.sandif...@arm.com>
> > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > >
> > > On Wed, 23 Apr 2025, Richard Biener wrote:
> > >
> > > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch proposes a new vector cost model called "max".  The cost
> > > > > model is an intersection between two of our existing cost models.
> > > > > Like `unlimited` it disables the costing vs scalar and assumes all
> > > > > vectorization to be profitable.
> > > > >
> > > > > But unlike unlimited it does not fully disable the vector cost model.
> > > > > That means that we still perform comparisons between vector modes.
> > > > >
> > > > > As an example, the following:
> > > > >
> > > > > void
> > > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > >      int *restrict d, int stride)
> > > > > {
> > > > >     if (stride <= 1)
> > > > >         return;
> > > > >
> > > > >     for (int i = 0; i < 3; i++)
> > > > >         {
> > > > >             int res = c[i];
> > > > >             int t = b[i * stride];
> > > > >             if (a[i] != 0)
> > > > >                 res = t * d[i];
> > > > >             c[i] = res;
> > > > >         }
> > > > > }
> > > > >
> > > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails
> > > > > to vectorize as it assumes scalar would be faster, and with
> > > > > -fvect-cost-model=unlimited it picks a vector type that's so big that
> > > > > the large sequence generated is working on mostly inactive lanes:
> > > > >
> > > > >         ...
> > > > >         and     p3.b, p3/z, p4.b, p4.b
> > > > >         whilelo p0.s, wzr, w7
> > > > >         ld1w    z23.s, p3/z, [x3, #3, mul vl]
> > > > >         ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > >         add     x0, x5, x0
> > > > >         punpklo p6.h, p6.b
> > > > >         ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > >         and     p6.b, p6/z, p0.b, p0.b
> > > > >         punpklo p4.h, p7.b
> > > > >         ld1w    z24.s, p6/z, [x3, #2, mul vl]
> > > > >         and     p4.b, p4/z, p2.b, p2.b
> > > > >         uqdecw  w6
> > > > >         ld1w    z26.s, p4/z, [x3]
> > > > >         whilelo p1.s, wzr, w6
> > > > >         mul     z27.s, p5/m, z27.s, z23.s
> > > > >         ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > >         punpkhi p7.h, p7.b
> > > > >         mul     z24.s, p5/m, z24.s, z28.s
> > > > >         and     p7.b, p7/z, p1.b, p1.b
> > > > >         mul     z26.s, p5/m, z26.s, z30.s
> > > > >         ld1w    z25.s, p7/z, [x3, #1, mul vl]
> > > > >         st1w    z27.s, p3, [x2, #3, mul vl]
> > > > >         mul     z25.s, p5/m, z25.s, z29.s
> > > > >         st1w    z24.s, p6, [x2, #2, mul vl]
> > > > >         st1w    z25.s, p7, [x2, #1, mul vl]
> > > > >         st1w    z26.s, p4, [x2]
> > > > >         ...
> > > > >
> > > > > With -fvect-cost-model=max you get more reasonable code:
> > > > >
> > > > > foo:
> > > > >         cmp     w4, 1
> > > > >         ble     .L1
> > > > >         ptrue   p7.s, vl3
> > > > >         index   z0.s, #0, w4
> > > > >         ld1b    z29.s, p7/z, [x0]
> > > > >         ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > >         ptrue   p6.b, all
> > > > >         cmpne   p7.b, p7/z, z29.b, #0
> > > > >         ld1w    z31.s, p7/z, [x3]
> > > > >         mul     z31.s, p6/m, z31.s, z30.s
> > > > >         st1w    z31.s, p7, [x2]
> > > > > .L1:
> > > > >         ret
> > > > >
> > > > > This model has been useful internally for performance exploration and
> > > > > cost-model validation.  It allows us to force realistic vectorization,
> > > > > overriding the cost model, so that we can tell whether the cost model
> > > > > is correct with respect to profitability.
> > > > >
> > > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > > arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu (-m32 and -m64)
> > > > > with no issues.
> > > > >
> > > > > Ok for master?
> > > >
> > > > Hmm.  I don't like another cost model.  Instead how about changing
> > > > 'unlimited' to still iterate through vector sizes?  Cost modeling
> > > > is really about vector vs. scalar, not vector vs. vector which is
> > > > completely under target control.  Targets should provide a way
> > > > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > > > --param or x86 has with -mprefer-vector-width.
> > > >
> > > > Of course changing 'unlimited' might result in somewhat of a testsuite
> > > > churn, but still the fix there would be to inject a proper -mXYZ
> > > > or --param to get the old behavior back (or even consider cycling
> > > > through the different aarch64-autovec-preference settings for the
> > > > testsuite).
> > >
> > > Note this will completely remove the ability to reject never profitable
> > > vectorizations, so I'm not sure that this is what you'd want in practice.
> > > You want to fix cost modeling instead.
> > >
> > > So why does it consider the scalar code to be faster with =dynamic
> > > and why do you think that's not possible to fix?  Don't we have
> > > per-loop #pragma control to force vectorization here (but maybe that
> > > has the 'unlimited' cost modeling issue)?
> > >
> >
> > The addition wasn't for GCC testsuite usage specifically.  This is about
> > testing real-world code with respect to our cost models.  In these instances
> > it's not feasible to sprinkle pragmas over every loop in every program.
> 
> Sure, but cost modeling should still be fixed then - when using
> unlimited (or max) you'd still have to sprinkle novector on the loops
> that would otherwise be slower.

Ack, that said we compare performance at the BB level as well.  So we are able
to tell improvements/regressions at a finer granularity than loop boundaries.
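
For reference, the kind of per-loop annotation being discussed would look
something like the sketch below (not part of the patch; #pragma GCC novector
keeps a loop scalar, and #pragma omp simd with -fopenmp-simd forces
vectorization of a loop):

    void
    scale (int *restrict a, int *restrict b, int n)
    {
      /* Keep this loop scalar even when the cost model is disabled.  */
      #pragma GCC novector
      for (int i = 0; i < n; i++)
        a[i] = b[i] * 3;

      /* Force vectorization of this loop regardless of profitability
         (needs -fopenmp-simd or -fopenmp).  */
      #pragma omp simd
      for (int i = 0; i < n; i++)
        b[i] = a[i] + 1;
    }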

> 
> > We also use this during uarch design validation, as it gives someone
> > working on a CPU the ability to generate vector code for design purposes,
> > regardless of what the compiler thinks is profitable on current designs.
> 
> For the latter I believe the target should provide ways to force a
> specific mode with =unlimited then, otherwise you can't reliably get
> all variants anyway but would depend on costing to pick the correct
> one out of a set of enabled modes.
> 

This doesn't quite work though.  I do believe a target param to pick a mode
or VF would be useful, and at some point Andre was working on one but never
finished it.

However, such a parameter is a global option, and if vectorization is not
possible with the specified mode it will simply fail.

Such precise control is useful for small testcases, but not for whole programs.
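
For reference, a rough sketch of what is available today with the existing
knobs mentioned above (option spellings as I recall them; note they only
restrict the set of modes considered rather than pinning an exact mode or VF):

    # AArch64: disable the scalar-vs-vector check, consider SVE modes only
    gcc -O3 -march=armv8-a+sve -fvect-cost-model=unlimited \
        --param=aarch64-autovec-preference=sve-only test.c

    # x86: cap the preferred vector width instead
    gcc -O3 -mavx512f -fvect-cost-model=unlimited \
        -mprefer-vector-width=256 test.c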

But if I'm not misunderstanding you, you're saying you're OK with changing
unlimited, and that to fix the testsuite fallout we can add params?  That said,
isn't the ability to control the vector mode useful for writing testcases for
all targets?
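
For instance (just an illustration, assuming we reuse the existing per-target
params), a testcase would need target-specific clauses merely to pin the mode:

    /* { dg-options "-O3 -fvect-cost-model=unlimited" } */
    /* { dg-additional-options "--param=aarch64-autovec-preference=sve-only" { target aarch64*-*-* } } */
    /* { dg-additional-options "-mprefer-vector-width=128" { target i?86-*-* x86_64-*-* } } */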

Thanks,
Tamar

> Richard.
> 
> > Thanks,
> > Tamar
> >
> > > Richard.
> > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >       * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
> > > > >       * doc/invoke.texi: Document it.
> > > > >       * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > > > >       * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> > > > >       vect_peeling_hash_choose_best_peeling,
> > > > >       vect_enhance_data_refs_alignment): Use it.
> > > > >       * tree-vect-loop.cc (vect_analyze_loop_costing,
> > > > >       vect_estimate_min_profitable_iters): Likewise.
> > > > >
> > > > > ---
> > > > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > > > index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> > > > > --- a/gcc/common.opt
> > > > > +++ b/gcc/common.opt
> > > > > @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> > > > >
> > > > >  fvect-cost-model=
> > > > >  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> > > > > --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]       Specifies the cost model for vectorization.
> > > > > +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap]       Specifies the cost model for vectorization.
> > > > >
> > > > >  fsimd-cost-model=
> > > > >  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> > > > > --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap]       Specifies the vectorization cost model for code marked with a simd directive.
> > > > > +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap]       Specifies the vectorization cost model for code marked with a simd directive.
> > > > >
> > > > >  Enum
> > > > >  Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> > > > > @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> > > > >  EnumValue
> > > > >  Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> > > > >
> > > > > +EnumValue
> > > > > +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> > > > > +
> > > > >  EnumValue
> > > > >  Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
> > > > >
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> > > > > --- a/gcc/doc/invoke.texi
> > > > > +++ b/gcc/doc/invoke.texi
> > > > > @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> > > > >  to be profitable while with the @samp{dynamic} model a runtime check
> > > > >  guards the vectorized code-path to enable it only for iteration
> > > > >  counts that will likely execute faster than when executing the original
> > > > > -scalar loop.  The @samp{cheap} model disables vectorization of
> > > > > -loops where doing so would be cost prohibitive for example due to
> > > > > -required runtime checks for data dependence or alignment but otherwise
> > > > > +scalar loop.  The @samp{max} model similarly to the @samp{unlimited} model
> > > > > +assumes all vector code is profitable over scalar within loops but does not
> > > > > +disable the vector to vector costing.  The @samp{cheap} model disables
> > > > > +vectorization of loops where doing so would be cost prohibitive for example due
> > > > > +to required runtime checks for data dependence or alignment but otherwise
> > > > >  is equal to the @samp{dynamic} model.  The @samp{very-cheap} model disables
> > > > >  vectorization of loops when any runtime check for data dependence or alignment
> > > > >  is required, it also disables vectorization of epilogue loops but otherwise is
> > > > > diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> > > > > index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> > > > > --- a/gcc/flag-types.h
> > > > > +++ b/gcc/flag-types.h
> > > > > @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> > > > >  /* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
> > > > >     the most conservative to the least conservative.  */
> > > > >  enum vect_cost_model {
> > > > > -  VECT_COST_MODEL_VERY_CHEAP = -3,
> > > > > -  VECT_COST_MODEL_CHEAP = -2,
> > > > > -  VECT_COST_MODEL_DYNAMIC = -1,
> > > > > +  VECT_COST_MODEL_VERY_CHEAP = -4,
> > > > > +  VECT_COST_MODEL_CHEAP = -3,
> > > > > +  VECT_COST_MODEL_DYNAMIC = -2,
> > > > > +  VECT_COST_MODEL_MAX = -1,
> > > > >    VECT_COST_MODEL_UNLIMITED = 0,
> > > > >    VECT_COST_MODEL_DEFAULT = 1
> > > > >  };
> > > > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > > > index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> > > > > --- a/gcc/tree-vect-data-refs.cc
> > > > > +++ b/gcc/tree-vect-data-refs.cc
> > > > > @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
> > > > >    /* If this DR is not supported with unknown misalignment then bias
> > > > >       this slot when the cost model is disabled.  */
> > > > >    if (!supportable_if_not_aligned
> > > > > -      && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > +      && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > +       || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > +             == VECT_COST_MODEL_MAX))
> > > > >      slot->count += VECT_MAX_COST;
> > > > >  }
> > > > >
> > > > > @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
> > > > >     res.peel_info.dr_info = NULL;
> > > > >     res.vinfo = loop_vinfo;
> > > > >
> > > > > -   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > +   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > +     && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
> > > > >       {
> > > > >         res.inside_cost = INT_MAX;
> > > > >         res.outside_cost = INT_MAX;
> > > > > @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> > > > >                   We do this automatically for cost model, since we calculate
> > > > >                cost for every peeling option.  */
> > > > >             poly_uint64 nscalars = npeel_tmp;
> > > > > -              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > +              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > +               || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> > > > >               {
> > > > >                 poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > > >                 unsigned group_size = 1;
> > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> > > > > --- a/gcc/tree-vect-loop.cc
> > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > >                                     &min_profitable_estimate,
> > > > >                                     suggested_unroll_factor);
> > > > >
> > > > > -  if (min_profitable_iters < 0)
> > > > > +  if (min_profitable_iters < 0
> > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > >      {
> > > > >        if (dump_enabled_p ())
> > > > >       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > >    LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
> > > > >
> > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > -      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> > > > > +      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > >      {
> > > > >        if (dump_enabled_p ())
> > > > >       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > >       estimated_niter = likely_max_stmt_executions_int (loop);
> > > > >      }
> > > > >    if (estimated_niter != -1
> > > > > +      && loop_cost_model (loop) != VECT_COST_MODEL_MAX
> > > > >        && ((unsigned HOST_WIDE_INT) estimated_niter
> > > > >         < MAX (th, (unsigned) min_profitable_estimate)))
> > > > >      {
> > > > > @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > > > >    vector_costs *target_cost_data = loop_vinfo->vector_costs;
> > > > >
> > > > >    /* Cost model disabled.  */
> > > > > -  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > +   if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > >      {
> > > > >        if (dump_enabled_p ())
> > > > >       dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
