On Wed, 23 Apr 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Wednesday, April 23, 2025 9:46 AM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>; Richard Sandiford
> > <[email protected]>
> > Subject: RE: [PATCH]middle-end: Add new "max" vector cost model
> >
> > On Wed, 23 Apr 2025, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Wednesday, April 23, 2025 9:37 AM
> > > > To: Tamar Christina <[email protected]>
> > > > Cc: [email protected]; nd <[email protected]>; Richard Sandiford
> > > > <[email protected]>
> > > > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> > > >
> > > > On Wed, 23 Apr 2025, Richard Biener wrote:
> > > >
> > > > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > > This patch proposes a new vector cost model called "max". The cost
> > > > > > > model is an intersection between two of our existing cost models.
> > > > > > > Like `unlimited` it disables the costing vs scalar and assumes all
> > > > > > > vectorization to be profitable.
> > > > > >
> > > > > > But unlike unlimited it does not fully disable the vector cost
> > > > > > model. That
> > > > > > means that we still perform comparisons between vector modes.
> > > > > >
> > > > > > As an example, the following:
> > > > > >
> > > > > > void
> > > > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > > > > int *restrict d, int stride)
> > > > > > {
> > > > > > if (stride <= 1)
> > > > > > return;
> > > > > >
> > > > > > for (int i = 0; i < 3; i++)
> > > > > > {
> > > > > > int res = c[i];
> > > > > > int t = b[i * stride];
> > > > > > if (a[i] != 0)
> > > > > > res = t * d[i];
> > > > > > c[i] = res;
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic
> > > > > > fails to
> > > > > > vectorize as it assumes scalar would be faster, and with
> > > > > > -fvect-cost-model=unlimited it picks a vector type that's so big
> > > > > > that the large
> > > > > > sequence generated is working on mostly inactive lanes:
> > > > > >
> > > > > > ...
> > > > > > and p3.b, p3/z, p4.b, p4.b
> > > > > > whilelo p0.s, wzr, w7
> > > > > > ld1w z23.s, p3/z, [x3, #3, mul vl]
> > > > > > ld1w z28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > > > add x0, x5, x0
> > > > > > punpklo p6.h, p6.b
> > > > > > ld1w z27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > > > and p6.b, p6/z, p0.b, p0.b
> > > > > > punpklo p4.h, p7.b
> > > > > > ld1w z24.s, p6/z, [x3, #2, mul vl]
> > > > > > and p4.b, p4/z, p2.b, p2.b
> > > > > > uqdecw w6
> > > > > > ld1w z26.s, p4/z, [x3]
> > > > > > whilelo p1.s, wzr, w6
> > > > > > mul z27.s, p5/m, z27.s, z23.s
> > > > > > ld1w z29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > > > punpkhi p7.h, p7.b
> > > > > > mul z24.s, p5/m, z24.s, z28.s
> > > > > > and p7.b, p7/z, p1.b, p1.b
> > > > > > mul z26.s, p5/m, z26.s, z30.s
> > > > > > ld1w z25.s, p7/z, [x3, #1, mul vl]
> > > > > > st1w z27.s, p3, [x2, #3, mul vl]
> > > > > > mul z25.s, p5/m, z25.s, z29.s
> > > > > > st1w z24.s, p6, [x2, #2, mul vl]
> > > > > > st1w z25.s, p7, [x2, #1, mul vl]
> > > > > > st1w z26.s, p4, [x2]
> > > > > > ...
> > > > > >
> > > > > > With -fvect-cost-model=max you get more reasonable code:
> > > > > >
> > > > > > foo:
> > > > > > cmp w4, 1
> > > > > > ble .L1
> > > > > > ptrue p7.s, vl3
> > > > > > index z0.s, #0, w4
> > > > > > ld1b z29.s, p7/z, [x0]
> > > > > > ld1w z30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > > > ptrue p6.b, all
> > > > > > cmpne p7.b, p7/z, z29.b, #0
> > > > > > ld1w z31.s, p7/z, [x3]
> > > > > > mul z31.s, p6/m, z31.s, z30.s
> > > > > > st1w z31.s, p7, [x2]
> > > > > > .L1:
> > > > > > ret
> > > > > >
> > > > > > > This model has been useful internally for performance exploration
> > > > > > > and cost-model validation. It allows us to force realistic
> > > > > > > vectorization, overriding the cost model, so we can tell whether
> > > > > > > the model is correct wrt profitability.
> > > > > >
> > > > > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > > > -m32, -m64 and no issues.
> > > > > >
> > > > > > Ok for master?
> > > > >
> > > > > Hmm. I don't like another cost model. Instead how about changing
> > > > > 'unlimited' to still iterate through vector sizes? Cost modeling
> > > > > is really about vector vs. scalar, not vector vs. vector which is
> > > > > completely under target control. Targets should provide a way
> > > > > > to limit iteration, like aarch64 has with the
> > > > > > aarch64-autovec-preference --param or x86 has with
> > > > > > -mprefer-vector-width.
> > > > >
> > > > > Of course changing 'unlimited' might result in somewhat of a testsuite
> > > > > churn, but still the fix there would be to inject a proper -mXYZ
> > > > > or --param to get the old behavior back (or even consider cycling
> > > > > through the different aarch64-autovec-preference settings for the
> > > > > testsuite).
> > > >
> > > > > Note this will completely remove the ability to reject never profitable
> > > > > vectorizations, so I'm not sure that this is what you'd want in
> > > > > practice. You want to fix cost modeling instead.
> > > >
> > > > So why does it consider the scalar code to be faster with =dynamic
> > > > and why do you think that's not possible to fix? Don't we have
> > > > per-loop #pragma control to force vectorization here (but maybe that
> > > > has the 'unlimited' cost modeling issue)?
> > > >
> > >
> > > > The addition wasn't for the GCC testsuite usage specifically. This is
> > > > about testing real world code wrt our cost models. In these instances
> > > > it's not feasible to sprinkle pragmas over every loop in every program.
> >
> > Sure, but still cost modeling should be fixed then - with using
> > unlimited (or max) you'd still have to sprinkle novector on the loops
> > that will be slower otherwise.
>
> Ack, that said we compare performance at the BB level as well. So we are able
> to tell improvements/regressions to a lower-level extent than loop boundaries.
>
> >
> > > We also use this during uarch design validation, as e.g. it gives someone
> > > working on a CPU the ability to generate vector code for design purposes
> > > regardless of what the compiler thinks is profitable on current designs.
> >
> > For the latter I believe the target should provide ways to force a
> > specific mode with =unlimited then, otherwise you can't reliably get
> > all variants anyway but would depend on costing to pick the correct
> > one out of a set of enabled modes.
> >
>
> This doesn't quite work though. I do believe a target param to pick a mode
> or VF is useful. And at some point Andre was working on one but never
> finished it.
>
> However such a parameter is a global option, and if vectorization is not
> possible with the specified mode it'll fail.
>
> Such precise control is useful for small testcases, but not "programs".
>
> But if I'm not misunderstanding you, you're saying you're ok with
> changing unlimited, and to fix testsuite fallout we can add params?
> That said isn't the ability to control the vector mode useful for
> writing testcases for all targets?

I'm OK with changing unlimited so that the target controls whether and
how it iterates over modes. On a similar note, we might want to change
that iteration so the target picks the next mode to try, rather than
having the up-front vector of modes and the vectorizer somehow figuring
out which modes are useful.
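
To make the second idea a bit more concrete, here is a rough standalone
sketch. It is not GCC code: Mode, Target, next_mode and choose_mode are
hypothetical names made up for illustration, the mode names are only
labels, and the costs are toy numbers. It assumes the vectorizer asks the
target for the next mode after each attempt, skips the vector vs. scalar
comparison (as with unlimited), and keeps the cheapest mode that worked.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a vector mode; not GCC's machine_mode.
struct Mode {
  std::string name;
  unsigned cost;    // toy cost estimate of the loop in this mode
  bool supported;   // whether the loop can be vectorized in this mode
};

// Hypothetical target interface.  Instead of handing the vectorizer an
// up-front list, the target is asked for the next mode after each attempt.
struct Target {
  std::vector<Mode> modes;      // the target's private preference order
  bool stop_at_first_success;   // models a target --param / -m option

  // PREV is the index of the mode just tried (empty on the first call),
  // PREV_OK says whether that attempt succeeded.  Returns the index of
  // the next mode to try, or an empty optional to stop iterating.
  std::optional<std::size_t> next_mode(std::optional<std::size_t> prev,
                                       bool prev_ok) const {
    if (!prev) {
      if (modes.empty())
        return std::nullopt;
      return std::size_t{0};
    }
    if (prev_ok && stop_at_first_success)
      return std::nullopt;
    std::size_t next = *prev + 1;
    if (next >= modes.size())
      return std::nullopt;
    return next;
  }
};

// Vectorizer side: no vector vs. scalar check at all, but the iteration
// is driven by the target and the cheapest successful mode is kept.
std::optional<Mode> choose_mode(const Target &target) {
  std::optional<Mode> best;
  std::optional<std::size_t> cur = target.next_mode(std::nullopt, false);
  while (cur) {
    const Mode &m = target.modes[*cur];
    bool ok = m.supported;
    if (ok && (!best || m.cost < best->cost))
      best = m;
    cur = target.next_mode(cur, ok);
  }
  return best;
}
```

With stop_at_first_success set, the first mode that vectorizes wins, which
is roughly what today's unlimited does. With it cleared, every mode the
target offers is tried and compared against the others, which is the part
of "max" that seems worth keeping.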

Richard.

> Thanks,
> Tamar
>
> > Richard.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > Richard.
> > > > >
> > > > > > Thanks,
> > > > > > Tamar
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > * common.opt (vect-cost-model, simd-cost-model): Add max cost
> > > > > > model.
> > > > > > * doc/invoke.texi: Document it.
> > > > > > * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > > > > > * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> > > > > > vect_peeling_hash_choose_best_peeling,
> > > > > > vect_enhance_data_refs_alignment): Use it.
> > > > > > * tree-vect-loop.cc (vect_analyze_loop_costing,
> > > > > > vect_estimate_min_profitable_iters): Likewise.
> > > > > >
> > > > > > ---
> > > > > > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > > > > > index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> > > > > > > --- a/gcc/common.opt
> > > > > > > +++ b/gcc/common.opt
> > > > > > > @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> > > > > > >
> > > > > > > fvect-cost-model=
> > > > > > > Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> > > > > > > --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> > > > > > > +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> > > > > > >
> > > > > > > fsimd-cost-model=
> > > > > > > Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> > > > > > > --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> > > > > > > +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> > > > > > >
> > > > > > > Enum
> > > > > > > Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> > > > > > > @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> > > > > > > EnumValue
> > > > > > > Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> > > > > > >
> > > > > > > +EnumValue
> > > > > > > +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> > > > > > > +
> > > > > > > EnumValue
> > > > > > > Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
> > > > > >
> > > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > > > index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> > > > > > > --- a/gcc/doc/invoke.texi
> > > > > > > +++ b/gcc/doc/invoke.texi
> > > > > > > @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> > > > > > > to be profitable while with the @samp{dynamic} model a runtime check
> > > > > > > guards the vectorized code-path to enable it only for iteration
> > > > > > > counts that will likely execute faster than when executing the original
> > > > > > > -scalar loop. The @samp{cheap} model disables vectorization of
> > > > > > > -loops where doing so would be cost prohibitive for example due to
> > > > > > > -required runtime checks for data dependence or alignment but otherwise
> > > > > > > +scalar loop. The @samp{max} model similarly to the @samp{unlimited} model
> > > > > > > +assumes all vector code is profitable over scalar within loops but does not
> > > > > > > +disable the vector to vector costing. The @samp{cheap} model disables
> > > > > > > +vectorization of loops where doing so would be cost prohibitive for example due
> > > > > > > +to required runtime checks for data dependence or alignment but otherwise
> > > > > > > is equal to the @samp{dynamic} model. The @samp{very-cheap} model disables
> > > > > > > vectorization of loops when any runtime check for data dependence or alignment
> > > > > > > is required, it also disables vectorization of epilogue loops but otherwise is
> > > > > > > diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> > > > > > > index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> > > > > > > --- a/gcc/flag-types.h
> > > > > > > +++ b/gcc/flag-types.h
> > > > > > > @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> > > > > > > /* Vectorizer cost-model. Except for DEFAULT, the values are ordered from
> > > > > > > the most conservative to the least conservative. */
> > > > > > enum vect_cost_model {
> > > > > > - VECT_COST_MODEL_VERY_CHEAP = -3,
> > > > > > - VECT_COST_MODEL_CHEAP = -2,
> > > > > > - VECT_COST_MODEL_DYNAMIC = -1,
> > > > > > + VECT_COST_MODEL_VERY_CHEAP = -4,
> > > > > > + VECT_COST_MODEL_CHEAP = -3,
> > > > > > + VECT_COST_MODEL_DYNAMIC = -2,
> > > > > > + VECT_COST_MODEL_MAX = -1,
> > > > > > VECT_COST_MODEL_UNLIMITED = 0,
> > > > > > VECT_COST_MODEL_DEFAULT = 1
> > > > > > };
> > > > > > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > > > > > index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> > > > > > > --- a/gcc/tree-vect-data-refs.cc
> > > > > > > +++ b/gcc/tree-vect-data-refs.cc
> > > > > > > @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
> > > > > > > /* If this DR is not supported with unknown misalignment then bias
> > > > > > > this slot when the cost model is disabled. */
> > > > > > > if (!supportable_if_not_aligned
> > > > > > > - && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > > + && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > > + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > > + == VECT_COST_MODEL_MAX))
> > > > > > > slot->count += VECT_MAX_COST;
> > > > > > > }
> > > > > > >
> > > > > > > @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
> > > > > > > res.peel_info.dr_info = NULL;
> > > > > > > res.vinfo = loop_vinfo;
> > > > > > >
> > > > > > > - if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > > + if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > > + && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
> > > > > > > {
> > > > > > > res.inside_cost = INT_MAX;
> > > > > > > res.outside_cost = INT_MAX;
> > > > > > > @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> > > > > > > We do this automatically for cost model, since we calculate
> > > > > > > cost for every peeling option. */
> > > > > > > poly_uint64 nscalars = npeel_tmp;
> > > > > > > - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > > + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > > > > + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> > > > > > > {
> > > > > > > poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > > > > > unsigned group_size = 1;
> > > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > > > index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> > > > > > > --- a/gcc/tree-vect-loop.cc
> > > > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > > > @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > > > &min_profitable_estimate,
> > > > > > > suggested_unroll_factor);
> > > > > > >
> > > > > > > - if (min_profitable_iters < 0)
> > > > > > > + if (min_profitable_iters < 0
> > > > > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > > > > {
> > > > > > > if (dump_enabled_p ())
> > > > > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > > > @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > > > LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
> > > > > > >
> > > > > > > if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > > > - && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> > > > > > > + && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> > > > > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > > > > {
> > > > > > > if (dump_enabled_p ())
> > > > > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > > > > @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > > > > estimated_niter = likely_max_stmt_executions_int (loop);
> > > > > > > }
> > > > > > > if (estimated_niter != -1
> > > > > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX
> > > > > > > && ((unsigned HOST_WIDE_INT) estimated_niter
> > > > > > > < MAX (th, (unsigned) min_profitable_estimate)))
> > > > > > > {
> > > > > > > @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > > > > > > vector_costs *target_cost_data = loop_vinfo->vector_costs;
> > > > > > >
> > > > > > > /* Cost model disabled. */
> > > > > > > - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > > + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > > > {
> > > > > > > if (dump_enabled_p ())
> > > > > > > dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <[email protected]>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)