On Wed, 23 Apr 2025, Tamar Christina wrote:
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Wednesday, April 23, 2025 9:37 AM
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; nd <[email protected]>; Richard Sandiford
> > <[email protected]>
> > Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
> >
> > On Wed, 23 Apr 2025, Richard Biener wrote:
> >
> > > On Wed, 23 Apr 2025, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch proposes a new vector cost model called "max". The cost
> > > > model is an intersection between two of our existing cost models.
> > > > Like `unlimited` it disables the costing vs scalar and assumes all
> > > > vectorization to be profitable.
> > > >
> > > > But unlike unlimited it does not fully disable the vector cost model.
> > > > That means that we still perform comparisons between vector modes.
> > > >
> > > > As an example, the following:
> > > >
> > > > void
> > > > foo (char *restrict a, int *restrict b, int *restrict c,
> > > >      int *restrict d, int stride)
> > > > {
> > > >   if (stride <= 1)
> > > >     return;
> > > >
> > > >   for (int i = 0; i < 3; i++)
> > > >     {
> > > >       int res = c[i];
> > > >       int t = b[i * stride];
> > > >       if (a[i] != 0)
> > > >         res = t * d[i];
> > > >       c[i] = res;
> > > >     }
> > > > }
> > > >
> > > > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> > > > vectorize as it assumes scalar would be faster, and with
> > > > -fvect-cost-model=unlimited it picks a vector type that's so big that
> > > > the large sequence generated is working on mostly inactive lanes:
> > > >
> > > > ...
> > > > and p3.b, p3/z, p4.b, p4.b
> > > > whilelo p0.s, wzr, w7
> > > > ld1w z23.s, p3/z, [x3, #3, mul vl]
> > > > ld1w z28.s, p0/z, [x5, z31.s, sxtw 2]
> > > > add x0, x5, x0
> > > > punpklo p6.h, p6.b
> > > > ld1w z27.s, p4/z, [x0, z31.s, sxtw 2]
> > > > and p6.b, p6/z, p0.b, p0.b
> > > > punpklo p4.h, p7.b
> > > > ld1w z24.s, p6/z, [x3, #2, mul vl]
> > > > and p4.b, p4/z, p2.b, p2.b
> > > > uqdecw w6
> > > > ld1w z26.s, p4/z, [x3]
> > > > whilelo p1.s, wzr, w6
> > > > mul z27.s, p5/m, z27.s, z23.s
> > > > ld1w z29.s, p1/z, [x4, z31.s, sxtw 2]
> > > > punpkhi p7.h, p7.b
> > > > mul z24.s, p5/m, z24.s, z28.s
> > > > and p7.b, p7/z, p1.b, p1.b
> > > > mul z26.s, p5/m, z26.s, z30.s
> > > > ld1w z25.s, p7/z, [x3, #1, mul vl]
> > > > st1w z27.s, p3, [x2, #3, mul vl]
> > > > mul z25.s, p5/m, z25.s, z29.s
> > > > st1w z24.s, p6, [x2, #2, mul vl]
> > > > st1w z25.s, p7, [x2, #1, mul vl]
> > > > st1w z26.s, p4, [x2]
> > > > ...
> > > >
> > > > With -fvect-cost-model=max you get more reasonable code:
> > > >
> > > > foo:
> > > > cmp w4, 1
> > > > ble .L1
> > > > ptrue p7.s, vl3
> > > > index z0.s, #0, w4
> > > > ld1b z29.s, p7/z, [x0]
> > > > ld1w z30.s, p7/z, [x1, z0.s, sxtw 2]
> > > > ptrue p6.b, all
> > > > cmpne p7.b, p7/z, z29.b, #0
> > > > ld1w z31.s, p7/z, [x3]
> > > > mul z31.s, p6/m, z31.s, z30.s
> > > > st1w z31.s, p7, [x2]
> > > > .L1:
> > > > ret
> > > >
> > > > This model has been useful internally for performance exploration and
> > > > cost-model validation. It allows us to force realistic vectorization,
> > > > overriding the cost model, to be able to tell whether it's correct wrt
> > > > profitability.
> > > >
> > > > Bootstrapped and regtested on aarch64-none-linux-gnu,
> > > > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
> > > > -m32, -m64 with no issues.
> > > >
> > > > Ok for master?
> > >
> > > Hmm. I don't like another cost model. Instead how about changing
> > > 'unlimited' to still iterate through vector sizes? Cost modeling
> > > is really about vector vs. scalar, not vector vs. vector which is
> > > completely under target control. Targets should provide a way
> > > to limit iteration, like aarch64 has with the aarch64-autovec-preference
> > > --param or x86 has with -mprefer-vector-width.
> > >
> > > Of course changing 'unlimited' might result in somewhat of a testsuite
> > > churn, but still the fix there would be to inject a proper -mXYZ
> > > or --param to get the old behavior back (or even consider cycling
> > > through the different aarch64-autovec-preference settings for the
> > > testsuite).
> >
> > Note this will completely remove the ability to reject never profitable
> > vectorizations, so I'm not sure that this is what you'd want in practice.
> > You want to fix cost modeling instead.
> >
> > So why does it consider the scalar code to be faster with =dynamic
> > and why do you think that's not possible to fix? Don't we have
> > per-loop #pragma control to force vectorization here (but maybe that
> > has the 'unlimited' cost modeling issue)?
> >
>
> The addition wasn't for the GCC testsuite usage specifically. This is about
> testing real world code wrt our cost models. In these instances it's not
> feasible to sprinkle pragmas over every loop in every program.
Sure, but then the cost modeling should still be fixed: with
unlimited (or max) you'd still have to sprinkle novector on the loops
that would otherwise be slower.
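
To make the "sprinkle novector" point concrete, here is a minimal sketch.
It assumes a compiler that knows #pragma GCC novector (GCC 14 or later, as
far as I know); the function name scale3 and the loop body are made up for
illustration and are not part of the patch.

```c
/* Hypothetical example: a short strided loop that we assume would be
   slower when vectorized.  With -fvect-cost-model=unlimited (or the
   proposed max) the vectorizer no longer rejects it on cost grounds,
   so it has to be opted out by hand.  */
void
scale3 (int *restrict c, const int *restrict b, int stride)
{
  /* Compilers that do not know this pragma ignore it, possibly with a
     -Wunknown-pragmas warning; the loop result is the same either way.  */
#pragma GCC novector
  for (int i = 0; i < 3; i++)
    c[i] = b[i * stride] * 2;
}
```

Called as scale3 (c, b, 2) with b = { 1, 0, 2, 0, 3, 0 } this stores 2, 4, 6
into c whether or not the loop is vectorized; the pragma only changes what
the vectorizer is allowed to do with it.
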
> We also use this during uarch design validation, as e.g. it gives someone
> working on a CPU the ability to generate vector code for design purposes
> regardless of what the compiler thinks is profitable on current designs.
For the latter I believe the target should provide ways to force a
specific mode with =unlimited then, otherwise you can't reliably get
all variants anyway but would depend on costing to pick the correct
one out of a set of enabled modes.
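
As a toy illustration of that dependence on costing, here is a sketch of
the three behaviours discussed in this thread. Everything in it is
hypothetical: the enum, the struct, toy_pick_mode and the cost numbers are
invented for the example and do not mirror how tree-vect-loop.cc is
actually structured.

```c
#include <stddef.h>

/* Toy model only: names and numbers are hypothetical, not GCC internals.  */
enum toy_model { TOY_DYNAMIC, TOY_MAX, TOY_UNLIMITED };

struct toy_candidate
{
  const char *mode;   /* Label for a vector mode the target enabled.  */
  unsigned cost;      /* Assumed cost of the vector loop in that mode.  */
};

/* Return the index of the chosen candidate, or -1 to stay scalar.
   CANDS is assumed to be in the target's preferred order.  */
int
toy_pick_mode (enum toy_model model, const struct toy_candidate *cands,
	       size_t n, unsigned scalar_cost)
{
  if (n == 0)
    return -1;

  /* unlimited: no costing at all, take the first mode on offer.  */
  if (model == TOY_UNLIMITED)
    return 0;

  /* max and dynamic: still compare the vector modes with each other.  */
  size_t best = 0;
  for (size_t i = 1; i < n; i++)
    if (cands[i].cost < cands[best].cost)
      best = i;

  /* Only dynamic compares the winner against the scalar loop.  */
  if (model == TOY_DYNAMIC && cands[best].cost >= scalar_cost)
    return -1;

  return (int) best;
}
```

With two candidates { "wide", 40 } and { "narrow", 12 } and a scalar cost of
10, the unlimited case returns the first candidate, the max case returns the
cheaper second one, and the dynamic case stays scalar. Which mode the max
case yields is decided entirely by the cost numbers, so it cannot be used to
request one particular mode.
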
Richard.
> Thanks,
> Tamar
>
> > Richard.
> >
> > > Richard.
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * common.opt (vect-cost-model, simd-cost-model): Add max cost
> > > > model.
> > > > * doc/invoke.texi: Document it.
> > > > * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
> > > > * tree-vect-data-refs.cc (vect_peeling_hash_insert,
> > > > vect_peeling_hash_choose_best_peeling,
> > > > vect_enhance_data_refs_alignment): Use it.
> > > > * tree-vect-loop.cc (vect_analyze_loop_costing,
> > > > vect_estimate_min_profitable_iters): Likewise.
> > > >
> > > > ---
> > > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > > index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
> > > > --- a/gcc/common.opt
> > > > +++ b/gcc/common.opt
> > > > @@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
> > > >
> > > > fvect-cost-model=
> > > > Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> > > > --fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> > > > +-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
> > > >
> > > > fsimd-cost-model=
> > > > Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> > > > --fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> > > > +-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
> > > >
> > > > Enum
> > > > Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> > > > @@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
> > > > EnumValue
> > > > Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
> > > >
> > > > +EnumValue
> > > > +Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
> > > > +
> > > > EnumValue
> > > > Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
> > > >
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
> > > > to be profitable while with the @samp{dynamic} model a runtime check
> > > > guards the vectorized code-path to enable it only for iteration
> > > > counts that will likely execute faster than when executing the original
> > > > -scalar loop. The @samp{cheap} model disables vectorization of
> > > > -loops where doing so would be cost prohibitive for example due to
> > > > -required runtime checks for data dependence or alignment but otherwise
> > > > +scalar loop. The @samp{max} model similarly to the @samp{unlimited} model
> > > > +assumes all vector code is profitable over scalar within loops but does not
> > > > +disable the vector to vector costing. The @samp{cheap} model disables
> > > > +vectorization of loops where doing so would be cost prohibitive for example due
> > > > +to required runtime checks for data dependence or alignment but otherwise
> > > > is equal to the @samp{dynamic} model. The @samp{very-cheap} model disables
> > > > vectorization of loops when any runtime check for data dependence or alignment
> > > > is required, it also disables vectorization of epilogue loops but otherwise is
> > > > diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> > > > index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
> > > > --- a/gcc/flag-types.h
> > > > +++ b/gcc/flag-types.h
> > > > @@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
> > > > /* Vectorizer cost-model. Except for DEFAULT, the values are ordered from
> > > > the most conservative to the least conservative. */
> > > > enum vect_cost_model {
> > > > - VECT_COST_MODEL_VERY_CHEAP = -3,
> > > > - VECT_COST_MODEL_CHEAP = -2,
> > > > - VECT_COST_MODEL_DYNAMIC = -1,
> > > > + VECT_COST_MODEL_VERY_CHEAP = -4,
> > > > + VECT_COST_MODEL_CHEAP = -3,
> > > > + VECT_COST_MODEL_DYNAMIC = -2,
> > > > + VECT_COST_MODEL_MAX = -1,
> > > > VECT_COST_MODEL_UNLIMITED = 0,
> > > > VECT_COST_MODEL_DEFAULT = 1
> > > > };
> > > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > > index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
> > > > --- a/gcc/tree-vect-data-refs.cc
> > > > +++ b/gcc/tree-vect-data-refs.cc
> > > > @@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
> > > > /* If this DR is not supported with unknown misalignment then bias
> > > > this slot when the cost model is disabled. */
> > > > if (!supportable_if_not_aligned
> > > > - && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > + && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > + == VECT_COST_MODEL_MAX))
> > > > slot->count += VECT_MAX_COST;
> > > > }
> > > >
> > > > @@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
> > > > res.peel_info.dr_info = NULL;
> > > > res.vinfo = loop_vinfo;
> > > >
> > > > - if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > + if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > + && loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
> > > > {
> > > > res.inside_cost = INT_MAX;
> > > > res.outside_cost = INT_MAX;
> > > > @@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
> > > > We do this automatically for cost model, since we calculate
> > > > cost for every peeling option. */
> > > > poly_uint64 nscalars = npeel_tmp;
> > > > - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
> > > > + || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
> > > > {
> > > > poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > > unsigned group_size = 1;
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > &min_profitable_estimate,
> > > > suggested_unroll_factor);
> > > >
> > > > - if (min_profitable_iters < 0)
> > > > + if (min_profitable_iters < 0
> > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > {
> > > > if (dump_enabled_p ())
> > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > @@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
> > > >
> > > > if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > - && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
> > > > + && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
> > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
> > > > {
> > > > if (dump_enabled_p ())
> > > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > @@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > > estimated_niter = likely_max_stmt_executions_int (loop);
> > > > }
> > > > if (estimated_niter != -1
> > > > + && loop_cost_model (loop) != VECT_COST_MODEL_MAX
> > > > && ((unsigned HOST_WIDE_INT) estimated_niter
> > > > < MAX (th, (unsigned) min_profitable_estimate)))
> > > > {
> > > > @@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> > > > vector_costs *target_cost_data = loop_vinfo->vector_costs;
> > > >
> > > > /* Cost model disabled. */
> > > > - if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > + if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > {
> > > > if (dump_enabled_p ())
> > > > dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
> > > >
> > > >
> > > >
> > >
> > >
> >
> > --
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)