On Tue, 13 May 2025, Tamar Christina wrote:
> Hi All,
>
> This patch adds a new param vect-scalar-cost-multiplier to scale the scalar
> costing during vectorization. With the dynamic cost model, a high enough
> value effectively disables the costing against scalar code, so all
> vectorization is assumed to be profitable.
>
> This is similar to using the unlimited cost model, but unlike unlimited it
> does not fully disable the vector cost model. That means that we still
> perform comparisons between vector modes. And it means it also still does
> costing for alias analysis.
>
> As an example, the following:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
>      int *restrict d, int stride)
> {
>   if (stride <= 1)
>     return;
>
>   for (int i = 0; i < 3; i++)
>     {
>       int res = c[i];
>       int t = b[i * stride];
>       if (a[i] != 0)
>         res = t * d[i];
>       c[i] = res;
>     }
> }
>
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the
> large sequence generated is working on mostly inactive lanes:
>
> ...
> and p3.b, p3/z, p4.b, p4.b
> whilelo p0.s, wzr, w7
> ld1w z23.s, p3/z, [x3, #3, mul vl]
> ld1w z28.s, p0/z, [x5, z31.s, sxtw 2]
> add x0, x5, x0
> punpklo p6.h, p6.b
> ld1w z27.s, p4/z, [x0, z31.s, sxtw 2]
> and p6.b, p6/z, p0.b, p0.b
> punpklo p4.h, p7.b
> ld1w z24.s, p6/z, [x3, #2, mul vl]
> and p4.b, p4/z, p2.b, p2.b
> uqdecw w6
> ld1w z26.s, p4/z, [x3]
> whilelo p1.s, wzr, w6
> mul z27.s, p5/m, z27.s, z23.s
> ld1w z29.s, p1/z, [x4, z31.s, sxtw 2]
> punpkhi p7.h, p7.b
> mul z24.s, p5/m, z24.s, z28.s
> and p7.b, p7/z, p1.b, p1.b
> mul z26.s, p5/m, z26.s, z30.s
> ld1w z25.s, p7/z, [x3, #1, mul vl]
> st1w z27.s, p3, [x2, #3, mul vl]
> mul z25.s, p5/m, z25.s, z29.s
> st1w z24.s, p6, [x2, #2, mul vl]
> st1w z25.s, p7, [x2, #1, mul vl]
> st1w z26.s, p4, [x2]
> ...
>
> With -fvect-cost-model=dynamic --param vect-scalar-cost-multiplier=200
> you get more reasonable code:
>
> foo:
> cmp w4, 1
> ble .L1
> ptrue p7.s, vl3
> index z0.s, #0, w4
> ld1b z29.s, p7/z, [x0]
> ld1w z30.s, p7/z, [x1, z0.s, sxtw 2]
> ptrue p6.b, all
> cmpne p7.b, p7/z, z29.b, #0
> ld1w z31.s, p7/z, [x3]
> mul z31.s, p6/m, z31.s, z30.s
> st1w z31.s, p7, [x2]
> .L1:
> ret
>
> This model has been useful internally for performance exploration and
> cost-model validation. It allows us to force realistic vectorization,
> overriding the cost model, so that we can tell whether the model is correct
> with respect to profitability.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf and x86_64-pc-linux-gnu (-m32, -m64) with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * params.opt (vect-scalar-cost-multiplie): New.
The name is missing its trailing 'r': vect-scalar-cost-multiplier.
> * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Use it.
> * doc/invoke.texi (vect-scalar-cost-multiplie): Document it.
Likewise.
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/sve/cost_model_16.c: New test.
>
> ---
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index f31d504f99e21ff282bd1c2bcb61e4dd0397a748..b58a971f36fce7facfab2a72b2500a471c4e0bc9 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -17273,6 +17273,10 @@ this parameter. The default value of this parameter is 50.
> @item vect-induction-float
> Enable loop vectorization of floating point inductions.
>
> +@item vect-scalar-cost-multiplier
> +Apply the given penalty to scalar loop costing during vectorization.
Apply the given multiplier to scalar loop ...
> +Increasing the cost multiplier will make vector loops more profitable.
> +
> @item vrp-block-limit
> Maximum number of basic blocks before VRP switches to a lower memory
> algorithm.
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 1f0abeccc4b9b439ad4a4add6257b4e50962863d..f89ffe8382d55a51c8573d7dd76853a05b530f90 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1253,6 +1253,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
> Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
> Enable loop vectorization of floating point inductions.
>
> +-param=vect-scalar-cost-multiplier=
> +Common Joined UInteger Var(param_vect_scalar_cost_multiplier) Init(1) IntegerRange(0, 100000) Param Optimization
> +The scaling multiplier to add to all scalar loop costing when performing vectorization profitability analysis. The default value is 1.
> +
Note this only allows whole-number scaling. May I suggest using a
percentage as the unit instead, so that the effective multiplier is
param_vect_scalar_cost_multiplier / 100?
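
To illustrate the percentage idea, here is a minimal sketch of how such
scaling could be computed with integer arithmetic. The helper name
scale_scalar_cost and its parameters are hypothetical and do not appear in
the patch; the only assumption is that 100 would mean "unchanged".

```c
/* Hypothetical helper, not part of the patch: scale a scalar cost by a
   multiplier given in percent, where 100 leaves the cost unchanged.  */
static unsigned int
scale_scalar_cost (unsigned int cost, unsigned int multiplier_percent)
{
  /* Multiply in a wider type first so the intermediate product cannot
     overflow, then divide.  The division truncates towards zero.  */
  unsigned long long scaled
    = (unsigned long long) cost * multiplier_percent / 100;
  return (unsigned int) scaled;
}
```

With this shape a value of 150 would scale a cost of 10 to 15, and a value of
50 would halve it, which a whole-number multiplier cannot express.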
> -param=vrp-block-limit=
> Common Joined UInteger Var(param_vrp_block_limit) Init(150000) Optimization Param
> Maximum number of basic blocks before VRP switches to a fast model with less memory requirements.
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c405591a101d50b4734bc6d65a6d6c01888bea48
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=armv8-a+sve -mmax-vectorization -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c,
> + int *restrict d, int stride)
> +{
> + if (stride <= 1)
> + return;
> +
> + for (int i = 0; i < 3; i++)
> + {
> + int res = c[i];
> + int t = b[i * stride];
> + if (a[i] != 0)
> + res = t * d[i];
> + c[i] = res;
> + }
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a0d933aa100c5dd5f0fc78f1eec71a032df29325 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4646,7 +4646,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> TODO: Consider assigning different costs to different scalar
> statements. */
>
> - scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ();
> + scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ()
> + * param_vect_scalar_cost_multiplier;
>
> /* Add additional cost for the peeled instructions in prologue and epilogue
> loop. (For fully-masked loops there will be no peeling.)
>
>
>
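
For readers following the thread, a toy model of why scaling the scalar
per-iteration cost tips the decision towards vectorization. This is a
deliberately simplified sketch, not the computation done in
vect_estimate_min_profitable_iters; the function name, its parameters and the
cost figures used with it are all made up for illustration.

```c
#include <stdbool.h>

/* Toy stand-in for a profitability check (hypothetical, not GCC code).
   The scalar side costs scalar_iter_cost * multiplier per iteration; the
   vector side costs vec_body_cost per vector iteration plus a one-off
   vec_outside_cost.  vf is the vectorization factor and must be nonzero.  */
static bool
toy_vector_profitable (unsigned int scalar_iter_cost, unsigned int multiplier,
                       unsigned int vec_body_cost,
                       unsigned int vec_outside_cost,
                       unsigned int niters, unsigned int vf)
{
  unsigned long long scalar_total
    = (unsigned long long) scalar_iter_cost * multiplier * niters;

  /* Round up: a partial final vector iteration still costs a full one.  */
  unsigned long long vec_iters = ((unsigned long long) niters + vf - 1) / vf;
  unsigned long long vector_total
    = vec_iters * vec_body_cost + vec_outside_cost;

  return vector_total < scalar_total;
}
```

With made-up costs for a 3-iteration loop (scalar 4 per iteration, vector
body 20, outside cost 10, vf 4), a multiplier of 1 rejects the vector loop,
while a multiplier of 200 accepts it. The comparison between candidate vector
modes is untouched by the multiplier, which matches the behaviour described
in the patch.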
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)