On Tue, 13 May 2025, Tamar Christina wrote:

> Hi All,
> 
> This patch adds a new param vect-scalar-cost-multiplier to scale the scalar
> costing during vectorization.  If the multiplier is set high enough then,
> when using the dynamic cost model, it effectively disables the costing
> against scalar and assumes all vectorization to be profitable.
> 
> This is similar to using the unlimited cost model, but unlike unlimited it
> does not fully disable the vector cost model.  That means we still perform
> comparisons between vector modes, and we still do the costing for alias
> analysis.
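
For illustration, a self-contained toy sketch of the kind of comparison the
multiplier biases; this is not the actual vectorizer code, and the costs,
names and the simplified check below are made up:

#include <stdio.h>

/* Toy model: compare the cost of VF scalar iterations, scaled by the
   multiplier, against the cost of one vector iteration.  */
static int
vector_loop_profitable_p (unsigned scalar_iter_cost,
                          unsigned vector_iter_cost,
                          unsigned vf, unsigned multiplier)
{
  return vector_iter_cost < scalar_iter_cost * multiplier * vf;
}

int
main (void)
{
  /* With the default multiplier of 1 the vector body looks too expensive;
     a large multiplier effectively disables the comparison against scalar.  */
  printf ("multiplier 1:   profitable = %d\n",
          vector_loop_profitable_p (4, 20, 4, 1));
  printf ("multiplier 200: profitable = %d\n",
          vector_loop_profitable_p (4, 20, 4, 200));
  return 0;
}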
> 
> As an example, the following:
> 
> void
> foo (char *restrict a, int *restrict b, int *restrict c,
>      int *restrict d, int stride)
> {
>     if (stride <= 1)
>         return;
> 
>     for (int i = 0; i < 3; i++)
>         {
>             int res = c[i];
>             int t = b[i * stride];
>             if (a[i] != 0)
>                 res = t * d[i];
>             c[i] = res;
>         }
> }
> 
> compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
> vectorize as it assumes scalar would be faster, and with
> -fvect-cost-model=unlimited it picks a vector type that's so big that the
> large sequence generated is working on mostly inactive lanes:
> 
>         ...
>         and     p3.b, p3/z, p4.b, p4.b
>         whilelo p0.s, wzr, w7
>         ld1w    z23.s, p3/z, [x3, #3, mul vl]
>         ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
>         add     x0, x5, x0
>         punpklo p6.h, p6.b
>         ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
>         and     p6.b, p6/z, p0.b, p0.b
>         punpklo p4.h, p7.b
>         ld1w    z24.s, p6/z, [x3, #2, mul vl]
>         and     p4.b, p4/z, p2.b, p2.b
>         uqdecw  w6
>         ld1w    z26.s, p4/z, [x3]
>         whilelo p1.s, wzr, w6
>         mul     z27.s, p5/m, z27.s, z23.s
>         ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
>         punpkhi p7.h, p7.b
>         mul     z24.s, p5/m, z24.s, z28.s
>         and     p7.b, p7/z, p1.b, p1.b
>         mul     z26.s, p5/m, z26.s, z30.s
>         ld1w    z25.s, p7/z, [x3, #1, mul vl]
>         st1w    z27.s, p3, [x2, #3, mul vl]
>         mul     z25.s, p5/m, z25.s, z29.s
>         st1w    z24.s, p6, [x2, #2, mul vl]
>         st1w    z25.s, p7, [x2, #1, mul vl]
>         st1w    z26.s, p4, [x2]
>         ...
> 
> With -fvect-cost-model=dynamic --param vect-scalar-cost-multiplier=200
> you get more reasonable code:
> 
> foo:
>         cmp     w4, 1
>         ble     .L1
>         ptrue   p7.s, vl3
>         index   z0.s, #0, w4
>         ld1b    z29.s, p7/z, [x0]
>         ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
>         ptrue   p6.b, all
>         cmpne   p7.b, p7/z, z29.b, #0
>         ld1w    z31.s, p7/z, [x3]
>         mul     z31.s, p6/m, z31.s, z30.s
>         st1w    z31.s, p7, [x2]
> .L1:
>         ret
> 
> This model has been useful internally for performance exploration and
> cost-model validation.  It allows us to force realistic vectorization,
> overriding the cost model, so we can tell whether the cost model is
> correct wrt profitability.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu
> (-m32 and -m64) with no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       * params.opt (vect-scalar-cost-multiplie): New.

Missing trailing "r" (vect-scalar-cost-multiplier).

>       * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Use it.
>       * doc/invoke.texi (vect-scalar-cost-multiplie): Document it.

Likewise.
 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/aarch64/sve/cost_model_16.c: New test.
> 
> ---
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index f31d504f99e21ff282bd1c2bcb61e4dd0397a748..b58a971f36fce7facfab2a72b2500a471c4e0bc9 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -17273,6 +17273,10 @@ this parameter.  The default value of this parameter is 50.
>  @item vect-induction-float
>  Enable loop vectorization of floating point inductions.
>  
> +@item vect-scalar-cost-multiplier
> +Apply the given penalty to scalar loop costing during vectorization.

Apply the given multiplier to scalar loop ...

> +Increasing the cost multiplier will make vector loops more profitable.
> +
>  @item vrp-block-limit
>  Maximum number of basic blocks before VRP switches to a lower memory algorithm.
>  
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 1f0abeccc4b9b439ad4a4add6257b4e50962863d..f89ffe8382d55a51c8573d7dd76853a05b530f90 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1253,6 +1253,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
>  Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
>  Enable loop vectorization of floating point inductions.
>  
> +-param=vect-scalar-cost-multiplier=
> +Common Joined UInteger Var(param_vect_scalar_cost_multiplier) Init(1) IntegerRange(0, 100000) Param Optimization
> +The scaling multiplier to add to all scalar loop costing when performing vectorization profitability analysis.  The default value is 1.
> +

Note this only allows whole-number scaling.  May I suggest instead using
a percentage as the unit, so the effective multiplier becomes
param_vect_scalar_cost_multiplier / 100?
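
I.e., as an untested sketch of what that could look like at the line the
patch touches, assuming Init(100) so the default stays a no-op:

  /* Scale the scalar cost by a percentage; 100 means no scaling.  */
  scalar_single_iter_cost
    = (loop_vinfo->scalar_costs->total_cost ()
       * param_vect_scalar_cost_multiplier) / 100;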

>  -param=vrp-block-limit=
>  Common Joined UInteger Var(param_vrp_block_limit) Init(150000) Optimization Param
>  Maximum number of basic blocks before VRP switches to a fast model with less memory requirements.
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c405591a101d50b4734bc6d65a6d6c01888bea48
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -march=armv8-a+sve -mmax-vectorization -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c,
> +     int *restrict d, int stride)
> +{
> +    if (stride <= 1)
> +        return;
> +
> +    for (int i = 0; i < 3; i++)
> +        {
> +            int res = c[i];
> +            int t = b[i * stride];
> +            if (a[i] != 0)
> +                res = t * d[i];
> +            c[i] = res;
> +        }
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a0d933aa100c5dd5f0fc78f1eec71a032df29325 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4646,7 +4646,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>       TODO: Consider assigning different costs to different scalar
>       statements.  */
>  
> -  scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ();
> +  scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ()
> +                         * param_vect_scalar_cost_multiplier;
>  
>    /* Add additional cost for the peeled instructions in prologue and epilogue
>       loop.  (For fully-masked loops there will be no peeling.)
> 
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
