Hi All,

This patch adds a new param, vect-scalar-cost-multiplier, to scale the scalar
costing during vectorization.  If the multiplier is set high enough then, when
using the dynamic cost model, it effectively disables the comparison against
the scalar cost and treats all vectorization as profitable.
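
As a rough, hypothetical illustration of the effect (the numbers below are
made up rather than taken from an actual costing dump): with a vectorization
factor of 4, a scalar iteration costed at 10 and a vector iteration costed at
60, the comparison becomes approximately:

        dynamic model:      4 * 10   = 40    < 60  -> not profitable
        multiplier of 200:  4 * 2000 = 8000  > 60  -> profitable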

This is similar to using the unlimited cost model, but unlike unlimited it
does not fully disable the vector cost model.  That means we still perform
cost comparisons between vector modes, and we still do the costing for the
runtime alias checks.

As an example, the following:

void
foo (char *restrict a, int *restrict b, int *restrict c,
     int *restrict d, int stride)
{
    if (stride <= 1)
        return;

    for (int i = 0; i < 3; i++)
        {
            int res = c[i];
            int t = b[i * stride];
            if (a[i] != 0)
                res = t * d[i];
            c[i] = res;
        }
}

compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
vectorize because the cost model assumes the scalar code would be faster,
while with -fvect-cost-model=unlimited it picks a vector type so wide that
the large sequence generated works on mostly inactive lanes:

        ...
        and     p3.b, p3/z, p4.b, p4.b
        whilelo p0.s, wzr, w7
        ld1w    z23.s, p3/z, [x3, #3, mul vl]
        ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
        add     x0, x5, x0
        punpklo p6.h, p6.b
        ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
        and     p6.b, p6/z, p0.b, p0.b
        punpklo p4.h, p7.b
        ld1w    z24.s, p6/z, [x3, #2, mul vl]
        and     p4.b, p4/z, p2.b, p2.b
        uqdecw  w6
        ld1w    z26.s, p4/z, [x3]
        whilelo p1.s, wzr, w6
        mul     z27.s, p5/m, z27.s, z23.s
        ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
        punpkhi p7.h, p7.b
        mul     z24.s, p5/m, z24.s, z28.s
        and     p7.b, p7/z, p1.b, p1.b
        mul     z26.s, p5/m, z26.s, z30.s
        ld1w    z25.s, p7/z, [x3, #1, mul vl]
        st1w    z27.s, p3, [x2, #3, mul vl]
        mul     z25.s, p5/m, z25.s, z29.s
        st1w    z24.s, p6, [x2, #2, mul vl]
        st1w    z25.s, p7, [x2, #1, mul vl]
        st1w    z26.s, p4, [x2]
        ...

With -fvect-cost-model=dynamic --param vect-scalar-cost-multiplier=200
you get more reasonable code:

foo:
        cmp     w4, 1
        ble     .L1
        ptrue   p7.s, vl3
        index   z0.s, #0, w4
        ld1b    z29.s, p7/z, [x0]
        ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
        ptrue   p6.b, all
        cmpne   p7.b, p7/z, z29.b, #0
        ld1w    z31.s, p7/z, [x3]
        mul     z31.s, p6/m, z31.s, z30.s
        st1w    z31.s, p7, [x2]
.L1:
        ret

This model has been useful internally for performance exploration and cost-model
validation.  It allows us to force realistic vectorization while overriding the
cost model, so we can tell whether the model's profitability decisions are
correct.
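
For reference, the kind of invocation used for this exploration looks roughly
like the following (the source file name is just a placeholder):

        gcc -O3 -march=armv8-a+sve -fvect-cost-model=dynamic \
            --param vect-scalar-cost-multiplier=200 -S foo.c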

Bootstrapped and regtested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu
(-m32 and -m64) with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        * params.opt (vect-scalar-cost-multiplier): New.
        * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Use it.
        * doc/invoke.texi (vect-scalar-cost-multiplier): Document it.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/sve/cost_model_16.c: New test.

---
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f31d504f99e21ff282bd1c2bcb61e4dd0397a748..b58a971f36fce7facfab2a72b2500a471c4e0bc9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -17273,6 +17273,10 @@ this parameter.  The default value of this parameter is 50.
 @item vect-induction-float
 Enable loop vectorization of floating point inductions.
 
+@item vect-scalar-cost-multiplier
+Apply the given penalty to scalar loop costing during vectorization.
+Increasing the cost multiplier will make vector loops more profitable.
+
 @item vrp-block-limit
 Maximum number of basic blocks before VRP switches to a lower memory algorithm.
 
diff --git a/gcc/params.opt b/gcc/params.opt
index 1f0abeccc4b9b439ad4a4add6257b4e50962863d..f89ffe8382d55a51c8573d7dd76853a05b530f90 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1253,6 +1253,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=vect-scalar-cost-multiplier=
+Common Joined UInteger Var(param_vect_scalar_cost_multiplier) Init(1) IntegerRange(0, 100000) Param Optimization
+The scaling multiplier to add to all scalar loop costing when performing vectorization profitability analysis.  The default value is 1.
+
 -param=vrp-block-limit=
 Common Joined UInteger Var(param_vrp_block_limit) Init(150000) Optimization Param
 Maximum number of basic blocks before VRP switches to a fast model with less memory requirements.
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
new file mode 100644
index 0000000000000000000000000000000000000000..c405591a101d50b4734bc6d65a6d6c01888bea48
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_16.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8-a+sve -mmax-vectorization -fdump-tree-vect-details" } */
+
+void
+foo (char *restrict a, int *restrict b, int *restrict c,
+     int *restrict d, int stride)
+{
+    if (stride <= 1)
+        return;
+
+    for (int i = 0; i < 3; i++)
+        {
+            int res = c[i];
+            int t = b[i * stride];
+            if (a[i] != 0)
+                res = t * d[i];
+            c[i] = res;
+        }
+}
+
+/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index fe6f3cf188e40396b299ff9e814cc402bc2d4e2d..a0d933aa100c5dd5f0fc78f1eec71a032df29325 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -4646,7 +4646,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Consider assigning different costs to different scalar
      statements.  */
 
-  scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ();
+  scalar_single_iter_cost = loop_vinfo->scalar_costs->total_cost ()
+			    * param_vect_scalar_cost_multiplier;
 
   /* Add additional cost for the peeled instructions in prologue and epilogue
      loop.  (For fully-masked loops there will be no peeling.)
