[PATCH]middle-end: Add new "max" vector cost model

Tamar Christina Wed, 23 Apr 2025 00:18:04 -0700

Hi All,

This patch proposes a new vector cost model called "max".  The cost model is a
cross between two of our existing cost models.  Like `unlimited` it disables
costing against scalar code and assumes all vectorization to be profitable.

But unlike `unlimited` it does not fully disable the vector cost model.  That
means that we still perform cost comparisons between vector modes.
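
To make the difference concrete, here is a toy C sketch of how the models
differ.  It is not the vectorizer's code: `choose`, `struct candidate`,
`demo_cost_models` and the cost numbers are made up for illustration, and
"unlimited takes the first mode that works" is a simplification.  Only the
enum values are taken from the flag-types.h change in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Values as in the flag-types.h hunk of this patch.  */
enum vect_cost_model {
  VECT_COST_MODEL_VERY_CHEAP = -4,
  VECT_COST_MODEL_CHEAP = -3,
  VECT_COST_MODEL_DYNAMIC = -2,
  VECT_COST_MODEL_MAX = -1,
  VECT_COST_MODEL_UNLIMITED = 0
};

/* Hypothetical: one vector mode the loop could be vectorized with,
   and a made-up cost for it.  */
struct candidate {
  const char *mode;
  unsigned cost;
};

/* Toy decision.  Returns the index of the chosen candidate, or -1 to
   keep the scalar loop.  Assumes candidates are listed in the order
   they would be tried.  */
static int
choose (enum vect_cost_model model, unsigned scalar_cost,
        const struct candidate *cands, size_t n)
{
  if (n == 0)
    return -1;

  /* unlimited: no costing at all, so the first mode that works wins
     (simplification).  */
  if (model == VECT_COST_MODEL_UNLIMITED)
    return 0;

  /* Every other model still compares vector modes with each other.  */
  size_t best = 0;
  for (size_t i = 1; i < n; i++)
    if (cands[i].cost < cands[best].cost)
      best = i;

  /* max: skip the vector vs scalar check.  */
  if (model == VECT_COST_MODEL_MAX)
    return (int) best;

  /* dynamic (and the cheaper models, ignoring their extra limits):
     the best vector mode must also beat scalar.  */
  return cands[best].cost < scalar_cost ? (int) best : -1;
}

/* Usage: a wide mode tried first, a narrower and cheaper one second,
   and scalar code cheaper than both.  */
void
demo_cost_models (void)
{
  const struct candidate cands[] = { { "wide", 30 }, { "narrow", 12 } };

  assert (choose (VECT_COST_MODEL_UNLIMITED, 10, cands, 2) == 0);
  assert (choose (VECT_COST_MODEL_MAX, 10, cands, 2) == 1);
  assert (choose (VECT_COST_MODEL_DYNAMIC, 10, cands, 2) == -1);
}
```

In this toy setup `unlimited` takes the wide mode, `max` takes the cheaper
narrow mode, and `dynamic` keeps the scalar loop.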

As an example, the following:

void
foo (char *restrict a, int *restrict b, int *restrict c,
     int *restrict d, int stride)
{
    if (stride <= 1)
        return;

    for (int i = 0; i < 3; i++)
        {
            int res = c[i];
            int t = b[i * stride];
            if (a[i] != 0)
                res = t * d[i];
            c[i] = res;
        }
}

compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
vectorize because the cost model estimates that scalar code would be faster.
With -fvect-cost-model=unlimited the vectorizer picks a vector type so wide
that the long sequence it generates operates on mostly inactive lanes:

        ...
        and     p3.b, p3/z, p4.b, p4.b
        whilelo p0.s, wzr, w7
        ld1w    z23.s, p3/z, [x3, #3, mul vl]
        ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
        add     x0, x5, x0
        punpklo p6.h, p6.b
        ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
        and     p6.b, p6/z, p0.b, p0.b
        punpklo p4.h, p7.b
        ld1w    z24.s, p6/z, [x3, #2, mul vl]
        and     p4.b, p4/z, p2.b, p2.b
        uqdecw  w6
        ld1w    z26.s, p4/z, [x3]
        whilelo p1.s, wzr, w6
        mul     z27.s, p5/m, z27.s, z23.s
        ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
        punpkhi p7.h, p7.b
        mul     z24.s, p5/m, z24.s, z28.s
        and     p7.b, p7/z, p1.b, p1.b
        mul     z26.s, p5/m, z26.s, z30.s
        ld1w    z25.s, p7/z, [x3, #1, mul vl]
        st1w    z27.s, p3, [x2, #3, mul vl]
        mul     z25.s, p5/m, z25.s, z29.s
        st1w    z24.s, p6, [x2, #2, mul vl]
        st1w    z25.s, p7, [x2, #1, mul vl]
        st1w    z26.s, p4, [x2]
        ...
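
A rough way to put numbers on "mostly inactive": the loop runs 3 iterations,
and the number of lanes per vector depends on the element size the
vectorization factor is based on.  The sketch below assumes a 128-bit SVE
vector length (the real length is not known at compile time) and assumes the
wide choice is based on the byte-sized a[] elements while a narrower choice
uses 32-bit lanes.  Both are a reading of the listings in this mail, not
something taken from the vectorizer dumps, and the helper names are
hypothetical.

```c
/* Number of lanes in one vector of VL_BITS bits with ELT_BITS-bit
   elements.  Assumes 0 < ELT_BITS <= VL_BITS.  */
static unsigned
lanes_per_vector (unsigned vl_bits, unsigned elt_bits)
{
  return vl_bits / elt_bits;
}

/* Fraction of lanes doing useful work when NITERS iterations are
   spread over one vectorization factor's worth of lanes.  */
static double
active_fraction (unsigned niters, unsigned vl_bits, unsigned elt_bits)
{
  unsigned lanes = lanes_per_vector (vl_bits, elt_bits);
  unsigned active = niters < lanes ? niters : lanes;
  return (double) active / lanes;
}
```

Under those assumptions, byte lanes give 3 of 16 lanes active (0.1875) while
32-bit lanes give 3 of 4 (0.75).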

With -fvect-cost-model=max you get more reasonable code:

foo:
        cmp     w4, 1
        ble     .L1
        ptrue   p7.s, vl3
        index   z0.s, #0, w4
        ld1b    z29.s, p7/z, [x0]
        ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
        ptrue   p6.b, all
        cmpne   p7.b, p7/z, z29.b, #0
        ld1w    z31.s, p7/z, [x3]
        mul     z31.s, p6/m, z31.s, z30.s
        st1w    z31.s, p7, [x2]
.L1:
        ret

This model has been useful internally for performance exploration and cost-model
validation.  It lets us force realistic vectorization that overrides the vector
vs scalar decision, so we can tell whether the cost model is correct wrt
profitability.

Bootstrapped and regtested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf and x86_64-pc-linux-gnu (-m32 and -m64)
with no issues.
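
The patch does not include a testcase.  A possible one, based on the example
above, could look like the sketch below.  The dg- directives are the usual
testsuite ones, but where the file would live (presumably somewhere under
gcc.target/aarch64/sve) and the scan pattern are assumptions: the single st1w
is just what the -fvect-cost-model=max listing in this mail shows, and it may
differ with other tuning or compiler versions.

```c
/* { dg-do compile } */
/* { dg-options "-O3 -march=armv8-a+sve -fvect-cost-model=max" } */

void
foo (char *restrict a, int *restrict b, int *restrict c,
     int *restrict d, int stride)
{
  if (stride <= 1)
    return;

  for (int i = 0; i < 3; i++)
    {
      int res = c[i];
      int t = b[i * stride];
      if (a[i] != 0)
        res = t * d[i];
      c[i] = res;
    }
}

/* Assumption: with the max model the loop is vectorized with a single
   predicated store, as in the listing in this mail.  */
/* { dg-final { scan-assembler-times {\tst1w\t} 1 } } */
```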

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        * common.opt (vect-cost-model, simd-cost-model): Add max cost model.
        * doc/invoke.texi: Document it.
        * flag-types.h (enum vect_cost_model): Add VECT_COST_MODEL_MAX.
        * tree-vect-data-refs.cc (vect_peeling_hash_insert,
        vect_peeling_hash_choose_best_peeling,
        vect_enhance_data_refs_alignment): Use it.
        * tree-vect-loop.cc (vect_analyze_loop_costing,
        vect_estimate_min_profitable_iters): Likewise.

---
diff --git a/gcc/common.opt b/gcc/common.opt
index 88d987e6ab14d9f8df7aa686efffc43418dbb42d..bd5e2e951f9388b12206d9addc736e336cd0e4ee 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3442,11 +3442,11 @@ Enable basic block vectorization (SLP) on trees.
 
 fvect-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
--fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the cost model for vectorization.
+-fvect-cost-model=[unlimited|max|dynamic|cheap|very-cheap]	Specifies the cost model for vectorization.
 
 fsimd-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
--fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the vectorization cost model for code marked with a simd directive.
+-fsimd-cost-model=[unlimited|max|dynamic|cheap|very-cheap]	Specifies the vectorization cost model for code marked with a simd directive.
 
 Enum
 Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
@@ -3454,6 +3454,9 @@ Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer
 EnumValue
 Enum(vect_cost_model) String(unlimited) Value(VECT_COST_MODEL_UNLIMITED)
 
+EnumValue
+Enum(vect_cost_model) String(max) Value(VECT_COST_MODEL_MAX)
+
 EnumValue
 Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 14a78fd236f64185fc129f18b52b20692d49305c..e7b242c9134ff17022c92f81c8b24762cfd59c6c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -14449,9 +14449,11 @@ With the @samp{unlimited} model the vectorized code-path is assumed
 to be profitable while with the @samp{dynamic} model a runtime check
 guards the vectorized code-path to enable it only for iteration
 counts that will likely execute faster than when executing the original
-scalar loop.  The @samp{cheap} model disables vectorization of
-loops where doing so would be cost prohibitive for example due to
-required runtime checks for data dependence or alignment but otherwise
+scalar loop.  The @samp{max} model similarly to the @samp{unlimited} model
+assumes all vector code is profitable over scalar within loops but does not
+disable the vector to vector costing.  The @samp{cheap} model disables
+vectorization of loops where doing so would be cost prohibitive for example due
+to required runtime checks for data dependence or alignment but otherwise
 is equal to the @samp{dynamic} model.  The @samp{very-cheap} model disables
 vectorization of loops when any runtime check for data dependence or alignment
 is required, it also disables vectorization of epilogue loops but otherwise is
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index db573768c23d9f6809ae115e71370960314f16ce..1c941c295a2e608eae58c3e3fb0eba1284f731ca 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -277,9 +277,10 @@ enum scalar_storage_order_kind {
 /* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
    the most conservative to the least conservative.  */
 enum vect_cost_model {
-  VECT_COST_MODEL_VERY_CHEAP = -3,
-  VECT_COST_MODEL_CHEAP = -2,
-  VECT_COST_MODEL_DYNAMIC = -1,
+  VECT_COST_MODEL_VERY_CHEAP = -4,
+  VECT_COST_MODEL_CHEAP = -3,
+  VECT_COST_MODEL_DYNAMIC = -2,
+  VECT_COST_MODEL_MAX = -1,
   VECT_COST_MODEL_UNLIMITED = 0,
   VECT_COST_MODEL_DEFAULT = 1
 };
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index c9395e33fcdfc7deedd979c764daae93b15abace..5c56956c2edcb76210c36b60526f031011c8e0c7 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -1847,7 +1847,9 @@ vect_peeling_hash_insert (hash_table<peel_info_hasher> *peeling_htab,
   /* If this DR is not supported with unknown misalignment then bias
      this slot when the cost model is disabled.  */
   if (!supportable_if_not_aligned
-      && unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+      && (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
+	  || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
+		== VECT_COST_MODEL_MAX))
     slot->count += VECT_MAX_COST;
 }
 
@@ -2002,7 +2004,8 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
    res.peel_info.dr_info = NULL;
    res.vinfo = loop_vinfo;
 
-   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+   if (!unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
+	&& loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) != VECT_COST_MODEL_MAX)
      {
        res.inside_cost = INT_MAX;
        res.outside_cost = INT_MAX;
@@ -2348,7 +2351,8 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
                  We do this automatically for cost model, since we calculate
 		 cost for every peeling option.  */
 	      poly_uint64 nscalars = npeel_tmp;
-              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+              if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo))
+		  || loop_cost_model (LOOP_VINFO_LOOP (loop_vinfo)) == VECT_COST_MODEL_MAX)
 		{
 		  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
 		  unsigned group_size = 1;
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 958b829fa8d1ad267fbde3be915719f3a51e6a38..5f3adc257f6581850f901c7747771f5931df942a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2407,7 +2407,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
 				      &min_profitable_estimate,
 				      suggested_unroll_factor);
 
-  if (min_profitable_iters < 0)
+  if (min_profitable_iters < 0
+      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2430,7 +2431,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = th;
 
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th)
+      && LOOP_VINFO_INT_NITERS (loop_vinfo) < th
+      && loop_cost_model (loop) != VECT_COST_MODEL_MAX)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2490,6 +2492,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
 	estimated_niter = likely_max_stmt_executions_int (loop);
     }
   if (estimated_niter != -1
+      && loop_cost_model (loop) != VECT_COST_MODEL_MAX
       && ((unsigned HOST_WIDE_INT) estimated_niter
 	  < MAX (th, (unsigned) min_profitable_estimate)))
     {
@@ -4638,7 +4641,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
   vector_costs *target_cost_data = loop_vinfo->vector_costs;
 
   /* Cost model disabled.  */
-  if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
+   if (unlimited_cost_model (LOOP_VINFO_LOOP (loop_vinfo)))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "cost model disabled.\n");
