[PATCH] Don't vectorize when vector stmts are only vec_construct and stores

liuhongt Sun, 03 Dec 2023 21:32:29 -0800

I.e., for cases like the one below:
   a[0] = b1;
   a[1] = b2;
   ..
   a[n] = bn;

There are extra dependences when constructing the vector, but not for
the scalar stores. According to experiments, the vectorized form is
generally worse.
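
As a rough illustration of the two code shapes being compared, here is
a hand-written sketch using GCC's generic vector extension. The names
store_scalar, store_vector and v4si are made up for this example and do
not come from the patch; the second function only approximates what the
SLP vectorizer would emit.

```c
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* Scalar form: four independent stores, each depending only on its
   own source value.  */
void
store_scalar (int *a, int b1, int b2, int b3, int b4)
{
  a[0] = b1;
  a[1] = b2;
  a[2] = b3;
  a[3] = b4;
}

/* Hand-written approximation of the vectorized form: build one vector
   from the four scalars (the vec_construct), then do a single vector
   store that has to wait for all four inputs.  */
void
store_vector (int *a, int b1, int b2, int b3, int b4)
{
  v4si v = { b1, b2, b3, b4 };
  memcpy (a, &v, sizeof v);
}
```

Both functions write the same four values; the difference is only in
how the stores depend on their inputs.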


The patch adds a cut-off heuristic for the case where the vector stmts
are just vec_construct and vector stores. It improves SPEC2017 a little
bit.
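
The shape of the heuristic can be modelled in a few lines of standalone
C. This is only a sketch under simplifying assumptions: enum stmt_kind,
struct cost_model and the cost_model_* functions are hypothetical
stand-ins for vect_cost_for_stmt and ix86_vector_costs, a single body
cost replaces the prologue/body/epilogue costs, and the NOP_EXPR
special case for vector_stmt is left out.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for GCC's vect_cost_for_stmt; only a few
   kinds are listed.  */
enum stmt_kind
{
  KIND_VECTOR_STMT,
  KIND_VECTOR_LOAD,
  KIND_VECTOR_STORE,
  KIND_UNALIGNED_STORE,
  KIND_VEC_CONSTRUCT,
  KIND_VEC_PROMOTE_DEMOTE
};

/* Hypothetical, much reduced cost-model state.  */
struct cost_model
{
  int body_cost;
  bool construct_store_only;
};

void
cost_model_init (struct cost_model *m)
{
  m->body_cost = 0;
  /* Assume the pattern until some statement proves otherwise.  */
  m->construct_store_only = true;
}

/* Return true if KIND belongs to the construct-and-store pattern.  */
static bool
construct_or_store_p (enum stmt_kind kind)
{
  switch (kind)
    {
    case KIND_VEC_CONSTRUCT:
    case KIND_VECTOR_STORE:
    case KIND_UNALIGNED_STORE:
    case KIND_VEC_PROMOTE_DEMOTE:
      return true;
    default:
      return false;
    }
}

/* Record one statement: clear the flag on the first statement that is
   outside the pattern, then accumulate its cost.  */
void
cost_model_add_stmt (struct cost_model *m, enum stmt_kind kind, int cost)
{
  if (m->construct_store_only && !construct_or_store_p (kind))
    m->construct_store_only = false;
  m->body_cost += cost;
}

/* Cut-off: if only constructs and stores were seen, make the vector
   cost prohibitive so the scalar code wins.  */
void
cost_model_finish (struct cost_model *m)
{
  if (m->construct_store_only)
    m->body_cost = INT_MAX;
}
```

Feeding this model a vec_construct plus a vector store leaves the cost
at INT_MAX, while adding any other kind (for example a vector load)
keeps the accumulated cost.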

Benchmark               Ratio
500.perlbench_r         2.60%
502.gcc_r               0.30%
505.mcf_r               0.40%
520.omnetpp_r           -1.00%
523.xalancbmk_r         0.90%
525.x264_r              0.00%
531.deepsjeng_r         0.30%
541.leela_r             0.90%
548.exchange2_r         3.20%
557.xz_r                1.40%
503.bwaves_r            0.00%
507.cactuBSSN_r         0.00%
508.namd_r              0.30%
510.parest_r            0.00%
511.povray_r            0.20%
519.lbm_r               SAME BIN
521.wrf_r               -0.30%
526.blender_r           -1.20%
527.cam4_r              -0.20%
538.imagick_r           4.00%
544.nab_r               0.40%
549.fotonik3d_r         0.00%
554.roms_r              0.00%
Geomean-int             0.90%
Geomean-fp              0.30%
Geomean-all             0.50%

Regressed testcases:

gcc.target/i386/part-vect-absneghf.c
gcc.target/i386/part-vect-copysignhf.c
gcc.target/i386/part-vect-xorsignhf.c

Regressed under -m32 since it generates 2 vector
.ABS/NEG/XORSIGN/COPYSIGN vs. the original 1 64-bit vec_construct. The
original testcases are used to test vectorization capability for
.ABS/NEG/XORSIGN/COPYSIGN, so just restrict the testcases to
TARGET_64BIT.

gcc.target/i386/pr111023-2.c
gcc.target/i386/pr111023.c
Regressed under -m32.

The testcase is as below:

void
v8hi_v8qi (v8hi *dst, v16qi src)
{
  short tem[8];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  tem[4] = src[4];
  tem[5] = src[5];
  tem[6] = src[6];
  tem[7] = src[7];
  dst[0] = *(v8hi *) tem;
}

Under a 64-bit target, the vectorizer realizes it's just a permutation
of the original src vector, but under -m32, the vectorizer relies on
vec_construct for vectorization. I think optimization for this case
under a 32-bit target may not matter much, so just add
-fno-vect-cost-model.

gcc.target/i386/pr91446.c: This testcase guards the cost model of
vector stores, not vectorization capability, so just adjust the
testcase.

gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to
optimize for bswap. As with other optimizations, the vectorizer can't
foresee the optimization that happens after it. So the current solution
is to add -fno-vect-cost-model to the testcase.

costmodel-pr104582-1.c
costmodel-pr104582-3.c
costmodel-pr104582-4.c

These failed since the code is now not vectorized. Looking at the PR,
that's exactly what's wanted, so adjust the testcases to
scan-tree-dump-not.


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

        PR target/99881
        PR target/104582
        * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
        Check if kind is vec_construct or vector store.
        (ix86_vector_costs::finish_cost): Don't do vectorization when
        vector stmts are only vec_construct and stores.
        (ix86_vector_costs::ix86_vect_construct_store_only_p): New
        function.
        (ix86_vector_costs::ix86_vect_cut_off): Ditto.

gcc/testsuite/ChangeLog:

        * gcc.target/i386/part-vect-absneghf.c: Restrict testcase to
        TARGET_64BIT.
        * gcc.target/i386/part-vect-copysignhf.c: Ditto.
        * gcc.target/i386/part-vect-xorsignhf.c: Ditto.
        * gcc.target/i386/pr91446.c: Adjust testcase.
        * gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model.
        * gcc.target/i386/pr111023-2.c: Ditto.
        * gcc.target/i386/pr111023.c: Ditto.
        * gcc.target/i386/pr99881.c: Remove xfail.
        * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Change
        to scan-tree-dump-not.
        * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto.
        * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto.
---
 gcc/config/i386/i386.cc                       | 81 ++++++++++++++++++-
 .../costmodel/x86_64/costmodel-pr104582-1.c   |  2 +-
 .../costmodel/x86_64/costmodel-pr104582-3.c   |  2 +-
 .../costmodel/x86_64/costmodel-pr104582-4.c   |  2 +-
 .../gcc.target/i386/part-vect-absneghf.c      |  4 +-
 .../gcc.target/i386/part-vect-copysignhf.c    |  4 +-
 .../gcc.target/i386/part-vect-xorsignhf.c     |  4 +-
 gcc/testsuite/gcc.target/i386/pr108938-3.c    |  2 +-
 gcc/testsuite/gcc.target/i386/pr111023-2.c    |  2 +-
 gcc/testsuite/gcc.target/i386/pr111023.c      |  2 +-
 gcc/testsuite/gcc.target/i386/pr91446.c       | 14 ++--
 gcc/testsuite/gcc.target/i386/pr99881.c       |  2 +-
 12 files changed, 99 insertions(+), 22 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index dcaea6c2096..a4b23e29eba 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24573,6 +24573,10 @@ public:
 
 private:
 
+  /* Don't do vectorization for certain patterns.  */
+  void ix86_vect_cut_off ();
+
+  bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info);
   /* Estimate register pressure of the vectorized code.  */
   void ix86_vect_estimate_reg_pressure ();
   /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
@@ -24581,12 +24585,15 @@ private:
      where we know it's not loaded from memory.  */
   unsigned m_num_gpr_needed[3];
   unsigned m_num_sse_needed[3];
+
+  bool m_vec_construct_store_only;
 };
 
 ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar)
   : vector_costs (vinfo, costing_for_scalar),
     m_num_gpr_needed (),
-    m_num_sse_needed ()
+    m_num_sse_needed (),
+    m_vec_construct_store_only (true)
 {
 }
 
@@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
     = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store);
   int stmt_cost = - 1;
 
+  if (m_vec_construct_store_only
+      && !ix86_vect_construct_store_only_p (kind, stmt_info))
+    m_vec_construct_store_only = false;
+
   bool fp = false;
   machine_mode mode = scalar_p ? SImode : TImode;
 
@@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
     }
 }
 
+/* Return true if KIND is vec_construct or a vector store.  A
+   promotion/demotion between the vec_construct and the vector store
+   (see STMT_INFO) is counted as well.  */
+bool
+ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind,
+                                                   stmt_vec_info stmt_info)
+{
+  switch (kind)
+    {
+    case vec_construct:
+    case vector_store:
+    case unaligned_store:
+    case vector_scatter_store:
+    case vec_promote_demote:
+      return true;
+
+      /* Vectorizer will try VEC_PACK_TRUNC_EXPR for things like
+        char* a;
+        short b1, b2, b3, b4;
+        a[0] = b1;
+        a[1] = b2;
+        a[2] = b3;
+        a[3] = b4;
+        Also don't vectorize it.  */
+    case vector_stmt:
+      if (stmt_info && stmt_info->stmt
+         && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
+         && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR)
+       return true;
+
+    default:
+      break;
+    }
+
+  return false;
+}
+
 void
-ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+ix86_vector_costs::ix86_vect_cut_off ()
 {
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
   if (loop_vinfo && !m_costing_for_scalar)
@@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
          && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
              > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
        m_costs[vect_body] = INT_MAX;
+      return;
+    }
+
+  /* Don't do vectorization for things like
+
+     a[0] = b1;
+     a[1] = b2;
+     ..
+     a[n] = bn;
+
+     There are extra dependences when constructing the vector, but not
+     for scalar stores.  According to experiments, it's generally worse.  */
+  if (m_vec_construct_store_only)
+    {
+      m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX;
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_NOTE, vect_location,
+                        "Skip vectorization for stmts which only contain"
+                        " vec_construct and vector store.\n");
+      return;
     }
 
+  return;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+
   ix86_vect_estimate_reg_pressure ();
 
+  ix86_vect_cut_off ();
+
   vector_costs::finish_cost (scalar_costs);
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
index 992a845ad7a..f940af70b72 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
@@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b)
   s.b = b_;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
index 999c4905708..eff60b2a82a 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
@@ -10,4 +10,4 @@ foo (double a, double b)
   s.b = b;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
index cc471e1ed73..2f354338e96 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
@@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b)
   s.b = b_;
 }
 
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
index 48aed14d604..4052210ec39 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
@@ -85,7 +85,7 @@ do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
index 811617bc3dd..006941f6651 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
@@ -55,6 +55,6 @@ do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
index a8ec60a088a..a57dc5aba23 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
@@ -55,6 +55,6 @@ do_test (void)
       abort ();
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
 /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c
index 32ac544c7ed..24725a9ab1d 100644
--- a/gcc/testsuite/gcc.target/i386/pr108938-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */
+/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */
 /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */
 
diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c
index 6c69f947544..db25668a9ae 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c
@@ -1,6 +1,6 @@
 /* PR target/111023 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
 
 typedef char v16qi __attribute__((vector_size (16)));
 typedef short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c
index 6144c371f32..18cd579d937 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023.c
@@ -1,6 +1,6 @@
 /* PR target/111023 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
 
 typedef unsigned char v16qi __attribute__((vector_size (16)));
 typedef unsigned short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
index 0243ca3ea68..3e0f8a4b315 100644
--- a/gcc/testsuite/gcc.target/i386/pr91446.c
+++ b/gcc/testsuite/gcc.target/i386/pr91446.c
@@ -10,15 +10,15 @@ typedef struct
 extern void bar (info *);
 
 void
-foo (unsigned long long width, unsigned long long height,
-     long long x, long long y)
+foo (unsigned long long* width,
+     long long* x)
 {
   info t;
-  t.width = width;
-  t.height = height;
-  t.x = x;
-  t.y = y;
+  t.width = width[0];
+  t.height = width[1];
+  t.x = x[0];
+  t.y = x[1];
   bar (&t);
 }
 
-/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
index 3e087eb2ed7..1a59325ac1c 100644
--- a/gcc/testsuite/gcc.target/i386/pr99881.c
+++ b/gcc/testsuite/gcc.target/i386/pr99881.c
@@ -1,7 +1,7 @@
 /* PR target/99881.  */
 /* { dg-do compile { target { ! ia32 } } } */
 /* { dg-options "-Ofast -march=skylake" } */
-/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not "xmm\[0-9\]"  } } */
 
 void
 foo (int* __restrict a, int n, int c)
-- 
2.31.1
