.i.e. for below cases. a[0] = b1; a[1] = b2; .. a[n] = bn; There're extra dependences when contructing the vector, but not for scalar store. According to experiments, it's generally worse.
The patch adds an cut-off heuristic when vec_stmt is just vec_construct and vector store. It improves SPEC2017 a little bit. BenchMarks Ratio 500.perlbench_r 2.60% 502.gcc_r 0.30% 505.mcf_r 0.40% 520.omnetpp_r -1.00% 523.xalancbmk_r 0.90% 525.x264_r 0.00% 531.deepsjeng_r 0.30% 541.leela_r 0.90% 548.exchange2_r 3.20% 557.xz_r 1.40% 503.bwaves_r 0.00% 507.cactuBSSN_r 0.00% 508.namd_r 0.30% 510.parest_r 0.00% 511.povray_r 0.20% 519.lbm_r SAME BIN 521.wrf_r -0.30% 526.blender_r -1.20% 527.cam4_r -0.20% 538.imagick_r 4.00% 544.nab_r 0.40% 549.fotonik3d_r 0.00% 554.roms_r 0.00% Geomean-int 0.90% Geomean-fp 0.30% Geomean-all 0.50% And Regressed testcases: gcc.target/i386/part-vect-absneghf.c gcc.target/i386/part-vect-copysignhf.c gcc.target/i386/part-vect-xorsignhf.c Regressed under -m32 since it generates 2 vector .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The original testcases are used to test vectorization capability for .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT. gcc.target/i386/pr111023-2.c gcc.target/i386/pr111023.c Regressed under -m32 testcase as below void v8hi_v8qi (v8hi *dst, v16qi src) { short tem[8]; tem[0] = src[0]; tem[1] = src[1]; tem[2] = src[2]; tem[3] = src[3]; tem[4] = src[4]; tem[5] = src[5]; tem[6] = src[6]; tem[7] = src[7]; dst[0] = *(v8hi *) tem; } under 64-bit target, vectorizer realize it's just permutation of original src vector, but under -m32, vectorizer relies on vec_construct for vectorization. I think optimziation for this case under 32-bit target maynot impact much, so just add -fno-vect-cost-model. gcc.target/i386/pr91446.c: This testcase is guard for cost model of vector store, not vectorization capability, so just adjust testcase. gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to optimize for bswap, like other optimziation vectorizer can't realize optimization after it. So the current solution is add -fno-vect-cost-model to the testcase. costmodel-pr104582-1.c costmodel-pr104582-2.c costmodel-pr104582-4.c Failed since it's now not vectorized, looked at the PR, it's exactly what's wanted, so adjust testcase to scan-tree-dump-not. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/99881 PR target/104582 * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Check if kind is vec_construct or vector store. (ix86_vector_costs::finish_cost): Don't do vectorization when vector stmts are only vec_construct and stores. (ix86_vector_costs::ix86_vect_construct_store_only_p): New function. (ix86_vector_costs::ix86_vect_cut_off): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/part-vect-absneghf.c: Restrict testcase to TARGET_64BIT. * gcc.target/i386/part-vect-copysignhf.c: Ditto. * gcc.target/i386/part-vect-xorsignhf.c: Ditto. * gcc.target/i386/pr91446.c: Adjust testcase. * gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model. * gcc.target/i386/pr111023-2.c: Ditto. * gcc.target/i386/pr111023.c: Ditto. * gcc.target/i386/pr99881.c: Remove xfail. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Changed to Scan-tree-dump-not. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto. * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto. --- gcc/config/i386/i386.cc | 81 ++++++++++++++++++- .../costmodel/x86_64/costmodel-pr104582-1.c | 2 +- .../costmodel/x86_64/costmodel-pr104582-3.c | 2 +- .../costmodel/x86_64/costmodel-pr104582-4.c | 2 +- .../gcc.target/i386/part-vect-absneghf.c | 4 +- .../gcc.target/i386/part-vect-copysignhf.c | 4 +- .../gcc.target/i386/part-vect-xorsignhf.c | 4 +- gcc/testsuite/gcc.target/i386/pr108938-3.c | 2 +- gcc/testsuite/gcc.target/i386/pr111023-2.c | 2 +- gcc/testsuite/gcc.target/i386/pr111023.c | 2 +- gcc/testsuite/gcc.target/i386/pr91446.c | 14 ++-- gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- 12 files changed, 99 insertions(+), 22 deletions(-) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index dcaea6c2096..a4b23e29eba 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -24573,6 +24573,10 @@ public: private: + /* Don't do vectorization for certain patterns. */ + void ix86_vect_cut_off (); + + bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info); /* Estimate register pressure of the vectorized code. */ void ix86_vect_estimate_reg_pressure (); /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for @@ -24581,12 +24585,15 @@ private: where we know it's not loaded from memory. */ unsigned m_num_gpr_needed[3]; unsigned m_num_sse_needed[3]; + + bool m_vec_construct_store_only; }; ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar) : vector_costs (vinfo, costing_for_scalar), m_num_gpr_needed (), - m_num_sse_needed () + m_num_sse_needed (), + m_vec_construct_store_only (true) { } @@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store); int stmt_cost = - 1; + if (m_vec_construct_store_only + && !ix86_vect_construct_store_only_p (kind, stmt_info)) + m_vec_construct_store_only = false; + bool fp = false; machine_mode mode = scalar_p ? SImode : TImode; @@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure () } } +/* Return true if KIND and STMT_INFO is either vec_construct or vector + store, stmt_info could be promote/demote between vec_construct and + vector store, also count that. */ +bool +ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind, + stmt_vec_info stmt_info) +{ + switch (kind) + { + case vec_construct: + case vector_store: + case unaligned_store: + case vector_scatter_store: + case vec_promote_demote: + return true; + + /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes + char* a; + short b1, b2, b3, b4; + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + Also don't vectorized it. */ + case vector_stmt: + if (stmt_info && stmt_info->stmt + && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN + && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR) + return true; + + default: + break; + } + + return false; +} + void -ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +ix86_vector_costs::ix86_vect_cut_off () { loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); if (loop_vinfo && !m_costing_for_scalar) @@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) m_costs[vect_body] = INT_MAX; + return; + } + + /* Don't do vectorization for things like + + a[0] = b1; + a[1] = b2; + .. + a[n] = bn; + + There're extra dependences when contructing the vector, but not for + scalar store. According to experiments, it's generally worse. */ + if (m_vec_construct_store_only) + { + m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Skip vectorization for stmts which only contains" + " vec_construct and vector store.\n"); + return; } + return; +} + +void +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +{ + ix86_vect_estimate_reg_pressure (); + ix86_vect_cut_off (); + vector_costs::finish_cost (scalar_costs); } diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c index 992a845ad7a..f940af70b72 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c @@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c index 999c4905708..eff60b2a82a 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c @@ -10,4 +10,4 @@ foo (double a, double b) s.b = b; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c index cc471e1ed73..2f354338e96 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c @@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c index 48aed14d604..4052210ec39 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c @@ -85,7 +85,7 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c index 811617bc3dd..006941f6651 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c @@ -55,6 +55,6 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c index a8ec60a088a..a57dc5aba23 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c @@ -55,6 +55,6 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c index 32ac544c7ed..24725a9ab1d 100644 --- a/gcc/testsuite/gcc.target/i386/pr108938-3.c +++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */ +/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */ /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */ /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c index 6c69f947544..db25668a9ae 100644 --- a/gcc/testsuite/gcc.target/i386/pr111023-2.c +++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c @@ -1,6 +1,6 @@ /* PR target/111023 */ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ typedef char v16qi __attribute__((vector_size (16))); typedef short v8hi __attribute__((vector_size (16))); diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c index 6144c371f32..18cd579d937 100644 --- a/gcc/testsuite/gcc.target/i386/pr111023.c +++ b/gcc/testsuite/gcc.target/i386/pr111023.c @@ -1,6 +1,6 @@ /* PR target/111023 */ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ typedef unsigned char v16qi __attribute__((vector_size (16))); typedef unsigned short v8hi __attribute__((vector_size (16))); diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c index 0243ca3ea68..3e0f8a4b315 100644 --- a/gcc/testsuite/gcc.target/i386/pr91446.c +++ b/gcc/testsuite/gcc.target/i386/pr91446.c @@ -10,15 +10,15 @@ typedef struct extern void bar (info *); void -foo (unsigned long long width, unsigned long long height, - long long x, long long y) +foo (unsigned long long* width, + long long* x) { info t; - t.width = width; - t.height = height; - t.x = x; - t.y = y; + t.width = width[0]; + t.height = width[1]; + t.x = x[0]; + t.y = x[1]; bar (&t); } -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ +/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c index 3e087eb2ed7..1a59325ac1c 100644 --- a/gcc/testsuite/gcc.target/i386/pr99881.c +++ b/gcc/testsuite/gcc.target/i386/pr99881.c @@ -1,7 +1,7 @@ /* PR target/99881. */ /* { dg-do compile { target { ! ia32 } } } */ /* { dg-options "-Ofast -march=skylake" } */ -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ void foo (int* __restrict a, int n, int c) -- 2.31.1