[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 06 Jun 2023 23:54:07 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
The target can now tell the vectorizer to choose a larger VF based on the
cost info it got for the default VF, so the x86 backend could make use of
that.  For example, with the following patch we unroll the vectorized loops
4 times (of course the actual check for small reduction loops and a
register pressure estimate are missing).  That generates

.L4:
        vaddps  (%rax), %zmm1, %zmm1
        vaddps  64(%rax), %zmm2, %zmm2
        addq    $256, %rax
        vaddps  -128(%rax), %zmm0, %zmm0
        vaddps  -64(%rax), %zmm3, %zmm3
        cmpq    %rcx, %rax
        jne     .L4
        movq    %rdx, %rax
        andq    $-64, %rax
        vaddps  %zmm3, %zmm0, %zmm0
        vaddps  %zmm2, %zmm1, %zmm1
        vaddps  %zmm1, %zmm0, %zmm1
... more epilog ...

with -march=znver4 on current trunk.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d4ff56ee8dd..53c09bb9d9c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
                              stmt_vec_info stmt_info, slp_tree node,
                              tree vectype, int misalign,
                              vect_cost_model_location where) override;
+  void finish_cost (const vector_costs *uncast_scalar_costs);
 };

+void
+ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
+{
+  auto *scalar_costs
+    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
+  m_suggested_unroll_factor = 4;
+  vector_costs::finish_cost (scalar_costs);
+}
+
 /* Implement targetm.vectorize.create_costs.  */

 static vector_costs *
