[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

crazylht at gmail dot com via Gcc-bugs Wed, 07 Jun 2023 00:25:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414


--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #13)
> The target now has the ability to tell the vectorizer to choose a larger VF
> based on the cost info it got for the default VF, so the x86 backend could
> make use of that.  For example with the following patch we'll unroll the
> vectorized loops 4 times (of course the actual check for small reduction
> loops and a register pressure estimate is missing).  That generates
> 
> .L4:
>         vaddps  (%rax), %zmm1, %zmm1
>         vaddps  64(%rax), %zmm2, %zmm2
>         addq    $256, %rax
>         vaddps  -128(%rax), %zmm0, %zmm0
>         vaddps  -64(%rax), %zmm3, %zmm3
>         cmpq    %rcx, %rax
>         jne     .L4
>         movq    %rdx, %rax
>         andq    $-64, %rax
>         vaddps  %zmm3, %zmm0, %zmm0
>         vaddps  %zmm2, %zmm1, %zmm1
>         vaddps  %zmm1, %zmm0, %zmm1
> ... more epilog ...
> 
> with -march=znver4 on current trunk.
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..53c09bb9d9c 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
>                               stmt_vec_info stmt_info, slp_tree node,
>                               tree vectype, int misalign,
>                               vect_cost_model_location where) override;
> +  void finish_cost (const vector_costs *uncast_scalar_costs);
>  };
>  
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> +{
> +  auto *scalar_costs
> +    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
> +  m_suggested_unroll_factor = 4;
> +  vector_costs::finish_cost (scalar_costs);

I remember we posted a patch for that:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html

One regression we observed is that the VF of the epilog loop increases (from
xmm to ymm) after unrolling the vectorized loop, which regresses performance
for low-trip-count loops (similar to -mprefer-vector-width=512).
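
To make the trip-count arithmetic concrete, here is a rough sketch of how a
given number of floats splits across the main vector loop, a single vector
epilog, and a scalar tail. The helper name split_trip_count, the struct, and
the single-epilog-plus-scalar-tail model are assumptions for illustration
only; they do not describe the code GCC actually emits, which may use further
or masked epilogs.

```c
#include <stddef.h>

/* Hypothetical model (an assumption, not GCC's real loop structure):
   a main vector loop, one vector epilog, then a scalar tail. */
struct trip_split {
    size_t main_iters;    /* iterations of the (possibly unrolled) main loop */
    size_t epilog_iters;  /* iterations of the vector epilog loop */
    size_t scalar_iters;  /* elements left for the scalar tail */
};

/* main_step and epilog_step are the number of elements consumed per
   iteration of the main loop and of the vector epilog respectively. */
struct trip_split split_trip_count(size_t n, size_t main_step, size_t epilog_step)
{
    struct trip_split s;
    size_t rest = n % main_step;        /* elements the main loop cannot cover */

    s.main_iters = n / main_step;
    s.epilog_iters = rest / epilog_step;
    s.scalar_iters = rest % epilog_step;
    return s;
}
```

Under these assumed numbers, 23 floats with an 8-float main loop and a 4-float
(xmm) epilog give 2 main iterations, 1 epilog iteration and 3 scalar elements.
With a 32-float main loop (8 floats unrolled 4 times) and an 8-float (ymm)
epilog, the same 23 floats give 0 main iterations, 2 epilog iterations and 7
scalar elements.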

Also, for the case in this PR, I'm trying to enable
-fvariable-expansion-in-unroller when -funroll-loops is used, and the partial
sums will break the reduction chain.
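
As a source-level illustration of what partial sums do to a reduction, the
sketch below contrasts a single-accumulator sum with a hand-written version
using four accumulators. The function names sum_single and sum_partial4 are
made up for this example, and the second function only approximates by hand
the kind of transformation meant here; it is not the output of any GCC pass.

```c
#include <stddef.h>

/* Single accumulator: each add depends on the result of the previous
   one, so the adds form one serial dependency chain. */
float sum_single(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four partial sums (hand-written sketch): the four adds in one
   iteration are independent of each other and can overlap. */
float sum_partial4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the leftover elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

The two functions add the elements in a different order, so for general
floating-point input they can round differently; a compiler may only make
this change itself when reassociation of floating-point adds is permitted.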
