https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 7 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
>
> --- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #13)
> > The target now has the ability to tell the vectorizer to choose a larger VF
> > based on the cost info it got for the default VF, so the x86 backend could
> > make use of that.  For example, with the following patch we'll unroll the
> > vectorized loops 4 times (of course the actual check for small reduction
> > loops and a register pressure estimate is missing).  That generates
> >
> > .L4:
> >         vaddps  (%rax), %zmm1, %zmm1
> >         vaddps  64(%rax), %zmm2, %zmm2
> >         addq    $256, %rax
> >         vaddps  -128(%rax), %zmm0, %zmm0
> >         vaddps  -64(%rax), %zmm3, %zmm3
> >         cmpq    %rcx, %rax
> >         jne     .L4
> >         movq    %rdx, %rax
> >         andq    $-64, %rax
> >         vaddps  %zmm3, %zmm0, %zmm0
> >         vaddps  %zmm2, %zmm1, %zmm1
> >         vaddps  %zmm1, %zmm0, %zmm1
> >         ... more epilog ...
> >
> > with -march=znver4 on current trunk.
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..53c09bb9d9c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
> >                               stmt_vec_info stmt_info, slp_tree node,
> >                               tree vectype, int misalign,
> >                               vect_cost_model_location where) override;
> > +  void finish_cost (const vector_costs *uncast_scalar_costs);
> >  };
> >
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> > +{
> > +  auto *scalar_costs
> > +    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
> > +  m_suggested_unroll_factor = 4;
> > +  vector_costs::finish_cost (scalar_costs);
>
> I remember we have posted a patch for that:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html
>
> One regression observed is that the VF of the epilog loop increases
> (from xmm to ymm) after unrolling the vectorized loop, and that regressed
> performance for lower-tripcount loops (similar to -mprefer-vector-width=512).

Ah, yeah.  We could resort to checking estimated_number_of_iterations
to guide us with profile feedback.  I'm also (again) working on fully
masked epilogues, which should reduce the impact on low-trip-count loops.

> Also, for the case in the PR, I'm trying to enable
> -fvariable-expansion-in-unroller when -funroll-loops, and the partial sums
> will break the reduction chain.

Probably also a good idea - maybe -fvariable-expansion-in-unroller can be
made smarter and guided by register pressure?
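
Regarding the estimated_number_of_iterations idea above, a rough, untested
sketch of how the hook from comment #13 could be gated on the iteration
estimate.  estimated_loop_iterations_int is my guess at the helper meant by
"estimated_number_of_iterations", and the "unknown or at least 4 * VF"
condition is an arbitrary placeholder, not a worked-out heuristic:

/* Sketch: only suggest unrolling the vectorized loop when the (profile
   or static) iteration estimate is unknown or large enough that the
   higher effective VF does not push most of the work into the epilogue.  */
void
ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
{
  auto *scalar_costs
    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
    {
      class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
      /* -1 means no estimate is available.  */
      HOST_WIDE_INT est = estimated_loop_iterations_int (loop);
      /* On x86 the vectorization factor is a compile-time constant.  */
      unsigned HOST_WIDE_INT vf
	= LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
      if (est == -1 || (unsigned HOST_WIDE_INT) est >= 4 * vf)
	m_suggested_unroll_factor = 4;
    }
  vector_costs::finish_cost (scalar_costs);
}

With profile feedback the estimate would then keep low-trip-count loops at
the default VF while still unrolling the hot, long-running reductions.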
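
For reference, what the -fvariable-expansion-in-unroller point amounts to, in
source form rather than what the RTL unroller literally emits (the unroll
factor of 4 and the n % 4 == 0 assumption are just for brevity):

/* Original reduction: a single accumulator, so every add depends on the
   previous one.  */
float
sum (const float *a, int n)
{
  float s = 0.f;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Roughly what unrolling by 4 with variable expansion produces: four
   independent partial sums combined after the loop, which breaks the
   serial dependence chain (for FP this relies on reassociation being
   allowed).  */
float
sum_expanded (const float *a, int n)
{
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  for (int i = 0; i < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  return (s0 + s1) + (s2 + s3);
}

Each partial sum lives in its own register for the whole loop, which is why
a register pressure estimate would help decide how far to expand.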