https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 7 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
>
> --- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #13)
> > The target now has the ability to tell the vectorizer to choose a larger VF
> > based on the cost info it got for the default VF, so the x86 backend could
> > make use of that.  For example, with the following patch we'll unroll the
> > vectorized loops 4 times (of course the actual check for small reduction
> > loops and a register pressure estimate is missing).  That generates
> >
> > .L4:
> >         vaddps  (%rax), %zmm1, %zmm1
> >         vaddps  64(%rax), %zmm2, %zmm2
> >         addq    $256, %rax
> >         vaddps  -128(%rax), %zmm0, %zmm0
> >         vaddps  -64(%rax), %zmm3, %zmm3
> >         cmpq    %rcx, %rax
> >         jne     .L4
> >         movq    %rdx, %rax
> >         andq    $-64, %rax
> >         vaddps  %zmm3, %zmm0, %zmm0
> >         vaddps  %zmm2, %zmm1, %zmm1
> >         vaddps  %zmm1, %zmm0, %zmm1
> >         ... more epilog ...
> >
> > with -march=znver4 on current trunk.
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..53c09bb9d9c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
> >                               stmt_vec_info stmt_info, slp_tree node,
> >                               tree vectype, int misalign,
> >                               vect_cost_model_location where) override;
> > +  void finish_cost (const vector_costs *uncast_scalar_costs);
> >  };
> >
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> > +{
> > +  auto *scalar_costs
> > +    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
> > +  m_suggested_unroll_factor = 4;
> > +  vector_costs::finish_cost (scalar_costs);
>
> I remember we have posted a patch for that:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html
>
> One regression observed is that the VF of the epilog loop increases
> (from xmm to ymm) after unrolling the vectorized loop, and that regressed
> performance for lower-tripcount loops (similar to -mprefer-vector-width=512).

Ah, yeah.  We could resort to checking estimated_number_of_iterations
to guide us with profile feedback.  I'm also (again) working on fully
masked epilogues, which should reduce the impact on low-trip-count loops.

> Also, for the case in the PR, I'm trying to enable
> -fvariable-expansion-in-unroller when -funroll-loops, and the partial sums
> will break the reduction chain.

Probably also a good idea - maybe -fvariable-expansion-in-unroller can be
made smarter and guided by register pressure?
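
Regarding the estimated_number_of_iterations idea above, a rough, untested
sketch of how the hook from comment #13 could be gated on the iteration
estimate.  estimated_loop_iterations_int is my guess at the helper meant by
"estimated_number_of_iterations", and the "unknown or at least 4 * VF"
condition is an arbitrary placeholder, not a worked-out heuristic:

/* Sketch: only suggest unrolling the vectorized loop when the (profile
   or static) iteration estimate is unknown or large enough that the
   higher effective VF does not push most of the work into the epilogue.  */
void
ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
{
  auto *scalar_costs
    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
    {
      class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
      /* -1 means no estimate is available.  */
      HOST_WIDE_INT est = estimated_loop_iterations_int (loop);
      /* On x86 the vectorization factor is a compile-time constant.  */
      unsigned HOST_WIDE_INT vf
	= LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ();
      if (est == -1 || (unsigned HOST_WIDE_INT) est >= 4 * vf)
	m_suggested_unroll_factor = 4;
    }
  vector_costs::finish_cost (scalar_costs);
}

With profile feedback the estimate would then keep low-trip-count loops at
the default VF while still unrolling the hot, long-running reductions.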
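
For reference, what the -fvariable-expansion-in-unroller point amounts to, in
source form rather than what the RTL unroller literally emits (the unroll
factor of 4 and the n % 4 == 0 assumption are just for brevity):

/* Original reduction: a single accumulator, so every add depends on the
   previous one.  */
float
sum (const float *a, int n)
{
  float s = 0.f;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Roughly what unrolling by 4 with variable expansion produces: four
   independent partial sums combined after the loop, which breaks the
   serial dependence chain (for FP this relies on reassociation being
   allowed).  */
float
sum_expanded (const float *a, int n)
{
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  for (int i = 0; i < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  return (s0 + s1) + (s2 + s3);
}

Each partial sum lives in its own register for the whole loop, which is why
a register pressure estimate would help decide how far to expand.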