[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

linkw at gcc dot gnu.org via Gcc-bugs Sun, 30 May 2021 23:18:27 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794


--- Comment #9 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > 
> > --- Comment #4 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > (In reply to rguent...@suse.de from comment #3)
> > > On Fri, 28 May 2021, linkw at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
> > > > 
> > > > --- Comment #2 from Kewen Lin <linkw at gcc dot gnu.org> ---
> > > > (In reply to Richard Biener from comment #1)
> > > > 
> > > > Thanks for the comments!
> > > > 
> > > > > There's predictive commoning which can do similar transforms and runs 
> > > > > after
> > > > > vectorization.  It might be it doesn't handle these "simple" cases or 
> > > > > that
> > > > > loop dependence info is not up to the task there.
> > > > > 
> > > > 
> > > > pcom does fix this problem, but it's enabled by default at -O3. Could 
> > > > it be
> > > > considered to be run at O2? Or enabled at O2 at some conditions such 
> > > > as: only
> > > > for one loop which skips loop carried optimization and isn't vectorized
> > > > further?
> > > 
> > > I think pcom should be enabled when vectorization is due to the 
> > > interaction with PRE.  It could be tamed down (it can do 
> > > peeling/unrolling 
> > > which is why it is -O3) based on the vectorizer cost model active
> > > if only implicitely enabled ...   Things will get a bit messy I guess.
> > > 
> > 
> > Good point! I prefer this idea to the one guarding cost model in sccvn 
> > code. 
> > 
> > > > > Another option is to avoid the PRE guard with the (very) cheap cost 
> > > > > model
> > > > > at the expense of not vectorizing affected loops.
> > > > > 
> > > > 
> > > > OK, I will benchmark this to see its impact. For the particular loops in
> > > > 554.roms_r, they can be vectorized at cheap cost model, this bmk got 
> > > > improved
> > > > at cheap cost model on both Power8 and Power9 (a bit though). So I will 
> > > > just
> > > > test the impact on very cheap cost model.
> > > 
> > > So another thing to benchmark would be to enable pcom but make sure
> > > 
> > >   /* Determine the unroll factor, and if the loop should be unrolled, 
> > > ensure
> > >      that its number of iterations is divisible by the factor.  */
> > >   unroll_factor = determine_unroll_factor (chains);
> > >   scev_reset ();
> > >   unroll = (unroll_factor > 1
> > >             && can_unroll_loop_p (loop, unroll_factor, &desc));
> > > 
> > > is false for the cheap and very-cheap cost models unless
> > > flag_predictive_commoning is active.
> > > 
> > 
> > Thanks for the hints! One question: could we just enable non-unroll 
> > version of pcom if it's enabled by flag_tree_loop_vectorize implicitly 
> > without considering vect cost model? Although the very-cheap and cheap 
> > cost model are very likely associated to O2, users still can try dynamic 
> > or unlimited cost model at O2, or very-cheap/cheap cost model at O3, it 
> > seems not good to map cost model onto unroll decision here directly. Or 
> > maybe we check the optimization level? such as:
> > 
> >   virtual bool
> >   gate (function *)
> >   {
> >     if (flag_predictive_commoning != 0)
> >       return true;
> >     if (flag_tree_loop_vectorize)
> >       {
> >         allow_unroll_p = optimize > 2;
> 
> But what about -O2 -ftree-vectorize -fno-predictive-commoning?  IMHO
> we want to check global_options_set and for "implicit" pcom do not
> allow unrolling and for explicitely disabled pcom do not do any
> pcom at all.
> 
> Richard.

OK, the M1 patch followed the suggestion here, not allow unrolling if the pcom
is enabled implicitly by loop vect.

As the evaluation result shows, the way with implied pcom (M1) looks better.
For  very cheap cost model, it can make wrf a bit better. For cheap cost model,
the loop carried dependence (M2) did affect some bmks (especially wrf -9%). The
build time increment with M1 looks fine too.

Btw, commenting out two places with "update_ssa
(TODO_update_ssa_only_virtuals)", bootstrapped with the default way and
--with-build-config=bootstrap-O3 way. Will check with regression testing and
SPEC2017 build, but I guess it's another independent patch.

[Bug tree-optimization/100794] suboptimal code due to missing pre2 when vectorization fails

Reply via email to