https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123225

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> (In reply to Victor Do Nascimento from comment #9)
> > > I wonder if for now (w/o the ability to elide the epilog, w/o the ability
> > > to use first-fault loads) we should restrict this to PGO when we have
> > > a more reliable expected iteration count to work with?  Though as we
> > > do not have a histogram of actual loop iterations an estimated count
> > > of 10 can result from a mix of 1 and 20 loop iterations ...
> > > 
> > > Plus eventually handling loops marked as force_vectorize (we do not
> > > yet have a #pragma users can use, but OMP SIMD marks loops this way).
> > 
> > Yes, I do think that the poor handling of both prologue and epilogue at
> > present severely hurt the usefulness of this approach. As for the prologue,
> > AArch64 targets with SVE can considerably counter the performance hit by
> > implementing masking for alignment.  This, in particular, is something I am
> > working on as a follow up to this work and will be looking to submit once we
> > are back in stage 1.
> 
> Masking for alignment should work for all targets that can use a predicated
> loop, including x86 and risc-v.
> 
> For GCC 16 we can consider adding a new --param so targets could opt to
> disable uncounted loop vectorization altogether.  I somehow had the
> impression that we'd land the code avoiding the scalar epilog re-doing
> the last vector iteration as well, but that didn't materialize.  Without
> that profitability is even worse for high VF.  The alignment prologue
> shouldn't be too bad in practice for not too small loops, it's really
> the epilog where we end up doing things twice that hurts for low iteration
> counts.

Simple cases such as the above can avoid the epilogue quite easily. During
analysis of the loop we just have to determine whether there are any
non-early-break forced IVs.

If there are none, the epilogue isn't needed and the code that forces it can
just be turned off, after which the loop won't be peeled and the exits are fine.

What delayed this is the case where you DO have a live value, for which you
then need to do mask-based reductions, which trigger a bunch of other issues to
deal with.
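
As a rough illustration of the distinction (both functions are hypothetical
and are not the testcase from this PR; whether GCC vectorizes either of them
depends on the target and version), the first loop below leaves nothing live
after the exit, while the second carries a reduction out of the loop:

```c
#include <stddef.h>

/* Hypothetical example: an uncounted loop whose only exit is the
   data-dependent test on *src.  Nothing computed inside the loop is
   used after it, so no value is live across the exit.  */
void
copy_until_zero (unsigned char *restrict dst,
                 const unsigned char *restrict src)
{
  while (*src != 0)
    *dst++ = *src++;
}

/* Hypothetical example: the same loop shape, but `sum' is live after
   the exit.  A vectorized version would have to keep lanes past the
   terminating element out of the final reduction.  */
unsigned
sum_until_zero (const unsigned char *src)
{
  unsigned sum = 0;
  while (*src != 0)
    sum += *src++;
  return sum;
}
```

Only the second shape needs the reduction handling mentioned above.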

So rather than restricting to PGO we could just handle the cases above and
restrict uncounted loops to cases that don't require a forced epilogue.

That way, when I finish the reductions next stage 1, it just works.

The patches for the above are on my work machine, but I won't be back till the
23rd.

If you agree, I can extract them from the series and send them.
