RE: [PATCH][tree-optimization/61338] - Optimize redundant reverse permutations in vectorized stores

Tamar Christina Wed, 22 Oct 2025 02:33:19 -0700

> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: 22 October 2025 09:53
> To: Kugan Vivekanandarajah <[email protected]>
> Cc: [email protected]; Tamar Christina <[email protected]>
> Subject: Re: [PATCH][tree-optimization/61338] - Optimize redundant reverse
> permutations in vectorized stores
> 
> On Tue, Oct 21, 2025 at 11:57 PM Kugan Vivekanandarajah
> <[email protected]> wrote:
> >
> > Hi Richard,
> >
> > Thanks for the review.
> >
> > > On 15 Oct 2025, at 10:39 pm, Richard Biener <[email protected]> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah
> > > <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> This patch eliminates redundant reverse permutations in vectorized reverse
> > >> loops by detecting and optimizing patterns during store vectorization.
> > >>
> > >> The reverse load (b[i]) generates a PERM, operations are applied, and then
> > >> the reverse store adds another PERM. This creates redundant permute pairs
> > >> that we now detect and eliminate.
> > >>
> > >> With the patch, for the example loop
> > >>  for (int i = N - 1; i >= 0; i--)
> > >>    {
> > >>      a[i] = b[i] + 1.0f;
> > >>    }
> > >> the generated code changes as follows:
> > >> -       ldr     q29, [x0, x2]
> > >> -       tbl     v29.16b, {v29.16b}, v31.16b
> > >> -       fadd    v29.4s, v29.4s, v30.4s
> > >> -       tbl     v29.16b, {v29.16b}, v31.16b
> > >> -       str     q29, [x3, x2]
> > >> +       ldr     q30, [x0, x2]
> > >> +       fadd    v30.4s, v30.4s, v31.4s
> > >> +       str     q30, [x3, x2]
> > >
> > > So this works basically as a post-processing optimization at the time we
> > > generate the vector store.  While that's in principle an OK optimization,
> > > I'd rather have such post-processing implemented outside of the vectorizer,
> > > because then permutes not originating from vectorizer permuted store
> > > generation would also benefit.
> > >
> > > As for implementing this in the vectorizer itself, the more appropriate
> > > thing would be to expose these permutes to the permute optimization phase,
> > > because then it can also be taken into account during costing, and a
> > > reverse load permute could be elided if it feeds an associatable reduction.
> > >
> > > There is, unfortunately, currently no good way to represent how we
> > > implement negative strided contiguous accesses with load permutations, as
> > > the peculiarity only exposes itself after applying the VF, and load/lane
> > > permutations are represented on the VF == 1 SLP graph.  One of my ideas was
> > > that once we settle on VF (and possibly vector types) we want to expand the
> > > SLP graph to cover all lanes of the vector loop so we can expose actual
> > > permutes at vector granularity.  This is a bit far off though.
> > >
> > > So in line with your patch, but more appropriate for in-vectorizer
> > > operation, would be an analysis on the SLP graph that simply marks reverse
> > > permutes that can be elided (for the back-to-back case).  This way both
> > > costing and code generation can take this into account and you wouldn't
> > > have to adjust any stmts.
> >
> > I have now changed it to account for the costing. Bootstrapped and
> > regression tested on aarch64-linux-gnu.
> >
> > Is this OK?
> 
> Same here?
>


Did you send the right version of the patch, Kugan? It's identical to the one
you sent before. It also has some changes in gcc/fortran/resolve.cc that aren't
mentioned, and your ChangeLog seems to have an incorrect format: the files
containing what you changed aren't listed.

Thanks,
Tamar

> > Thanks,
> > Kugan
> >
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > >>        PR tree-optimization/61338
> > >>
> > >> gcc/ChangeLog:
> > >>        (get_vector_perm_operand): New.
> > >>        (vect_find_reverse_permute_operand): New helper function
> > >>        to find reverse permutations through element-wise operation
> > >>        chains.  Returns true only if ALL operands have reverse
> > >>        permutations.
> > >>        (vectorizable_store): Use recursive helper to eliminate redundant
> > >>        reverse permutations with configurable search depth.
> > >>
> > >> gcc/testsuite/ChangeLog:
> > >>
> > >>        * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic
> > >>        reverse permute optimization (simple copy).
> > >>        * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for
> > >>        basic pattern.
> > >> Signed-off-by: Kugan Vivekanandarajah <[email protected]>
> > >>
> > >> Bootstrapped and regression tested on aarch64-linux-gnu. Is this OK?
> > >>
> > >> Thanks,
> > >> Kugan
> >
> >
