Hi Richard,

Thanks for the review.
> On 15 Oct 2025, at 10:39 pm, Richard Biener <[email protected]> wrote:
>
> External email: Use caution opening links or attachments
>
>
> On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah
> <[email protected]> wrote:
>>
>> Hi,
>>
>> This patch eliminates redundant reverse permutations in vectorized reverse
>> loops by detecting and optimizing patterns during store vectorization.
>>
>> The reverse load (b[i]) generates a PERM, the operations are applied, then
>> the reverse store adds another PERM.  This creates redundant permute pairs
>> that we now detect and eliminate.
>>
>> With the patch, the example loop
>>   for (int i = N - 1; i >= 0; i--)
>>     {
>>       a[i] = b[i] + 1.0f;
>>     }
>> changes as follows:
>> -        ldr     q29, [x0, x2]
>> -        tbl     v29.16b, {v29.16b}, v31.16b
>> -        fadd    v29.4s, v29.4s, v30.4s
>> -        tbl     v29.16b, {v29.16b}, v31.16b
>> -        str     q29, [x3, x2]
>> +        ldr     q30, [x0, x2]
>> +        fadd    v30.4s, v30.4s, v31.4s
>> +        str     q30, [x3, x2]
>
> So this works basically as a post-processing optimization at the time
> we generate the vector store.  While that's in principle an OK
> optimization, I'd rather have such post-processing implemented outside
> of the vectorizer, because then permutes not originating from the
> vectorizer's permuted store generation would also benefit.
>
> As for implementing this in the vectorizer itself, the more appropriate
> thing would be to expose these permutes to the permute optimization
> phase, because then it can also be taken into account during costing,
> and a reverse load permute could be elided if it feeds an associatable
> reduction.
>
> There is, unfortunately, currently no good way to represent how we
> implement negative strided contiguous accesses with load permutations,
> as the peculiarity only exposes itself after applying the VF, and
> load/lane permutations are represented on the VF == 1 SLP graph.  One
> of my ideas was that once we settle on the VF (and possibly vector
> types) we want to expand the SLP graph to cover all lanes of the vector
> loop, so we can expose actual permutes at vector granularity.  This is
> a bit far off though.
>
> So in line with your patch, but more appropriate for in-vectorizer
> operation, would be an analysis on the SLP graph that simply marks
> reverse permutes that can be elided (for the back-to-back case).  This
> way both costing and code generation can take this into account and you
> wouldn't have to adjust any stmts.

I have now changed it to account for the costing; a small standalone sketch of
the back-to-back cancellation the analysis marks is included after the quoted
thread below.

Bootstrapped and regression tested on aarch64-linux-gnu. Is this OK?

Thanks,
Kugan

>
> Thanks,
> Richard.
>
>> PR tree-optimization/61338
>>
>> gcc/ChangeLog:
>>        (get_vector_perm_operand): New.
>>        (vect_find_reverse_permute_operand): New helper function
>>        to find reverse permutations through element-wise operation chains.
>>        Returns true only if ALL operands have reverse permutations.
>>        (vectorizable_store): Use recursive helper to eliminate redundant
>>        reverse permutations with configurable search depth.
>>
>> gcc/testsuite/ChangeLog:
>>
>>        * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic
>>        reverse permute optimization (simple copy).
>>        * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for
>>        basic pattern.
>>
>> Signed-off-by: Kugan Vivekanandarajah <[email protected]>
>>
>> Bootstrapped and regression tested on aarch64-linux-gnu. Is this OK?
>>
>> Thanks,
>> Kugan
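For reference, a minimal standalone sketch of the identity the elision relies
on: a reverse permute feeding purely element-wise operations feeding another
reverse permute is equivalent to dropping both permutes.  It uses GCC's
generic vector extensions; the fixed 4 x float width and the explicit reverse
mask here are only illustrative and not taken from the patch.

  #include <assert.h>

  typedef float v4sf __attribute__ ((vector_size (16)));
  typedef int v4si __attribute__ ((vector_size (16)));

  int
  main (void)
  {
    v4sf b   = { 1.0f, 2.0f, 3.0f, 4.0f };
    v4sf one = { 1.0f, 1.0f, 1.0f, 1.0f };
    v4si rev = { 3, 2, 1, 0 };

    /* What is generated today for a[i] = b[i] + 1.0f in a reverse loop:
       reverse-permute the load, do the element-wise op, reverse-permute
       again for the store (the two tbl instructions in the diff above).  */
    v4sf loaded = __builtin_shuffle (b, rev);
    v4sf summed = loaded + one;
    v4sf stored = __builtin_shuffle (summed, rev);

    /* The two permutes sit back to back around an element-wise operation,
       so they cancel and the permute-free form is equivalent.  */
    v4sf direct = b + one;

    for (int i = 0; i < 4; i++)
      assert (stored[i] == direct[i]);

    return 0;
  }

The SLP-graph analysis only has to mark exactly this back-to-back pattern so
that both the permute cost and the permute code generation can be skipped.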
Attachment: 0001-PATCH-tree-optimization-61338-Optimize-redundant-rev.patch
