Hi Christoph, > -----Original Message----- > From: Christoph Müllner <christoph.muell...@vrull.eu> > Sent: Tuesday, October 15, 2024 3:57 PM > To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.toms...@vrull.eu>; Tamar > Christina <tamar.christ...@arm.com>; Richard Biener <rguent...@suse.de> > Cc: Jeff Law <jeffreya...@gmail.com>; Robin Dapp <rd...@ventanamicro.com>; > Christoph Müllner <christoph.muell...@vrull.eu> > Subject: [PATCH 1/2] Reduce lane utilization in VEC_PERM_EXPRs for > two_operator nodes > > When two_operator SLP nodes are built, the VEC_PERM_EXPR that merges the > result > selects a lane only based on the operator found. If the input nodes have > duplicate elements, there may be more than one way to choose. This commit > changes the policy to reuse an existing lane with the result that we can > possibly free up lanes entirely. > > For example, given two vectors with duplicates: > A = {a1, a1, a2, a2} > B = {b1, b1, b3, b2} > > A two_operator node with operators +, -, +, - is currently built as: > RES = VEC_PERM_EXPR<A, B>(0, 5, 2, 7) > With this patch, the permutation becomes: > RES = VEC_PERM_EXPR<A, B>(0, 4, 2, 6) > Lanes 0 and 2 are reused and lanes 1 and 3 are not utilized anymore. > > The direct effect of this change can be seen in the AArch64 test case, > where the simpler permutation allows to lower to a TRN1 instead of an > expensive TBL.
That makes sense, So this is trying to make the permutes EVEN/EVEN or ODD/ODD if possible. Just wondering if.... > > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran). > > Manolis Tsamis was the patch's initial author before I took it over. > > gcc/ChangeLog: > > * tree-vect-slp.cc: Reduce lane utilization in VEC_PERM_EXPRs for > two_operators > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/slp-perm-13.c: New test. > * gcc.target/aarch64/sve/slp-perm-13.c: New test. > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu> > --- > gcc/testsuite/gcc.dg/vect/slp-perm-13.c | 29 +++++++++++++++++++ > .../gcc.target/aarch64/sve/slp-perm-13.c | 4 +++ > gcc/tree-vect-slp.cc | 21 +++++++++++++- > 3 files changed, 53 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-13.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-13.c > b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c > new file mode 100644 > index 00000000000..08639e72fb0 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c > @@ -0,0 +1,29 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-O3 -fdump-tree-slp2-details" } */ > + > +#define LOAD_VEC(e0, e1, e2, e3, p) \ > + int e0 = p[0]; \ > + int e1 = p[1]; \ > + int e2 = p[2]; \ > + int e3 = p[3]; > + > +#define STORE_VEC(p, e0, e1, e2, e3) \ > + p[0] = e0; \ > + p[1] = e1; \ > + p[2] = e2; \ > + p[3] = e3; > + > +void > +foo (int *p) > +{ > + LOAD_VEC(s0, s1, s2, s3, p); > + > + int t0 = s0 + s1; > + int t1 = s0 - s1; > + int t2 = s2 + s3; > + int t3 = s2 - s3; > + > + STORE_VEC(p, t0, t1, t2, t3); > +} > + > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 0, 4, 2, 6 }" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c > b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c > new file mode 100644 > index 00000000000..f5839f273e5 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c > @@ -0,0 +1,4 @@ > +#include "../../../gcc.dg/vect/slp-perm-13.c" > + > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */ > + > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 16332e0b6d7..8794c94ef90 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -2921,7 +2921,26 @@ fail: > gassign *ostmt = as_a <gassign *> (ostmt_info->stmt); > if (gimple_assign_rhs_code (ostmt) != code0) > { > - SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1, > i)); > + /* If the current element can be found in another lane that has > + been used previously then use that one instead. This can > + happen when the ONE and TWO contain duplicate elements and > + reduces the number of 'active' lanes. */ > + int idx = i; > + for (int alt_idx = (int) i - 1; alt_idx >= 0; alt_idx--) > + { > + gassign *alt_stmt = as_a <gassign *> (stmts[alt_idx]->stmt); > + if (gimple_assign_rhs_code (alt_stmt) == code0 > + && gimple_assign_rhs1 (ostmt) > + == gimple_assign_rhs1 (alt_stmt) > + && gimple_assign_rhs2 (ostmt) > + == gimple_assign_rhs2 (alt_stmt)) > + { .. we may want operand_equal_p (ostmt, alt_stmt, 0) etc here. This looks good to me though but can't approve. Lets see what Richi thinks. Cheers, Tamar > + idx = alt_idx; > + break; > + } > + } > + SLP_TREE_LANE_PERMUTATION (node) > + .safe_push (std::make_pair (1, idx)); > ocode = gimple_assign_rhs_code (ostmt); > j = i; > } > -- > 2.46.0