> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Friday, October 18, 2024 11:03 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Christoph Müllner <christoph.muell...@vrull.eu>; gcc-patches@gcc.gnu.org;
> Philipp Tomsich <philipp.toms...@vrull.eu>; Jeff Law <jeffreya...@gmail.com>;
> Robin Dapp <rd...@ventanamicro.com>
> Subject: RE: [PATCH 2/2] Add a new permute optimization step in SLP
>
> On Thu, 17 Oct 2024, Tamar Christina wrote:
>
> > Hi Christoph,
> >
> > > -----Original Message-----
> > > From: Christoph Müllner <christoph.muell...@vrull.eu>
> > > Sent: Tuesday, October 15, 2024 3:57 PM
> > > To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.toms...@vrull.eu>;
> > > Tamar Christina <tamar.christ...@arm.com>; Richard Biener <rguent...@suse.de>
> > > Cc: Jeff Law <jeffreya...@gmail.com>; Robin Dapp <rd...@ventanamicro.com>;
> > > Christoph Müllner <christoph.muell...@vrull.eu>
> > > Subject: [PATCH 2/2] Add a new permute optimization step in SLP
> > >
> > > This commit adds a new permute optimization step after running SLP
> > > vectorization.
> > > Although there are existing places where individual or nested permutes
> > > can be optimized, there are cases where independent permutes can be
> > > optimized, which cannot be expressed in the current pattern matching
> > > framework.  The optimization step is run at the end so that permutes
> > > from completely different SLP builds can be optimized.
> > >
> > > The initial optimizations implemented can detect some cases where
> > > different "select permutes" (permutes that only use some of the
> > > incoming vector lanes) can be co-located in a single permute.  This
> > > can optimize some cases where two_operator SLP nodes have duplicate
> > > elements.
> > >
> > > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> > >
> > > Manolis Tsamis was the patch's initial author before I took it over.
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-vect-slp.cc (get_tree_def): Return the definition of a name.
> > > 	(recognise_perm_binop_perm_pattern): Helper function.
> > > 	(vect_slp_optimize_permutes): New permute optimization step.
> > > 	(vect_slp_function): Run the new permute optimization step.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	* gcc.dg/vect/slp-perm-14.c: New test.
> > > 	* gcc.target/aarch64/sve/slp-perm-14.c: New test.
> > >
> > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > > ---
> > >  gcc/testsuite/gcc.dg/vect/slp-perm-14.c       |  42 +++
> > >  .../gcc.target/aarch64/sve/slp-perm-14.c      |   3 +
> > >  gcc/tree-vect-slp.cc                          | 248 ++++++++++++++++++
> > >  3 files changed, 293 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-14.c b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > new file mode 100644
> > > index 00000000000..f56e3982a62
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > @@ -0,0 +1,42 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3 -fdump-tree-slp1-details" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > > +  int t0 = s0 + s1;\
> > > +  int t1 = s0 - s1;\
> > > +  int t2 = s2 + s3;\
> > > +  int t3 = s2 - s3;\
> > > +  d0 = t0 + t2;\
> > > +  d1 = t1 + t3;\
> > > +  d2 = t0 - t2;\
> > > +  d3 = t1 - t3;\
> > > +}
> > > +
> > > +int
> > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> > > +{
> > > +  uint32_t tmp[4][4];
> > > +  uint32_t a0, a1, a2, a3;
> > > +  int sum = 0;
> > > +
> > > +  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> > > +    {
> > > +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > > +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > > +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > > +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > > +      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> > > +    }
> > > +
> > > +  for (int i = 0; i < 4; i++)
> > > +    {
> > > +      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> > > +      sum += a0 + a1 + a2 + a3;
> > > +    }
> > > +
> > > +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "slp1" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > new file mode 100644
> > > index 00000000000..4e0d5175be8
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > @@ -0,0 +1,3 @@
> > > +#include "../../../gcc.dg/vect/slp-perm-14.c"
> > > +
> > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > > index 8794c94ef90..4bf5ccb9cdf 100644
> > > --- a/gcc/tree-vect-slp.cc
> > > +++ b/gcc/tree-vect-slp.cc
> > > @@ -9478,6 +9478,252 @@ vect_slp_if_converted_bb (basic_block bb, loop_p orig_loop)
> > >    return vect_slp_bbs (bbs, orig_loop);
> > >  }
> > >
> > > +/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
> > > +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
> > > +
> > > +static gassign *
> > > +get_tree_def (tree name, bool single_use_only)
> > > +{
> > > +  if (TREE_CODE (name) != SSA_NAME)
> > > +    return NULL;
> > > +
> > > +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> > > +
> > > +  if (single_use_only && !has_single_use (name))
> > > +    return NULL;
> > > +
> > > +  if (!is_gimple_assign (def_stmt))
> > > +    return NULL;
> >
> > Probably cheaper to test this before the single_use.  But..
> > > +
> > > +  return dyn_cast <gassign *> (def_stmt);
> >
> > Why not just do the dyn_cast and assign that to a result and check
> > if the dyn_cast succeeded.  Saves you the second check of the type.
> >
> > > +}
> > > +
> > > +/* Helper function for vect_slp_optimize_permutes.  Return true if STMT is an
> > > +   expression of the form:
> > > +
> > > +     src1_perm = VEC_PERM_EXPR <SRC1, SRC1, CONST_VEC>
> > > +     src2_perm = VEC_PERM_EXPR <SRC2, SRC2, CONST_VEC>
> > > +     bop1 = src1_perm BINOP1 src2_perm
> > > +     bop2 = src1_perm BINOP2 src2_perm
> > > +     STMT = VEC_PERM_EXPR <bop1, bop2, CONST_VEC>
> > > +
> > > +   and src1_perm, src2_perm, bop1, bop2 are not used outside of STMT.
> > > +   Return the first two permute statements and the binops through the
> > > +   corresponding pointer arguments.  */
> > > +
> > > +static bool
> > > +recognise_perm_binop_perm_pattern (gassign *stmt,
> > > +				   gassign **bop1_out, gassign **bop2_out,
> > > +				   gassign **perm1_out, gassign **perm2_out)
> > > +{
> > > +  if (gimple_assign_rhs_code (stmt) != VEC_PERM_EXPR)
> > > +    return false;
> > > +
> > > +  gassign *bop1, *bop2;
> > > +  if (!(bop1 = get_tree_def (gimple_assign_rhs1 (stmt), true))
> > > +      || !(bop2 = get_tree_def (gimple_assign_rhs2 (stmt), true))
> > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (stmt)).is_constant ()
> > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop1)) != tcc_binary
> > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop2)) != tcc_binary)
> > > +    return false;
> > > +
> > > +  if (gimple_assign_rhs1 (bop1) != gimple_assign_rhs1 (bop2)
> > > +      || gimple_assign_rhs2 (bop1) != gimple_assign_rhs2 (bop2)
> > > +      || bop1 == bop2)
> > > +    return false;
> > > +
> > > +  tree src1_perm = gimple_assign_rhs1 (bop1);
> > > +  tree src2_perm = gimple_assign_rhs2 (bop1);
> > > +
> > > +  gassign *perm1, *perm2;
> > > +  if (!(perm1 = get_tree_def (src1_perm, false))
> > > +      || !(perm2 = get_tree_def (src2_perm, false))
> > > +      || num_imm_uses (src1_perm) != 2
> > > +      || num_imm_uses (src2_perm) != 2
> > > +      || gimple_assign_rhs_code (perm1) != VEC_PERM_EXPR
> > > +      || gimple_assign_rhs_code (perm2) != VEC_PERM_EXPR
> > > +      || gimple_assign_rhs1 (perm1) != gimple_assign_rhs2 (perm1)
> > > +      || gimple_assign_rhs1 (perm2) != gimple_assign_rhs2 (perm2)
> > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm1)).is_constant ()
> > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm2)).is_constant ())
> > > +    return false;
> > > +
> > > +  *bop1_out = bop1;
> > > +  *bop2_out = bop2;
> > > +  *perm1_out = perm1;
> > > +  *perm2_out = perm2;
> > > +
> > > +  return true;
> > > +}
> > > +
> > > +typedef hash_map<int_hash<unsigned HOST_WIDE_INT, -1, -1>,
> > > +		 unsigned HOST_WIDE_INT> lane_map;
> > > +
> > > +/* Iterate over the basic blocks of FUN and try to optimize VEC_PERM
> > > +   expressions.  This is done at the end of vectorization so it can optimize
> > > +   expressions that are part of multiple different vectorized blocks/loops.  */
> > > +
> > > +static void
> > > +vect_slp_optimize_permutes (function *fun)
> > > +{
> > > +  basic_block bb;
> > > +  FOR_EACH_BB_FN (bb, fun)
> > > +    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
> > > +      {
> > > +	gimple_stmt_iterator gsi_stmt1 = gsi;
> > > +	gassign *stmt1 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > +	gsi_next (&gsi);
> >
> > gsi_next_nondebug (&gsi);
> >
> > to avoid compare-debug issues.
> >
> > It seems quite limited that you restrict this to adjacent stmts.
> >
> > > +
> > > +	if (gsi_end_p (gsi))
> > > +	  break;
> > > +
> > > +	gassign *stmt2 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > +
> > > +	if (!stmt1 || !stmt2)
> > > +	  continue;
> > > +
> > > +	/* Try to identify select permutes which utilize only part of a
> > > +	   vector and merge two of them into one.  This case can arise from
> > > +	   TWO_OPERATOR SLP patterns because the final permute uses only half
> > > +	   of each input vector.  */
> > > +	gassign *bop1_1, *bop1_2, *bop2_1, *bop2_2;
> > > +	gassign *src1_1, *src1_2, *src2_1, *src2_2;
> > > +
> > > +	if (!recognise_perm_binop_perm_pattern (stmt1, &bop1_1, &bop1_2,
> > > +						&src1_1, &src1_2))
> > > +	  continue;
> > > +
> > > +	if (!recognise_perm_binop_perm_pattern (stmt2, &bop2_1, &bop2_2,
> > > +						&src2_1, &src2_2))
> > > +	  continue;
> > > +
> > > +	if (gimple_assign_rhs_code (bop1_1) != gimple_assign_rhs_code (bop2_1))
> > > +	  continue;
> > > +
> > > +	if (gimple_assign_rhs_code (bop1_2) != gimple_assign_rhs_code (bop2_2))
> > > +	  continue;
>
> Hmm, up to here we've only verified both "binop_perm" are using the
> same binop operation code but we've not related their SRC1 or SRC2?
>
> From 1/2 in the series I guess we're trying to see whether we can
> re-use "unused lanes" here?
>
> > > +
> > > +	tree mask1 = gimple_assign_rhs3 (stmt1);
> > > +	tree mask2 = gimple_assign_rhs3 (stmt2);
> > > +
> > > +	/* Calculate the lanes that the first permute needs.  */
> > > +	lane_map mask1_lanes;
> > > +	unsigned HOST_WIDE_INT i;
> > > +	unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (mask1).to_constant ();
> > > +	for (i = 0; i < nelts; i++)
> > > +	  {
> > > +	    tree val = VECTOR_CST_ELT (mask1, i);
> > > +	    gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > +	    unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > +	    mask1_lanes.put (j, j);
> > > +	  }
> > > +
> > > +	/* Calculate the lanes that the second permute needs and rewrite
> > > +	   them to use the lanes that are unused from the first permute.  */
> > > +	lane_map mask2_lanes;
> > > +	unsigned HOST_WIDE_INT lane_gen = 0;
> > > +	for (i = 0; i < nelts; i++)
> > > +	  {
> > > +	    tree val = VECTOR_CST_ELT (mask2, i);
> > > +	    gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > +	    unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > +	    bool existed;
> > > +	    unsigned HOST_WIDE_INT &rewrite_lane
> > > +	      = mask2_lanes.get_or_insert (j, &existed);
> > > +	    if (!existed)
> > > +	      {
> > > +		while (mask1_lanes.get (lane_gen))
> > > +		  lane_gen++;
> > > +		if (lane_gen >= nelts)
> > > +		  break;
> > > +		rewrite_lane = lane_gen;
> > > +		lane_gen++;
> > > +	      }
> > > +	  }
> > > +
> > > +	/* The requested lanes need more than one permute.  */
> > > +	if (i < nelts)
> > > +	  continue;
> > > +
> > > +	vec_perm_builder sel (nelts, nelts, 1);
> > > +	vec_perm_builder sel_1 (nelts, nelts, 1);
> > > +	vec_perm_builder sel_2 (nelts, nelts, 1);
> > > +
> > > +	/* Rewrite the permutations based on MASK1_LANES/MASK2_LANES.  */
> > > +	for (i = 0; i < nelts; i++)
> > > +	  {
> > > +	    /* Calculate new mask for STMT2.  */
> > > +	    tree val = VECTOR_CST_ELT (mask2, i);
> > > +	    unsigned HOST_WIDE_INT lane
> > > +	      = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
> > > +	    unsigned off = (lane < nelts) ? 0 : nelts;
> > > +	    unsigned HOST_WIDE_INT new_lane
> > > +	      = *mask2_lanes.get (lane - off) + off;
> > > +	    sel.quick_push (new_lane);
> > > +
> > > +	    /* Calculate new masks for SRC1_1 and SRC1_2.  */
> > > +	    bool use_src1 = mask1_lanes.get (i);
> > > +	    tree mask1 = gimple_assign_rhs3 (use_src1 ? src1_1 : src2_1);
> > > +	    tree mask2 = gimple_assign_rhs3 (use_src1 ? src1_2 : src2_2);
> > > +	    unsigned HOST_WIDE_INT lane1
> > > +	      = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) & (nelts - 1);
> > > +	    unsigned HOST_WIDE_INT lane2
> > > +	      = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) & (nelts - 1);
> > > +	    sel_1.quick_push (lane1 + (use_src1 ? 0 : nelts));
> > > +	    sel_2.quick_push (lane2 + (use_src1 ? 0 : nelts));
> > > +	  }
> > > +
> > > +	vec_perm_indices indices (sel, 2, nelts);
> > > +	vec_perm_indices indices1_1 (sel_1, 2, nelts);
> > > +	vec_perm_indices indices1_2 (sel_2, 2, nelts);
> > > +
> > > +	tree vectype = TREE_TYPE (gimple_assign_lhs (stmt2));
> > > +	if (!can_vec_perm_const_p (TYPE_MODE (vectype),
> > > +				   TYPE_MODE (vectype), indices)
> > > +	    || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > +				      TYPE_MODE (vectype), indices1_1)
> > > +	    || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > +				      TYPE_MODE (vectype), indices1_2))
> > > +	  continue;
> > > +
> > > +	/* Success.  Update all the statements.  */
> > > +	gimple_assign_set_rhs1 (stmt2, gimple_assign_rhs1 (stmt1));
> > > +	gimple_assign_set_rhs2 (stmt2, gimple_assign_rhs2 (stmt1));
> > > +	tree m1 = vect_gen_perm_mask_checked (vectype, indices);
> > > +	gimple_assign_set_rhs3 (stmt2, m1);
> > > +
> > > +	gimple_assign_set_rhs2 (src1_1, gimple_assign_rhs1 (src2_1));
> > > +	gimple_assign_set_rhs2 (src1_2, gimple_assign_rhs1 (src2_2));
> > > +
> > > +	tree m2 = vect_gen_perm_mask_checked (vectype, indices1_1);
> > > +	gimple_assign_set_rhs3 (src1_1, m2);
> > > +	tree m3 = vect_gen_perm_mask_checked (vectype, indices1_2);
> > > +	gimple_assign_set_rhs3 (src1_2, m3);
> > > +
> > > +	update_stmt (stmt2);
> > > +	update_stmt (src1_1);
> > > +	update_stmt (src1_2);
> > > +
> > > +	/* We need to move the updated statements because otherwise they may
> > > +	   come before some variable that they depend on.  Since we know that
> > > +	   they don't have uses outside the pattern, we can remove them and
> > > +	   add them back in order.  */
> > > +	gimple_stmt_iterator gsi_rm = gsi_for_stmt (bop1_1);
> > > +	gsi_remove (&gsi_rm, false);
> > > +	gsi_rm = gsi_for_stmt (bop1_2);
> > > +	gsi_remove (&gsi_rm, false);
> > > +	gsi_rm = gsi_for_stmt (src1_1);
> > > +	gsi_remove (&gsi_rm, false);
> > > +	gsi_rm = gsi_for_stmt (src1_2);
> > > +	gsi_remove (&gsi_rm, false);
> > > +
> > > +	gsi_insert_before (&gsi_stmt1, src1_1, GSI_SAME_STMT);
> > > +	gsi_insert_before (&gsi_stmt1, src1_2, GSI_SAME_STMT);
> > > +	gsi_insert_before (&gsi_stmt1, bop1_1, GSI_SAME_STMT);
> > > +	gsi_insert_before (&gsi_stmt1, bop1_2, GSI_SAME_STMT);
> > > +      }
> > > +}
> > > +
> >
> > So this is trying to recognize two back-to-back permutes?  I'm having a bit
> > of a hard time seeing what it's trying to match and what it's optimizing
> > into.  Can you add an example in the description of the function?
> >
> > I'm having trouble with what the relationship between src1_* and src2_*
> > is or should be, as they're never checked.
> >
> > Aside from that though this doesn't feel like it should be done in the
> > vectorizer.  It feels odd to have a post-vectorization permute optimization
> > pass in the vectorizer.
> >
> > I think this fits better in tree-ssa-forwprop.cc in simplify_permutation.
>
> Since the stmt groups are completely unrelated I don't think this fits.
>
I was thinking you could use forwprop to mark all existing permutes that
have the properties you're looking to merge and then iterate over them
after.  Forwprop already walks all permutes.

> I also see how in full generality such opportunities are difficult to
> exploit directly with the way we do greedy SLP discovery right now.
> With the idea of starting from SLP matching the SSA graph and instead
> merging nodes to form larger groups it might be able to catch this
> though.
>
> But yes, this doesn't seem to belong to tree-vect-slp.cc but elsewhere,
> iff we want to have it.  As said at the very beginning it seems to
> be very much a "SPEC hack" thing given its very narrow focus.
>
> For more general application I would assume that a forward simulation
> of a function to compute lane equivalences in vectors (those "unused"
> or "duplicate" lanes) and then shrinking and re-combining (re-vectorizing)
> based on that info sounds like a better thing to do than this hard
> pattern matching approach.  This would also sort-of mimic the
> "new SLP discovery" way.

Isn't that a too broad problem though?  Given, as you say, the permutes
are unrelated, if you generalize it to any permute that seems like it's
computationally expensive.

Tamar

> Richard.
>
> > But if you're rewriting the permutes then there must be some kind of
> > link, could you not find them through the use/def chain instead of
> > relying on them being next to each other?
> >
> > An example of what it's trying to match and rewrite to would be helpful
> > here.
> >
> > Thanks,
> > Tamar
> >
> > > /* Main entry for the BB vectorizer.  Analyze and transform BB, returns
> > >    true if anything in the basic-block was vectorized.  */
> > >
> > > @@ -9586,6 +9832,8 @@ vect_slp_function (function *fun)
> > >    if (!bbs.is_empty ())
> > >      r |= vect_slp_bbs (bbs, NULL);
> > >
> > > +  vect_slp_optimize_permutes (fun);
> > > +
> > >    free (rpo);
> > >
> > >    return r;
> > > --
> > > 2.46.0
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
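
[Editor's note: for readers trying to follow the transformation under discussion, the lane accounting at the heart of vect_slp_optimize_permutes can be modelled outside GIMPLE.  The sketch below is an illustration by the editor, not code from the patch; the function name and the representation of permute masks as plain lane lists are invented for this purpose.]

```python
# Model of the lane-merging test the patch performs: two "select
# permutes" each read only some lanes of an nelts-lane vector.  If the
# distinct lanes the second mask needs fit into the lanes the first
# mask leaves unused, the second permute can be remapped so both
# results share a single vector.

def merge_select_permutes(mask1, mask2, nelts):
    """Map each distinct source lane of mask2 to a lane unused by
    mask1.  Return the remap dict, or None if the combined lanes do
    not fit in one vector (mirroring the 'requested lanes need more
    than one permute' bail-out in the patch)."""
    used = {lane % nelts for lane in mask1}
    free = [lane for lane in range(nelts) if lane not in used]
    remap = {}
    for lane in (l % nelts for l in mask2):
        if lane not in remap:
            if not free:
                return None          # ran out of unused lanes
            remap[lane] = free.pop(0)
    return remap

# For instance, if the first permute only reads lanes {2, 3}, a second
# permute that only reads lanes {0, 1} can be relocated into the
# unused lanes 0 and 1:
print(merge_select_permutes([2, 3, 2, 3], [0, 1, 0, 1], 4))
# -> {0: 0, 1: 1}
```

In the patch itself a merge additionally requires matching binop codes in the two matched patterns and target support for the three rewritten constant permutes (the can_vec_perm_const_p checks).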