On Wed, Oct 30, 2024 at 11:26 AM Christoph Müllner
<christoph.muell...@vrull.eu> wrote:
>
> On Fri, Oct 18, 2024 at 1:08 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Fri, 18 Oct 2024, Tamar Christina wrote:
> >
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Friday, October 18, 2024 11:03 AM
> > > > To: Tamar Christina <tamar.christ...@arm.com>
> > > > Cc: Christoph Müllner <christoph.muell...@vrull.eu>; gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.toms...@vrull.eu>; Jeff Law <jeffreya...@gmail.com>; Robin Dapp <rd...@ventanamicro.com>
> > > > Subject: RE: [PATCH 2/2] Add a new permute optimization step in SLP
> > > >
> > > > On Thu, 17 Oct 2024, Tamar Christina wrote:
> > > >
> > > > > Hi Christoph,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > > Sent: Tuesday, October 15, 2024 3:57 PM
> > > > > > To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.toms...@vrull.eu>; Tamar Christina <tamar.christ...@arm.com>; Richard Biener <rguent...@suse.de>
> > > > > > Cc: Jeff Law <jeffreya...@gmail.com>; Robin Dapp <rd...@ventanamicro.com>; Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > > Subject: [PATCH 2/2] Add a new permute optimization step in SLP
> > > > > >
> > > > > > This commit adds a new permute optimization step after running
> > > > > > SLP vectorization.  Although there are existing places where
> > > > > > individual or nested permutes can be optimized, there are cases
> > > > > > where independent permutes can be optimized, which cannot be
> > > > > > expressed in the current pattern matching framework.  The
> > > > > > optimization step is run at the end so that permutes from
> > > > > > completely different SLP builds can be optimized.
> > > > > >
> > > > > > The initial optimizations implemented can detect some cases where
> > > > > > different "select permutes" (permutes that only use some of the
> > > > > > incoming vector lanes) can be co-located in a single permute.
> > > > > > This can optimize some cases where two_operator SLP nodes have
> > > > > > duplicate elements.
> > > > > >
> > > > > > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> > > > > >
> > > > > > Manolis Tsamis was the patch's initial author before I took it over.
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-vect-slp.cc (get_tree_def): Return the definition of a name.
> > > > > >         (recognise_perm_binop_perm_pattern): Helper function.
> > > > > >         (vect_slp_optimize_permutes): New permute optimization step.
> > > > > >         (vect_slp_function): Run the new permute optimization step.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/vect/slp-perm-14.c: New test.
> > > > > >         * gcc.target/aarch64/sve/slp-perm-14.c: New test.
> > > > > >
> > > > > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > > ---
> > > > > >  gcc/testsuite/gcc.dg/vect/slp-perm-14.c       |  42 +++
> > > > > >  .../gcc.target/aarch64/sve/slp-perm-14.c      |   3 +
> > > > > >  gcc/tree-vect-slp.cc                          | 248 ++++++++++++++++++
> > > > > >  3 files changed, 293 insertions(+)
> > > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > > >
> > > > > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-14.c b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..f56e3982a62
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > > > @@ -0,0 +1,42 @@
> > > > > > +/* { dg-do compile } */
> > > > > > +/* { dg-additional-options "-O3 -fdump-tree-slp1-details" } */
> > > > > > +
> > > > > > +#include <stdint.h>
> > > > > > +
> > > > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > > > > > +    int t0 = s0 + s1;\
> > > > > > +    int t1 = s0 - s1;\
> > > > > > +    int t2 = s2 + s3;\
> > > > > > +    int t3 = s2 - s3;\
> > > > > > +    d0 = t0 + t2;\
> > > > > > +    d1 = t1 + t3;\
> > > > > > +    d2 = t0 - t2;\
> > > > > > +    d3 = t1 - t3;\
> > > > > > +}
> > > > > > +
> > > > > > +int
> > > > > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> > > > > > +{
> > > > > > +  uint32_t tmp[4][4];
> > > > > > +  uint32_t a0, a1, a2, a3;
> > > > > > +  int sum = 0;
> > > > > > +
> > > > > > +  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> > > > > > +    {
> > > > > > +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > > > > > +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > > > > > +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > > > > > +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > > > > > +      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> > > > > > +    }
> > > > > > +
> > > > > > +  for (int i = 0; i < 4; i++)
> > > > > > +    {
> > > > > > +      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> > > > > > +      sum += a0 + a1 + a2 + a3;
> > > > > > +    }
> > > > > > +
> > > > > > +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> > > > > > +}
> > > > > > +
> > > > > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "slp1" } } */
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..4e0d5175be8
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > > > @@ -0,0 +1,3 @@
> > > > > > +#include "../../../gcc.dg/vect/slp-perm-14.c"
> > > > > > +
> > > > > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > > > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > > > > > index 8794c94ef90..4bf5ccb9cdf 100644
> > > > > > --- a/gcc/tree-vect-slp.cc
> > > > > > +++ b/gcc/tree-vect-slp.cc
> > > > > > @@ -9478,6 +9478,252 @@ vect_slp_if_converted_bb (basic_block bb, loop_p orig_loop)
> > > > > >    return vect_slp_bbs (bbs, orig_loop);
> > > > > >  }
> > > > > >
> > > > > > +/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
> > > > > > +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
> > > > > > +
> > > > > > +static gassign *
> > > > > > +get_tree_def (tree name, bool single_use_only)
> > > > > > +{
> > > > > > +  if (TREE_CODE (name) != SSA_NAME)
> > > > > > +    return NULL;
> > > > > > +
> > > > > > +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> > > > > > +
> > > > > > +  if (single_use_only && !has_single_use (name))
> > > > > > +    return NULL;
> > > > > > +
> > > > > > +  if (!is_gimple_assign (def_stmt))
> > > > > > +    return NULL;
> > > > >
> > > > > Probably cheaper to test this before the single_use. But..
> > > > > > +
> > > > > > +  return dyn_cast <gassign *> (def_stmt);
> > > > >
> > > > > Why not just do the dyn_cast and assign that to a result and check
> > > > > if the dyn_cast succeeded?  Saves you the second check of the type.
> > > > >
> > > > > > +}
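> > > > > Something like this, perhaps (untested sketch, same behaviour
> > > > > intended):
> > > > >
> > > > >   static gassign *
> > > > >   get_tree_def (tree name, bool single_use_only)
> > > > >   {
> > > > >     if (TREE_CODE (name) != SSA_NAME)
> > > > >       return NULL;
> > > > >
> > > > >     /* dyn_cast already rejects non-assignments, so the separate
> > > > >        is_gimple_assign check goes away, and the cheap cast comes
> > > > >        before the single-use walk.  */
> > > > >     gassign *def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (name));
> > > > >     if (!def)
> > > > >       return NULL;
> > > > >
> > > > >     if (single_use_only && !has_single_use (name))
> > > > >       return NULL;
> > > > >
> > > > >     return def;
> > > > >   }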
> > > > > > +
> > > > > > +/* Helper function for vect_slp_optimize_permutes.  Return true if STMT is an
> > > > > > +   expression of the form:
> > > > > > +
> > > > > > +     src1_perm = VEC_PERM_EXPR <SRC1, SRC1, CONST_VEC>
> > > > > > +     src2_perm = VEC_PERM_EXPR <SRC2, SRC2, CONST_VEC>
> > > > > > +     bop1 = src1_perm BINOP1 src2_perm
> > > > > > +     bop2 = src1_perm BINOP2 src2_perm
> > > > > > +     STMT = VEC_PERM_EXPR <bop1, bop2, CONST_VEC>
> > > > > > +
> > > > > > +   and src1_perm, src2_perm, bop1, bop2 are not used outside of STMT.
> > > > > > +   Return the first two permute statements and the binops through the
> > > > > > +   corresponding pointer arguments.  */
> > > > > > +
> > > > > > +static bool
> > > > > > +recognise_perm_binop_perm_pattern (gassign *stmt,
> > > > > > +                                  gassign **bop1_out, gassign **bop2_out,
> > > > > > +                                  gassign **perm1_out, gassign **perm2_out)
> > > > > > +{
> > > > > > +  if (gimple_assign_rhs_code (stmt) != VEC_PERM_EXPR)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  gassign *bop1, *bop2;
> > > > > > +  if (!(bop1 = get_tree_def (gimple_assign_rhs1 (stmt), true))
> > > > > > +      || !(bop2 = get_tree_def (gimple_assign_rhs2 (stmt), true))
> > > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (stmt)).is_constant ()
> > > > > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop1)) != tcc_binary
> > > > > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop2)) != tcc_binary)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  if (gimple_assign_rhs1 (bop1) != gimple_assign_rhs1 (bop2)
> > > > > > +      || gimple_assign_rhs2 (bop1) != gimple_assign_rhs2 (bop2)
> > > > > > +      || bop1 == bop2)
> > > > > > +    return false;
> > > > > > +
> > > > > > +  tree src1_perm = gimple_assign_rhs1 (bop1);
> > > > > > +  tree src2_perm = gimple_assign_rhs2 (bop1);
> > > > > > +
> > > > > > +  gassign *perm1, *perm2;
> > > > > > +  if (!(perm1 = get_tree_def (src1_perm, false))
> > > > > > +      || !(perm2 = get_tree_def (src2_perm, false))
> > > > > > +      || num_imm_uses (src1_perm) != 2
> > > > > > +      || num_imm_uses (src2_perm) != 2
> > > > > > +      || gimple_assign_rhs_code (perm1) != VEC_PERM_EXPR
> > > > > > +      || gimple_assign_rhs_code (perm2) != VEC_PERM_EXPR
> > > > > > +      || gimple_assign_rhs1 (perm1) != gimple_assign_rhs2 (perm1)
> > > > > > +      || gimple_assign_rhs1 (perm2) != gimple_assign_rhs2 (perm2)
> > > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm1)).is_constant ()
> > > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm2)).is_constant ())
> > > > > > +    return false;
> > > > > > +
> > > > > > +  *bop1_out = bop1;
> > > > > > +  *bop2_out = bop2;
> > > > > > +  *perm1_out = perm1;
> > > > > > +  *perm2_out = perm2;
> > > > > > +
> > > > > > +  return true;
> > > > > > +}
> > > > > > +
> > > > > > +typedef hash_map<int_hash<unsigned HOST_WIDE_INT, -1, -1>,
> > > > > > +                unsigned HOST_WIDE_INT> lane_map;
> > > > > > +
> > > > > > +/* Iterate over the basic blocks of FUN and try to optimize VEC_PERM
> > > > > > +   expressions.  This is done at the end of vectorization so it can optimize
> > > > > > +   expressions that are part of multiple different vectorized blocks/loops.  */
> > > > > > +
> > > > > > +static void
> > > > > > +vect_slp_optimize_permutes (function *fun)
> > > > > > +{
> > > > > > +  basic_block bb;
> > > > > > +  FOR_EACH_BB_FN (bb, fun)
> > > > > > +    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
> > > > > > +      {
> > > > > > +       gimple_stmt_iterator gsi_stmt1 = gsi;
> > > > > > +       gassign *stmt1 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > > > > +       gsi_next (&gsi);
> > > >
> > > > gsi_next_nondebug (&gsi);
> > > >
> > > > to avoid compare-debug issues.
> > > >
> > > > It seems quite limited that you restrict this to adjacent stmts.
> > > >
> > > > > > +
> > > > > > +       if (gsi_end_p (gsi))
> > > > > > +         break;
> > > > > > +
> > > > > > +       gassign *stmt2 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > > > > +
> > > > > > +       if (!stmt1 || !stmt2)
> > > > > > +         continue;
> > > > > > +
> > > > > > +       /* Try to identify select permutes which utilize only part of a
> > > > > > +          vector and merge two of them into one.  This case can arise from
> > > > > > +          TWO_OPERATOR SLP patterns because the final permute uses only half
> > > > > > +          of each input vector.  */
> > > > > > +       gassign *bop1_1, *bop1_2, *bop2_1, *bop2_2;
> > > > > > +       gassign *src1_1, *src1_2, *src2_1, *src2_2;
> > > > > > +
> > > > > > +       if (!recognise_perm_binop_perm_pattern (stmt1, &bop1_1, &bop1_2,
> > > > > > +                                              &src1_1, &src1_2))
> > > > > > +         continue;
> > > > > > +
> > > > > > +       if (!recognise_perm_binop_perm_pattern (stmt2, &bop2_1, &bop2_2,
> > > > > > +                                              &src2_1, &src2_2))
> > > > > > +         continue;
> > > > > > +
> > > > > > +       if (gimple_assign_rhs_code (bop1_1) != gimple_assign_rhs_code (bop2_1))
> > > > > > +         continue;
> > > > > > +
> > > > > > +       if (gimple_assign_rhs_code (bop1_2) != gimple_assign_rhs_code (bop2_2))
> > > > > > +         continue;
> > > >
> > > > Hmm, up to here we've only verified that both "binop_perm"s use the
> > > > same binop operation code, but we've not related their SRC1 or SRC2?
> > > >
> > > > From 1/2 in the series I guess we're trying to see whether we can
> > > > re-use "unused lanes" here?
> > > >
> > > > > > +
> > > > > > +       tree mask1 = gimple_assign_rhs3 (stmt1);
> > > > > > +       tree mask2 = gimple_assign_rhs3 (stmt2);
> > > > > > +
> > > > > > +       /* Calculate the lanes that the first permute needs.  */
> > > > > > +       lane_map mask1_lanes;
> > > > > > +       unsigned HOST_WIDE_INT i;
> > > > > > +       unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (mask1).to_constant ();
> > > > > > +       for (i = 0; i < nelts; i++)
> > > > > > +         {
> > > > > > +           tree val = VECTOR_CST_ELT (mask1, i);
> > > > > > +           gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > > > > +           unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > > > > +           mask1_lanes.put (j, j);
> > > > > > +         }
> > > > > > +
> > > > > > +       /* Calculate the lanes that the second permute needs and rewrite
> > > > > > +          them to use the lanes that are unused from the first permute.  */
> > > > > > +       lane_map mask2_lanes;
> > > > > > +       unsigned HOST_WIDE_INT lane_gen = 0;
> > > > > > +       for (i = 0; i < nelts; i++)
> > > > > > +         {
> > > > > > +           tree val = VECTOR_CST_ELT (mask2, i);
> > > > > > +           gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > > > > +           unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > > > > +           bool existed;
> > > > > > +           unsigned HOST_WIDE_INT &rewrite_lane
> > > > > > +             = mask2_lanes.get_or_insert (j, &existed);
> > > > > > +           if (!existed)
> > > > > > +             {
> > > > > > +               while (mask1_lanes.get (lane_gen))
> > > > > > +                 lane_gen++;
> > > > > > +               if (lane_gen >= nelts)
> > > > > > +                 break;
> > > > > > +               rewrite_lane = lane_gen;
> > > > > > +               lane_gen++;
> > > > > > +             }
> > > > > > +         }
> > > > > > +
> > > > > > +       /* The requested lanes need more than one permute.  */
> > > > > > +       if (i < nelts)
> > > > > > +         continue;
> > > > > > +
> > > > > > +       vec_perm_builder sel (nelts, nelts, 1);
> > > > > > +       vec_perm_builder sel_1 (nelts, nelts, 1);
> > > > > > +       vec_perm_builder sel_2 (nelts, nelts, 1);
> > > > > > +
> > > > > > +       /* Rewrite the permutations based on MASK1_LANES/MASK2_LANES.  */
> > > > > > +       for (i = 0; i < nelts; i++)
> > > > > > +         {
> > > > > > +           /* Calculate new mask for STMT2.  */
> > > > > > +           tree val = VECTOR_CST_ELT (mask2, i);
> > > > > > +           unsigned HOST_WIDE_INT lane
> > > > > > +             = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
> > > > > > +           unsigned off = (lane < nelts) ? 0 : nelts;
> > > > > > +           unsigned HOST_WIDE_INT new_lane
> > > > > > +             = *mask2_lanes.get (lane - off) + off;
> > > > > > +           sel.quick_push (new_lane);
> > > > > > +
> > > > > > +           /* Calculate new masks for SRC1_1 and SRC1_2.  */
> > > > > > +           bool use_src1 = mask1_lanes.get (i);
> > > > > > +           tree mask1 = gimple_assign_rhs3 (use_src1 ? src1_1 : src2_1);
> > > > > > +           tree mask2 = gimple_assign_rhs3 (use_src1 ? src1_2 : src2_2);
> > > > > > +           unsigned HOST_WIDE_INT lane1
> > > > > > +             = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) & (nelts - 1);
> > > > > > +           unsigned HOST_WIDE_INT lane2
> > > > > > +             = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) & (nelts - 1);
> > > > > > +           sel_1.quick_push (lane1 + (use_src1 ? 0 : nelts));
> > > > > > +           sel_2.quick_push (lane2 + (use_src1 ? 0 : nelts));
> > > > > > +         }
> > > > > > +
> > > > > > +       vec_perm_indices indices (sel, 2, nelts);
> > > > > > +       vec_perm_indices indices1_1 (sel_1, 2, nelts);
> > > > > > +       vec_perm_indices indices1_2 (sel_2, 2, nelts);
> > > > > > +
> > > > > > +       tree vectype = TREE_TYPE (gimple_assign_lhs (stmt2));
> > > > > > +       if (!can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > > +                                  TYPE_MODE (vectype), indices)
> > > > > > +           || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > > +                                     TYPE_MODE (vectype), indices1_1)
> > > > > > +           || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > > +                                     TYPE_MODE (vectype), indices1_2))
> > > > > > +         continue;
> > > > > > +
> > > > > > +       /* Success.  Update all the statements.  */
> > > > > > +       gimple_assign_set_rhs1 (stmt2, gimple_assign_rhs1 (stmt1));
> > > > > > +       gimple_assign_set_rhs2 (stmt2, gimple_assign_rhs2 (stmt1));
> > > > > > +       tree m1 = vect_gen_perm_mask_checked (vectype, indices);
> > > > > > +       gimple_assign_set_rhs3 (stmt2, m1);
> > > > > > +
> > > > > > +       gimple_assign_set_rhs2 (src1_1, gimple_assign_rhs1 (src2_1));
> > > > > > +       gimple_assign_set_rhs2 (src1_2, gimple_assign_rhs1 (src2_2));
> > > > > > +
> > > > > > +       tree m2 = vect_gen_perm_mask_checked (vectype, indices1_1);
> > > > > > +       gimple_assign_set_rhs3 (src1_1, m2);
> > > > > > +       tree m3 = vect_gen_perm_mask_checked (vectype, indices1_2);
> > > > > > +       gimple_assign_set_rhs3 (src1_2, m3);
> > > > > > +
> > > > > > +       update_stmt (stmt2);
> > > > > > +       update_stmt (src1_1);
> > > > > > +       update_stmt (src1_2);
> > > > > > +
> > > > > > +       /* We need to move the updated statements because otherwise they may
> > > > > > +          come before some variable that they depend on.  Since we know that
> > > > > > +          they don't have uses outside the pattern, we can remove them and
> > > > > > +          add them back in order.  */
> > > > > > +       gimple_stmt_iterator gsi_rm = gsi_for_stmt (bop1_1);
> > > > > > +       gsi_remove (&gsi_rm, false);
> > > > > > +       gsi_rm = gsi_for_stmt (bop1_2);
> > > > > > +       gsi_remove (&gsi_rm, false);
> > > > > > +       gsi_rm = gsi_for_stmt (src1_1);
> > > > > > +       gsi_remove (&gsi_rm, false);
> > > > > > +       gsi_rm = gsi_for_stmt (src1_2);
> > > > > > +       gsi_remove (&gsi_rm, false);
> > > > > > +
> > > > > > +       gsi_insert_before (&gsi_stmt1, src1_1, GSI_SAME_STMT);
> > > > > > +       gsi_insert_before (&gsi_stmt1, src1_2, GSI_SAME_STMT);
> > > > > > +       gsi_insert_before (&gsi_stmt1, bop1_1, GSI_SAME_STMT);
> > > > > > +       gsi_insert_before (&gsi_stmt1, bop1_2, GSI_SAME_STMT);
> > > > > > +      }
> > > > > > +}
> > > > > > +
> > > > >
> > > > > So this is trying to recognize two back-to-back permutes?  I'm having
> > > > > a bit of a hard time seeing what it's trying to match and what it's
> > > > > optimizing into.  Can you add an example in the description of the
> > > > > function?
> > > > >
> > > > > I'm having trouble with what the relationship between src1_* and
> > > > > src2_* is or should be, as they're never checked.
> > > > >
> > > > > Aside from that though, this doesn't feel like it should be done in
> > > > > the vectorizer.  It feels odd to have a post-vectorization permute
> > > > > optimization pass in the vectorizer.
> > > > >
> > > > > I think this fits better in tree-ssa-forwprop.cc in simplify_permutation.
> > > >
> > > > Since the stmt groups are completely unrelated I don't think this fits.
> > > >
> > >
> > > I was thinking you could use forwprop to mark all existing permutes that
> > > have the properties you're looking to merge and then iterate over them
> > > after.  Forwprop already walks all permutes.
> > >
> > > > I also see how in full generality such opportunities are difficult to
> > > > exploit directly with the way we do greedy SLP discovery right now.
> > > > With the idea of starting from SLP matching the SSA graph and instead
> > > > merging nodes to form larger groups it might be able to catch this
> > > > though.
> > > >
> > > > But yes, this doesn't seem to belong to tree-vect-slp.cc but elsewhere,
> > > > iff we want to have it.  As said at the very beginning, it seems to
> > > > be very much a "SPEC hack" thing given its very narrow focus.
> > > >
> > > > For more general application I would assume that a forward simulation
> > > > of a function to compute lane equivalences in vectors (those "unused"
> > > > or "duplicate" lanes) and then shrinking and re-combining
> > > > (re-vectorizing) based on that info sounds like a better thing to do
> > > > than this hard pattern matching approach.  This would also sort-of
> > > > mimic the "new SLP discovery" way.
> > > >
> > >
> > > Isn't that too broad a problem, though?  Given that, as you say, the
> > > permutes are unrelated, generalizing this to any permute seems
> > > computationally expensive.
> >
> > For most generality you'd try to combine all PLUS operations in a
> > basic-block/function/program to a single vector operation and then
> > split along those pairs you cannot combine.  But sure, this is
> > computationally expensive unless you manage to quickly look up
> > likely-win candidates.  For the case in question you'd want O(quick)
> > lookup of the same BINOP with half of the lanes of each unused, and key
> > off validation from the hopefully not too many such candidates.
> >
> > That said - the patch as-is also tries to match up two operation chains
> > in the attempt to combine them; it just very much restricts itself
> > to a pair of adjacent stmts to identify those two chains.
>
> Thank you, Richard and Tamar, for the feedback.
> I'm responding with some delay because I had to close some knowledge gaps
> about SLP to avoid making wrong assumptions/statements.
>
> So, what these two patches are doing is using the redundancy in the lane
> utilization to parallelize vectorized sequences.
> This redundancy in lane utilization comes from the way specific
> scalar statements end up vectorized: two VEC_PERMs on top, binary operations
> on both of them, and a final VEC_PERM to create the result.
> Here is an example of this sequence:
>
> [...] // v_in = {e0, e1, e2, e3}
> v_1 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> // v_1 = {e1, e3, e1, e3}
> v_2 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> // v_2 = {e0, e2, e0, e2}
> v_y = v_2 - v_1                           // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
> v_x = v_2 + v_1                           // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
> v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
>
> The first patch frees up lanes 2 and 3 and changes the last statement to:
> v_out = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
>
> The second patch searches for a pair of such lane-compressed sequences in a BB
> (with identical vector operations) and merges them into a single sequence.
>
> Overall, on success, we replace 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM)
> instructions by 2x VEC_PERM + 2x BINOP + 2x VEC_PERM.
> 2x VEC_PERM and 2x BINOP get eliminated.
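>
> To make that concrete, here is a rough sketch of the merged form,
> assuming a second, analogous sequence whose input vector I'll call w_in
> (with elements f0..f3, analogous to the v_in example above):
>
> v_1 = VEC_PERM <v_in, w_in, {1, 3, 5, 7}>  // v_1 = {e1, e3, f1, f3}
> v_2 = VEC_PERM <v_in, w_in, {0, 2, 4, 6}>  // v_2 = {e0, e2, f0, f2}
> v_y = v_2 - v_1                            // v_y = {e0-e1, e2-e3, f0-f1, f2-f3}
> v_x = v_2 + v_1                            // v_x = {e0+e1, e2+e3, f0+f1, f2+f3}
> v_out = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>  // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
> w_out = VEC_PERM <v_x, v_y, {2, 3, 6, 7}>  // w_out = {f0+f1, f2+f3, f0-f1, f2-f3}
>
> The {2, 3, 6, 7} selector of the second result is also what the new
> testcase scans for.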
>
> So, the two optimizations are:
> * lane compression in a certain sequence of vector statements
> * merge two lane-compressed sequences of vector statements
>
> Richard stated that the lane compression is a possibly pessimizing
> change. Therefore, I've merged the two optimizations into a new
> pass ("slp_perm_simplify") that runs after slp1.
> The pass now searches for lane-compression candidates and calculates
> what the resulting selector in the final VEC_PERM would look like.
> Then, we attempt to pair two sequences (which don't need to be adjacent)
> up for merging. If we find a pair, then the two optimizations are applied.
>
> This targets x264_pixel_satd_8x4, which calculates the sum of absolute
> transformed differences (SATD) using a Hadamard transformation.
> We have seen a 7% speedup on SPEC's x264 (on an AArch64 machine).
>
> I still need to clean up the code and optimize the analysis part of it
> before sending out a patch, but I wanted to show a sign of life and
> provide the missing context that was requested.

A reworked patch can be found here:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-November/667355.html

I tried to address all comments from the replies.
However, I need some guidance on making this optimization more generic.
E.g., I was not able to find an example with longer calculation chains that
could still be merged (which would justify walking a use-def chain).

I'm aware that creating a pass is likely not the right thing, but I was not
sure where else to put the code.  It should be relatively easy to move the
code into another pass that iterates over all gimple statements in all BBs.
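
For reference, here is roughly what the pass wrapper boils down to (a
minimal sketch; the pass name "slp_perm_simplify" is from the reworked
patch, while the class and function names and the remaining boilerplate
are placeholders following the usual GCC pass conventions, and would need
adjusting if the code moves into an existing pass):

  namespace {

  const pass_data pass_data_perm_simplify =
  {
    GIMPLE_PASS,          /* type */
    "slp_perm_simplify",  /* name */
    OPTGROUP_VEC,         /* optinfo_flags */
    TV_NONE,              /* tv_id */
    PROP_ssa,             /* properties_required */
    0,                    /* properties_provided */
    0,                    /* properties_destroyed */
    0,                    /* todo_flags_start */
    0,                    /* todo_flags_finish */
  };

  class pass_perm_simplify : public gimple_opt_pass
  {
  public:
    pass_perm_simplify (gcc::context *ctxt)
      : gimple_opt_pass (pass_data_perm_simplify, ctxt) {}

    bool gate (function *) final override
    { return flag_tree_slp_vectorize != 0; }

    unsigned int execute (function *fun) final override
    {
      /* Find and merge pairs of lane-compressed permute sequences;
	 this would call the optimization entry point from the patch.  */
      vect_slp_optimize_permutes (fun);
      return 0;
    }
  };

  } // anon namespace

  gimple_opt_pass *
  make_pass_perm_simplify (gcc::context *ctxt)
  {
    return new pass_perm_simplify (ctxt);
  }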

>
>
>
> >
> > Richard.
> >
> > > Tamar
> > >
> > > > Richard.
> > > >
> > > > > But if you're rewriting the permutes then there must be some kind of 
> > > > > link, could
> > > > you
> > > > > not find them through the use/def chain instead of relying on them 
> > > > > being next to
> > > > each other?
> > > > >
> > > > > An example of what It's trying to match and rewrite to would be 
> > > > > helpful here.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > >  /* Main entry for the BB vectorizer.  Analyze and transform BB, returns
> > > > > >     true if anything in the basic-block was vectorized.  */
> > > > > >
> > > > > > @@ -9586,6 +9832,8 @@ vect_slp_function (function *fun)
> > > > > >    if (!bbs.is_empty ())
> > > > > >      r |= vect_slp_bbs (bbs, NULL);
> > > > > >
> > > > > > +  vect_slp_optimize_permutes (fun);
> > > > > > +
> > > > > >    free (rpo);
> > > > > >
> > > > > >    return r;
> > > > > > --
> > > > > > 2.46.0
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguent...@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> > >
> >
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
