Re: PING: [PATCH] x86: Add a pass to remove redundant all 0s/1s vector load

H.J. Lu Mon, 28 Apr 2025 01:37:36 -0700

On Tue, Apr 22, 2025 at 10:01 AM Hongtao Liu <[email protected]> wrote:
>
> On Mon, Apr 21, 2025 at 4:30 PM H.J. Lu <[email protected]> wrote:
> >
> > On Mon, Apr 21, 2025 at 11:29 AM Hongtao Liu <[email protected]> wrote:
> > >
> > > On Sat, Apr 19, 2025 at 1:25 PM H.J. Lu <[email protected]> wrote:
> > > >
> > > > On Sun, Dec 1, 2024 at 7:50 AM H.J. Lu <[email protected]> wrote:
> > > > >
> > > > > For all different modes of all 0s/1s vectors, we can use the single 
> > > > > widest
> > > > > all 0s/1s vector register for all 0s/1s vector uses in the whole 
> > > > > function.
> > > > > Add a pass to generate a single widest all 0s/1s vector set 
> > > > > instruction at
> > > > > entry of the nearest common dominator for basic blocks with all 0s/1s
> > > > > vector uses.  On Linux/x86-64, in cc1plus, this patch reduces the 
> > > > > number
> > > > > of vector xor instructions from 4803 to 4714 and pcmpeq instructions 
> > > > > from
> > > > > 144 to 142.
> > > > >
> > > > > This change causes a regression:
> > > > >
> > > > > FAIL: gcc.dg/rtl/x86_64/vector_eq.c
> > > > >
> > > > > without the fix for
> > > > >
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117863
> > > > >
> > > > > NB: PR target/92080 and PR target/117839 aren't same.  PR 
> > > > > target/117839
> > > > > is for vectors of all 0s and all 1s with different sizes and different
> > > > > components.  PR target/92080 is for broadcast of the same component to
> > > > > different vector sizes.  This patch covers only all 0s and all 1s 
> > > > > cases
> > > > > of PR target/92080.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR target/92080
> > > > >         PR target/117839
> > > > >         * config/i386/i386-features.cc (ix86_rrvl_gate): New.
> > > > >         (ix86_place_single_vector_set): Likewise.
> > > > >         (ix86_get_vector_load_mode): Likewise.
> > > > >         (remove_redundant_vector_load): Likewise.
> > > > >         (pass_data_remove_redundant_vector_load): Likewise.
> > > > >         (pass_remove_redundant_vector_load): Likewise.
> > > > >         (make_pass_remove_redundant_vector_load): Likewise.
> > > > >         * config/i386/i386-passes.def: Add
> > > > >         pass_remove_redundant_vector_load after
> > > > >         pass_remove_partial_avx_dependency.
> > > > >         * config/i386/i386-protos.h
> > > > >         (make_pass_remove_redundant_vector_load): New.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >         PR target/92080
> > > > >         PR target/117839
> > > > >         * gcc.target/i386/pr117839-1a.c: New test.
> > > > >         * gcc.target/i386/pr117839-1b.c: Likewise.
> > > > >         * gcc.target/i386/pr117839-2.c: Likewise.
> > > > >         * gcc.target/i386/pr92080-1.c: Likewise.
> > > > >         * gcc.target/i386/pr92080-2.c: Likewise.
> > > > >         * gcc.target/i386/pr92080-3.c: Likewise.
> > > > >
> > > > > Signed-off-by: H.J. Lu <[email protected]>
> > > > > ---
> > > > >  gcc/config/i386/i386-features.cc            | 308 
> > > > > ++++++++++++++++++++
> > > > >  gcc/config/i386/i386-passes.def             |   1 +
> > > > >  gcc/config/i386/i386-protos.h               |   2 +
> > > > >  gcc/testsuite/gcc.target/i386/pr117839-1a.c |  35 +++
> > > > >  gcc/testsuite/gcc.target/i386/pr117839-1b.c |   5 +
> > > > >  gcc/testsuite/gcc.target/i386/pr117839-2.c  |  40 +++
> > > > >  gcc/testsuite/gcc.target/i386/pr92080-1.c   |  54 ++++
> > > > >  gcc/testsuite/gcc.target/i386/pr92080-2.c   |  59 ++++
> > > > >  gcc/testsuite/gcc.target/i386/pr92080-3.c   |  48 +++
> > > > >  9 files changed, 552 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1a.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1b.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-3.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386-features.cc 
> > > > > b/gcc/config/i386/i386-features.cc
> > > > > index 003b003e09c..7d8d260750d 100644
> > > > > --- a/gcc/config/i386/i386-features.cc
> > > > > +++ b/gcc/config/i386/i386-features.cc
> > > > > @@ -3288,6 +3288,314 @@ make_pass_remove_partial_avx_dependency 
> > > > > (gcc::context *ctxt)
> > > > >    return new pass_remove_partial_avx_dependency (ctxt);
> > > > >  }
> > > > >
> > > > > +static bool
> > > > > +ix86_rrvl_gate ()
> > > > > +{
> > > > > +  return (TARGET_SSE2
> > > > > +         && optimize
> > > > > +         && optimize_function_for_speed_p (cfun));
> > > > > +}
> > > > > +
> > > > > +/* Generate a vector set, DEST = SRC, at entry of the nearest 
> > > > > dominator
> > > > > +   for basic block map BBS, which is in the fake loop that contains 
> > > > > the
> > > > > +   whole function, so that there is only a single vector set in the
> > > > > +   whole function.   */
> > > > > +
> > > > > +static void
> > > > > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs)
> > > > > +{
> > > > > +  basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, 
> > > > > bbs);
> > > > > +  while (bb->loop_father->latch
> > > > > +        != EXIT_BLOCK_PTR_FOR_FN (cfun))
> > > > > +    bb = get_immediate_dominator (CDI_DOMINATORS,
> > > > > +                                 bb->loop_father->header);
> > > > > +
> > > > > +  rtx set = gen_rtx_SET (dest, src);
> > > > > +
> > > > > +  rtx_insn *insn = BB_HEAD (bb);
> > > > > +  while (insn && !NONDEBUG_INSN_P (insn))
> > > > > +    {
> > > > > +      if (insn == BB_END (bb))
> > > > > +       {
> > > > > +         insn = NULL;
> > > > > +         break;
> > > > > +       }
> > > > > +      insn = NEXT_INSN (insn);
> > > > > +    }
> > > > > +
> > > > > +  rtx_insn *set_insn;
> > > > > +  if (insn == BB_HEAD (bb))
> > > > > +    set_insn = emit_insn_before (set, insn);
> > > > > +  else
> > > > > +    set_insn = emit_insn_after (set,
> > > > > +                               insn ? PREV_INSN (insn) : BB_END 
> > > > > (bb));
> > > > > +  df_insn_rescan (set_insn);
> > > > > +}
> > > > > +
> > > > > +/* Return a machine mode suitable for vector SIZE.  */
> > > > > +
> > > > > +static machine_mode
> > > > > +ix86_get_vector_load_mode (unsigned int size)
> > > > > +{
> > > > > +  machine_mode mode;
> > > > > +  if (size == 64)
> > > > > +    mode = V64QImode;
> > > > > +  else if (size == 32)
> > > > > +    mode = V32QImode;
> > > > > +  else
> > > > > +    mode = V16QImode;
> > > > > +  return mode;
> > > > > +}
> > > > > +
> > > > > +/* At entry of the nearest common dominator for basic blocks with 
> > > > > vector
> > > > > +   CONST0_RTX and integer CONSTM1_RTX uses, generate a single widest
> > > > > +   vector set instruction for all CONST0_RTX and integer CONSTM1_RTX
> > > > > +   uses.
> > > > > +
> > > > > +   NB: We want to generate only a single widest vector set to cover 
> > > > > the
> > > > > +   whole function.  The LCM algorithm isn't appropriate here since it
> > > > > +   may place a vector set inside the loop.  */
> > > > > +
> > > > > +static unsigned int
> > > > > +remove_redundant_vector_load (void)
> > > > > +{
> > > > > +  timevar_push (TV_MACH_DEP);
> > > > > +
> > > > > +  bitmap_obstack_initialize (NULL);
> > > > > +  bitmap zero_bbs = BITMAP_ALLOC (NULL);
> > > > > +  bitmap m1_bbs = BITMAP_ALLOC (NULL);
> > > > > +  bitmap vector_insns = BITMAP_ALLOC (NULL);
> > > Use auto_bitmap?
> >
> > Will fix it.
> >
> > > > > +
> > > > > +  basic_block bb;
> > > > > +  rtx_insn *insn;
> > > > > +  rtx set;
> > > > > +  unsigned HOST_WIDE_INT zero_count = 0;
> > > > > +  unsigned HOST_WIDE_INT m1_count = 0;
> > > > > +  unsigned int zero_size = 0;
> > > > > +  unsigned int m1_size = 0;
> > > > > +
> > > > > +  df_set_flags (DF_DEFER_INSN_RESCAN);
> > > > > +
> > > > > +  FOR_EACH_BB_FN (bb, cfun)
> > > > > +    {
> > > > > +      FOR_BB_INSNS (bb, insn)
> > > > > +       {
> > > > > +         if (!NONDEBUG_INSN_P (insn))
> > > > > +           continue;
> > > > > +
> > > > > +         set = single_set (insn);
> > > > > +         if (!set)
> > > > > +           continue;
> > > > > +
> > > > > +         rtx dest = SET_DEST (set);
> > > > > +         machine_mode mode = GET_MODE (dest);
> > > > > +         /* Skip non-vector instruction.  */
> > > > > +         if (!VECTOR_MODE_P (mode))
> > > > > +           continue;
> > > > > +
> > > > > +         rtx src = SET_SRC (set);
> > > > > +         if (!REG_P (dest)
> > > > > +             || (src != CONST0_RTX (mode)
> > > > > +                 && !(GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > > > > +                      && src == CONSTM1_RTX (mode))))
> > > > > +           {
> > > > > +             /* Record non-CONST0_RTX/CONSTM1_RTX vector 
> > > > > instruction.  */
> > > vector_insns only records a single_set, but not all vector_insns which
> > > could use constm1/zero.
> >
> > Will fix it.
> >
> > > > > +             bitmap_set_bit (vector_insns, INSN_UID (insn));
> > > > > +             continue;
> > > > > +           }
> > > > > +
> > > > > +         if (src == CONST0_RTX (mode))
> > > > > +           {
> > > > > +             /* Record the maximum vector size.  */
> > > > > +             if (zero_size < GET_MODE_SIZE (mode))
> > > > > +               zero_size = GET_MODE_SIZE (mode);
> > > > > +
> > > > > +             /* Record the basic block with CONST0_RTX.  */
> > > > > +             bitmap_set_bit (zero_bbs, bb->index);
> > > > > +             zero_count++;
> > > > > +           }
> > > > > +         else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > > > > +                  && src == CONSTM1_RTX (mode))
> > > > > +           {
> > > > > +             /* Record the maximum vector size.  */
> > > > > +             if (m1_size < GET_MODE_SIZE (mode))
> > > > > +               m1_size = GET_MODE_SIZE (mode);
> > > > > +
> > > > > +             /* Record the basic block with CONSTM1_RTX.  */
> > > > > +             bitmap_set_bit (m1_bbs, bb->index);
> > > > > +             m1_count++;
> > > > > +           }
> > > > > +       }
> > > > > +    }
> > > > > +
> > > > > +  if (zero_count > 1 || m1_count > 1)
> > > > > +    {
> > > > > +      machine_mode zero_mode, m1_mode;
> > > > > +      rtx vector_const0, vector_constm1;
> > > > > +      if (zero_count > 1)
> > > > > +       {
> > > > > +         zero_mode = ix86_get_vector_load_mode (zero_size);
> > > > > +         vector_const0 = gen_reg_rtx (zero_mode);
> > > > > +       }
> > > > > +      else
> > > > > +       {
> > > > > +         zero_mode = VOIDmode;
> > > > > +         vector_const0 = nullptr;
> > > > > +       }
> > > > > +      if (m1_count > 1)
> > > > > +       {
> > > > > +         m1_mode = ix86_get_vector_load_mode (m1_size);
> > > > > +         vector_constm1 = gen_reg_rtx (m1_mode);
> > > > > +       }
> > > > > +      else
> > > > > +       {
> > > > > +         m1_mode = VOIDmode;
> > > > > +         vector_constm1 = nullptr;
> > > > > +       }
> > > > > +
> > > > > +      bool zero_replaced = false;
> > > > > +      bool m1_replaced = false;
> > > > > +
> > > > > +      bitmap_iterator bi;
> > > > > +      unsigned id;
> > > > > +      EXECUTE_IF_SET_IN_BITMAP (vector_insns, 0, id, bi)
> > > Could we just record those zero/m1 insn, and replace the src of those
> > > insn with subreg (...), I think LRA can eliminate those redundant
> > > moves?
> >
> > This is what my patch does:
> But it iterates through vector_insns, using a def-ref chain to find
> those insns. I think we can just record those single_set with src as
> const_m1/zero, and replace src for them.


Will fix it.

> >
> >              /* Check the single definition of CONST0_RTX and integer
> >                  CONSTM1_RTX.  */
> >               rtx src = SET_SRC (set);
> >               rtx replace;
> >               if (vector_const0 && src == CONST0_RTX (mode))
> >                 {
> >                   /* Replace REG with VECTOR_CONST0.  */
> >                   if (SUBREG_P (reg) || mode == zero_mode)
> >                     replace = vector_const0;
> >                   else
> >                     replace = gen_rtx_SUBREG (mode, vector_const0, 0);
> >                   *DF_REF_REAL_LOC (ref) = replace;
> >                   replaced = true;
> >                   zero_replaced = true;
> >                 }
> >
> > It changed the source to a subreg directly.
> >
> > > Also we also need to change ix86_modes_tieable_p to make sure those
> > > inserted subreg can be handled by LRA and other passes?
> >
> > ix86_modes_tieable_p is OK:
> >
> >  /* If MODE2 is only appropriate for an SSE register, then tie with
> >      any other mode acceptable to SSE registers.  */
> >   if (GET_MODE_SIZE (mode2) == 64
> >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> >     return (GET_MODE_SIZE (mode1) == 64
> >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> >   if (GET_MODE_SIZE (mode2) == 32
> >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> >     return (GET_MODE_SIZE (mode1) == 32
> >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> >   if (GET_MODE_SIZE (mode2) == 16
> >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> >     return (GET_MODE_SIZE (mode1) == 16
> >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> >
> It's ok only size of mode1 is equal to size of mode2.
> But in the testcase, there are different size vectors(32-bytes, 16-bytes).
>
> So it would be better as, for mode2 >= 16 bytes, it can only be put
> into SSE_REGS(except for TImode, but TImode still can be tied to
> <=16bytes mode1 which can be put into SSE_REGS) , if mode1 can also be
> put into SSE_REGS, then mode2 tie with mode1.
>
>    /* If MODE2 is only appropriate for an SSE register, then tie with
>       any other mode acceptable to SSE registers.  */
> -  if (GET_MODE_SIZE (mode2) == 64
> -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> -    return (GET_MODE_SIZE (mode1) == 64
> -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> -  if (GET_MODE_SIZE (mode2) == 32
> -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> -    return (GET_MODE_SIZE (mode1) == 32
> -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> -  if (GET_MODE_SIZE (mode2) == 16
> +  if (GET_MODE_SIZE (mode2) >= 16
> +      && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)
>        && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> -    return (GET_MODE_SIZE (mode1) == 16
> -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> +    return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1);
>

This caused:

FAIL: gcc.target/i386/pr111267.c scan-assembler-not movd
FAIL: gcc.target/i386/pr111267.c scan-assembler-not movq
FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb

since GCC thinks it is cheap to get  QI/HI/SI/DI from TI in XMM.
I am testing:

 /* If MODE2 is only appropriate for an SSE register, then tie with
     any other mode acceptable to SSE registers, excluding
        (subreg:QI (reg:TI 99) 0))
        (subreg:HI (reg:TI 99) 0))
        (subreg:SI (reg:TI 99) 0))
        (subreg:DI (reg:TI 99) 0))
     to avoid unnecessary move from SSE register to integer register.
   */
  if (GET_MODE_SIZE (mode2) >= 16
      && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2)
          || (!INTEGRAL_MODE_P (mode1)
              && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)))
      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
    return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1);

-- 
H.J.

Re: PING: [PATCH] x86: Add a pass to remove redundant all 0s/1s vector load

Reply via email to