Re: [PATCH v2] x86: Add a pass to remove redundant all 0s/1s vector load

Hongtao Liu Mon, 28 Apr 2025 19:09:01 -0700

On Mon, Apr 28, 2025 at 5:07 PM H.J. Lu <[email protected]> wrote:
>
> On Mon, Apr 28, 2025 at 4:26 PM H.J. Lu <[email protected]> wrote:
> >
>
> > > > This is what my patch does:
> > > But it iterates through vector_insns, using a def-ref chain to find
> > > those insns. I think we can just record those single_set with src as
> > > const_m1/zero, and replace src for them.
> >
> > Will fix it.
>
> Fixed in the v2 patch.
>
> > > >
> > > >              /* Check the single definition of CONST0_RTX and integer
> > > >                  CONSTM1_RTX.  */
> > > >               rtx src = SET_SRC (set);
> > > >               rtx replace;
> > > >               if (vector_const0 && src == CONST0_RTX (mode))
> > > >                 {
> > > >                   /* Replace REG with VECTOR_CONST0.  */
> > > >                   if (SUBREG_P (reg) || mode == zero_mode)
> > > >                     replace = vector_const0;
> > > >                   else
> > > >                     replace = gen_rtx_SUBREG (mode, vector_const0, 0);
> > > >                   *DF_REF_REAL_LOC (ref) = replace;
> > > >                   replaced = true;
> > > >                   zero_replaced = true;
> > > >                 }
> > > >
> > > > It changed the source to a subreg directly.
> > > >
> > > > > Also we also need to change ix86_modes_tieable_p to make sure those
> > > > > inserted subreg can be handled by LRA and other passes?
> > > >
> > > > ix86_modes_tieable_p is OK:
> > > >
> > > >  /* If MODE2 is only appropriate for an SSE register, then tie with
> > > >      any other mode acceptable to SSE registers.  */
> > > >   if (GET_MODE_SIZE (mode2) == 64
> > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > >     return (GET_MODE_SIZE (mode1) == 64
> > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > >   if (GET_MODE_SIZE (mode2) == 32
> > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > >     return (GET_MODE_SIZE (mode1) == 32
> > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > >   if (GET_MODE_SIZE (mode2) == 16
> > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > >     return (GET_MODE_SIZE (mode1) == 16
> > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > >
> > > It's ok only size of mode1 is equal to size of mode2.
> > > But in the testcase, there are different size vectors(32-bytes, 16-bytes).
> > >
> > > So it would be better as, for mode2 >= 16 bytes, it can only be put
> > > into SSE_REGS(except for TImode, but TImode still can be tied to
> > > <=16bytes mode1 which can be put into SSE_REGS) , if mode1 can also be
> > > put into SSE_REGS, then mode2 tie with mode1.
> > >
> > >    /* If MODE2 is only appropriate for an SSE register, then tie with
> > >       any other mode acceptable to SSE registers.  */
> > > -  if (GET_MODE_SIZE (mode2) == 64
> > > -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > -    return (GET_MODE_SIZE (mode1) == 64
> > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > -  if (GET_MODE_SIZE (mode2) == 32
> > > -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > -    return (GET_MODE_SIZE (mode1) == 32
> > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > -  if (GET_MODE_SIZE (mode2) == 16
> > > +  if (GET_MODE_SIZE (mode2) >= 16
> > > +      && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)
> > >        && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > -    return (GET_MODE_SIZE (mode1) == 16
> > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > +    return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1);
> > >
> >
> > This caused:
> >
> > FAIL: gcc.target/i386/pr111267.c scan-assembler-not movd
> > FAIL: gcc.target/i386/pr111267.c scan-assembler-not movq
> > FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb
> >
> > since GCC thinks it is cheap to get  QI/HI/SI/DI from TI in XMM.
> > I am testing:
> >
> >  /* If MODE2 is only appropriate for an SSE register, then tie with
> >      any other mode acceptable to SSE registers, excluding
> >         (subreg:QI (reg:TI 99) 0))
> >         (subreg:HI (reg:TI 99) 0))
> >         (subreg:SI (reg:TI 99) 0))
> >         (subreg:DI (reg:TI 99) 0))
> >      to avoid unnecessary move from SSE register to integer register.
> >    */
> >   if (GET_MODE_SIZE (mode2) >= 16
> >       && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2)
> >           || (!INTEGRAL_MODE_P (mode1)
> >               && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)))
> >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
It looks like only scalar_int_mode_p should be exclude, not vector_mode?
so, how about
   if (GET_MODE_SIZE (mode2) >= 16
       && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2)
           || ((VECTOR_MODE_P (mode1) || SCALAR_FLOAT_MODE_P (mode1))
               && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)))
       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))


Only allow vector mode or scalar floating point in mode1 to be tied to
mode2 in SSE_REGS?


>+      /* NB: Don't run recog_memoized here since vector SUBREG may not
>+ be valid.  Let LRA handle vector SUBREG.  */
It's tricky, we can insert a move pattern with dest same size as
SET_SRC (set), but with same component mode as constm1/constm0, so it
should be something like
mov v8qi, const0(v32qimode)
mov v2sf, subreg:v2sf (v8qi)

I think LRA could eliminate the redundant move.

>+static void
>+ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs)
>+{
For the convenience of maintain, can we also replace the corresponding
code in the remove_partial_avx_dependency function with a call to this
ix86_place_single_vector_set function

+}

>+      SET_SRC (set) = replace;
>+      /* Drop possible dead definitions.  */
>+      PATTERN (insn) = set;
>+      df_insn_rescan (insn);

> >     return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1);
> >
>
> Tested.  There is no regression.
>
> Here is the v2 patch.   Ok for master?
>
> Thanks.
>
> --
> H.J.



-- 
BR,
Hongtao

Re: [PATCH v2] x86: Add a pass to remove redundant all 0s/1s vector load

Reply via email to