RE: [PATCH v3] x86: Add a pass to remove redundant all 0s/1s vector load

Liu, Hongtao Tue, 29 Apr 2025 00:05:17 -0700

> -----Original Message-----
> From: H.J. Lu <[email protected]>
> Sent: Tuesday, April 29, 2025 2:59 PM
> To: Hongtao Liu <[email protected]>
> Cc: GCC Patches <[email protected]>; Liu, Hongtao
> <[email protected]>; Uros Bizjak <[email protected]>
> Subject: [PATCH v3] x86: Add a pass to remove redundant all 0s/1s vector
> load
> 
> On Tue, Apr 29, 2025 at 11:27 AM H.J. Lu <[email protected]> wrote:
> >
> > On Tue, Apr 29, 2025 at 10:08 AM Hongtao Liu <[email protected]>
> wrote:
> > >
> > > On Mon, Apr 28, 2025 at 5:07 PM H.J. Lu <[email protected]> wrote:
> > > >
> > > > On Mon, Apr 28, 2025 at 4:26 PM H.J. Lu <[email protected]> wrote:
> > > > >
> > > >
> > > > > > > This is what my patch does:
> > > > > > But it iterates through vector_insns, using a def-ref chain to
> > > > > > find those insns. I think we can just record those single_set
> > > > > > with src as const_m1/zero, and replace src for them.
> > > > >
> > > > > Will fix it.
> > > >
> > > > Fixed in the v2 patch.
> > > >
> > > > > > >
> > > > > > >              /* Check the single definition of CONST0_RTX and 
> > > > > > > integer
> > > > > > >                  CONSTM1_RTX.  */
> > > > > > >               rtx src = SET_SRC (set);
> > > > > > >               rtx replace;
> > > > > > >               if (vector_const0 && src == CONST0_RTX (mode))
> > > > > > >                 {
> > > > > > >                   /* Replace REG with VECTOR_CONST0.  */
> > > > > > >                   if (SUBREG_P (reg) || mode == zero_mode)
> > > > > > >                     replace = vector_const0;
> > > > > > >                   else
> > > > > > >                     replace = gen_rtx_SUBREG (mode, 
> > > > > > > vector_const0, 0);
> > > > > > >                   *DF_REF_REAL_LOC (ref) = replace;
> > > > > > >                   replaced = true;
> > > > > > >                   zero_replaced = true;
> > > > > > >                 }
> > > > > > >
> > > > > > > It changed the source to a subreg directly.
> > > > > > >
> > > > > > > > Also we also need to change ix86_modes_tieable_p to make
> > > > > > > > sure those inserted subreg can be handled by LRA and other
> passes?
> > > > > > >
> > > > > > > ix86_modes_tieable_p is OK:
> > > > > > >
> > > > > > >  /* If MODE2 is only appropriate for an SSE register, then tie 
> > > > > > > with
> > > > > > >      any other mode acceptable to SSE registers.  */
> > > > > > >   if (GET_MODE_SIZE (mode2) == 64
> > > > > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > >     return (GET_MODE_SIZE (mode1) == 64
> > > > > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > > > > >   if (GET_MODE_SIZE (mode2) == 32
> > > > > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > >     return (GET_MODE_SIZE (mode1) == 32
> > > > > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > > > > >   if (GET_MODE_SIZE (mode2) == 16
> > > > > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > >     return (GET_MODE_SIZE (mode1) == 16
> > > > > > >             && ix86_hard_regno_mode_ok (FIRST_SSE_REG,
> > > > > > > mode1));
> > > > > > >
> > > > > > It's ok only size of mode1 is equal to size of mode2.
> > > > > > But in the testcase, there are different size vectors(32-bytes, 16-
> bytes).
> > > > > >
> > > > > > So it would be better as, for mode2 >= 16 bytes, it can only
> > > > > > be put into SSE_REGS(except for TImode, but TImode still can
> > > > > > be tied to <=16bytes mode1 which can be put into SSE_REGS) ,
> > > > > > if mode1 can also be put into SSE_REGS, then mode2 tie with mode1.
> > > > > >
> > > > > >    /* If MODE2 is only appropriate for an SSE register, then tie 
> > > > > > with
> > > > > >       any other mode acceptable to SSE registers.  */
> > > > > > -  if (GET_MODE_SIZE (mode2) == 64
> > > > > > -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > -    return (GET_MODE_SIZE (mode1) == 64
> > > > > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > > > > -  if (GET_MODE_SIZE (mode2) == 32
> > > > > > -      && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > -    return (GET_MODE_SIZE (mode1) == 32
> > > > > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > > > > -  if (GET_MODE_SIZE (mode2) == 16
> > > > > > +  if (GET_MODE_SIZE (mode2) >= 16
> > > > > > +      && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)
> > > > > >        && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > > > > -    return (GET_MODE_SIZE (mode1) == 16
> > > > > > -           && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1));
> > > > > > +    return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1);
> > > > > >
> > > > >
> > > > > This caused:
> > > > >
> > > > > FAIL: gcc.target/i386/pr111267.c scan-assembler-not movd
> > > > > FAIL: gcc.target/i386/pr111267.c scan-assembler-not movq
> > > > > FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb
> > > > >
> > > > > since GCC thinks it is cheap to get  QI/HI/SI/DI from TI in XMM.
> > > > > I am testing:
> > > > >
> > > > >  /* If MODE2 is only appropriate for an SSE register, then tie with
> > > > >      any other mode acceptable to SSE registers, excluding
> > > > >         (subreg:QI (reg:TI 99) 0))
> > > > >         (subreg:HI (reg:TI 99) 0))
> > > > >         (subreg:SI (reg:TI 99) 0))
> > > > >         (subreg:DI (reg:TI 99) 0))
> > > > >      to avoid unnecessary move from SSE register to integer register.
> > > > >    */
> > > > >   if (GET_MODE_SIZE (mode2) >= 16
> > > > >       && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2)
> > > > >           || (!INTEGRAL_MODE_P (mode1)
> > > > >               && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)))
> > > > >       && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > > It looks like only scalar_int_mode_p should be exclude, not vector_mode?
> > > so, how about
> > >    if (GET_MODE_SIZE (mode2) >= 16
> > >        && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2)
> > >            || ((VECTOR_MODE_P (mode1) || SCALAR_FLOAT_MODE_P
> (mode1))
> > >                && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2)))
> > >        && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2))
> > >
> > > Only allow vector mode or scalar floating point in mode1 to be tied
> > > to
> > > mode2 in SSE_REGS?
> >
> > Will fix.
> 
> Fixed in v3.
> 
> > >
> > > >+      /* NB: Don't run recog_memoized here since vector SUBREG may
> > > >+ not be valid.  Let LRA handle vector SUBREG.  */
> > > It's tricky, we can insert a move pattern with dest same size as
> > > SET_SRC (set), but with same component mode as constm1/constm0, so
> > > it should be something like mov v8qi, const0(v32qimode) mov v2sf,
> > > subreg:v2sf (v8qi)
> >
> > IIt is exactly what my patch generates:
> >
> > (insn 31 2 5 2 (set (reg:V32QI 110)
> >         (const_vector:V32QI [
> >                 (const_int 0 [0]) repeated x32
> >             ])) -1
> >      (nil))
> > (insn 5 31 6 2 (set (reg:V2DF 98)
> >         (subreg:V2DF (reg:V32QI 110) 0)) "z2.c":35:6 2421 {movv2df_internal}
> >      (nil))
> > (insn 6 5 7 2 (set (mem/c:V2DF (symbol_ref:DI ("d1") [flags 0x2]
> > <var_decl 0x7fd11d6341c8 d1>) [1 d1+0 S16 A128])
> >         (reg:V2DF 98)) "z2.c":35:6 2421 {movv2df_internal}
> >      (expr_list:REG_DEAD (reg:V2DF 98)
> >         (nil)))
> > (insn 7 6 8 2 (set (reg:V4SF 99)
> >         (subreg:V4SF (reg:V32QI 110) 0)) "z2.c":36:6 2418 {movv4sf_internal}
> >      (nil))
> > (insn 8 7 9 2 (set (mem/c:V4SF (symbol_ref:DI ("f1") [flags 0x2]
> > <var_decl 0x7fd11d634130 f1>) [2 f1+0 S16 A128])
> >         (reg:V4SF 99)) "z2.c":36:6 2418 {movv4sf_internal}
> >      (expr_list:REG_DEAD (reg:V4SF 99)
> >         (nil)))
> > ...
> >
> > and LRA does eliminate the redundant moves.
> >
> > > I think LRA could eliminate the redundant move.
> > >
> > > >+static void
> > > >+ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs) {
> > > For the convenience of maintain, can we also replace the
> > > corresponding code in the remove_partial_avx_dependency function
> > > with a call to this ix86_place_single_vector_set function
> >
> > Will fix.
> 
> Fixed in v3.
> 
> Here is the v3 patch.  OK for master?
LGTM.
> 
> Thanks.
> 
> --
> H.J.
> ---
> For all different modes of all 0s/1s vectors, we can use the single widest all
> 0s/1s vector register for all 0s/1s vector uses in the whole function.
> Add a pass to generate a single widest all 0s/1s vector set instruction at 
> entry
> of the nearest common dominator for basic blocks with all 0s/1s vector uses.
> On Linux/x86-64, in cc1plus, this patch reduces the number of vector xor
> instructions from 4803 to 4714 and pcmpeq instructions from
> 144 to 142.
> 
> NB: PR target/92080 and PR target/117839 aren't same.  PR target/117839 is
> for vectors of all 0s and all 1s with different sizes and different 
> components.
> PR target/92080 is for broadcast of the same component to different vector
> sizes.  This patch covers only all 0s and all 1s cases of PR target/92080.
> 
> gcc/
> 
> PR target/92080
> PR target/117839
> * config/i386/i386-features.cc (ix86_place_single_vector_set):
> New function.
> (remove_partial_avx_dependency): Use it.
> (ix86_get_vector_load_mode): New function.
> (replace_vector_const): Likewise.
> (remove_redundant_vector_load): Likewise.
> (pass_data_remove_redundant_vector_load): Likewise.
> (pass_remove_redundant_vector_load): Likewise.
> (make_pass_remove_redundant_vector_load): Likewise.
> * config/i386/i386-passes.def: Add
> pass_remove_redundant_vector_load after
> pass_remove_partial_avx_dependency.
> * config/i386/i386-protos.h
> (make_pass_remove_redundant_vector_load): New.
> * config/i386/i386.cc (ix86_modes_tieable_p): Return true for narrower non-
> scalar-integer modes in SSE registers.
> 
> gcc/testsuite/
> 
> PR target/92080
> PR target/117839
> * gcc.target/i386/pr117839-1a.c: New test.
> * gcc.target/i386/pr117839-1b.c: Likewise.
> * gcc.target/i386/pr117839-2.c: Likewise.
> * gcc.target/i386/pr92080-1.c: Likewise.
> * gcc.target/i386/pr92080-2.c: Likewise.
> * gcc.target/i386/pr92080-3.c: Likewise.
RE: [PATCH v3] x86: Add a pass to remove redundant all 0s/1s vector load

Reply via email to