On Tue, Apr 22, 2025 at 10:01 AM Hongtao Liu <crazy...@gmail.com> wrote: > > On Mon, Apr 21, 2025 at 4:30 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Mon, Apr 21, 2025 at 11:29 AM Hongtao Liu <crazy...@gmail.com> wrote: > > > > > > On Sat, Apr 19, 2025 at 1:25 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > On Sun, Dec 1, 2024 at 7:50 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > > > For all different modes of all 0s/1s vectors, we can use the single > > > > > widest > > > > > all 0s/1s vector register for all 0s/1s vector uses in the whole > > > > > function. > > > > > Add a pass to generate a single widest all 0s/1s vector set > > > > > instruction at > > > > > entry of the nearest common dominator for basic blocks with all 0s/1s > > > > > vector uses. On Linux/x86-64, in cc1plus, this patch reduces the > > > > > number > > > > > of vector xor instructions from 4803 to 4714 and pcmpeq instructions > > > > > from > > > > > 144 to 142. > > > > > > > > > > This change causes a regression: > > > > > > > > > > FAIL: gcc.dg/rtl/x86_64/vector_eq.c > > > > > > > > > > without the fix for > > > > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117863 > > > > > > > > > > NB: PR target/92080 and PR target/117839 aren't same. PR > > > > > target/117839 > > > > > is for vectors of all 0s and all 1s with different sizes and different > > > > > components. PR target/92080 is for broadcast of the same component to > > > > > different vector sizes. This patch covers only all 0s and all 1s > > > > > cases > > > > > of PR target/92080. > > > > > > > > > > gcc/ > > > > > > > > > > PR target/92080 > > > > > PR target/117839 > > > > > * config/i386/i386-features.cc (ix86_rrvl_gate): New. > > > > > (ix86_place_single_vector_set): Likewise. > > > > > (ix86_get_vector_load_mode): Likewise. > > > > > (remove_redundant_vector_load): Likewise. > > > > > (pass_data_remove_redundant_vector_load): Likewise. > > > > > (pass_remove_redundant_vector_load): Likewise. > > > > > (make_pass_remove_redundant_vector_load): Likewise. > > > > > * config/i386/i386-passes.def: Add > > > > > pass_remove_redundant_vector_load after > > > > > pass_remove_partial_avx_dependency. > > > > > * config/i386/i386-protos.h > > > > > (make_pass_remove_redundant_vector_load): New. > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > PR target/92080 > > > > > PR target/117839 > > > > > * gcc.target/i386/pr117839-1a.c: New test. > > > > > * gcc.target/i386/pr117839-1b.c: Likewise. > > > > > * gcc.target/i386/pr117839-2.c: Likewise. > > > > > * gcc.target/i386/pr92080-1.c: Likewise. > > > > > * gcc.target/i386/pr92080-2.c: Likewise. > > > > > * gcc.target/i386/pr92080-3.c: Likewise. > > > > > > > > > > Signed-off-by: H.J. Lu <hjl.to...@gmail.com> > > > > > --- > > > > > gcc/config/i386/i386-features.cc | 308 > > > > > ++++++++++++++++++++ > > > > > gcc/config/i386/i386-passes.def | 1 + > > > > > gcc/config/i386/i386-protos.h | 2 + > > > > > gcc/testsuite/gcc.target/i386/pr117839-1a.c | 35 +++ > > > > > gcc/testsuite/gcc.target/i386/pr117839-1b.c | 5 + > > > > > gcc/testsuite/gcc.target/i386/pr117839-2.c | 40 +++ > > > > > gcc/testsuite/gcc.target/i386/pr92080-1.c | 54 ++++ > > > > > gcc/testsuite/gcc.target/i386/pr92080-2.c | 59 ++++ > > > > > gcc/testsuite/gcc.target/i386/pr92080-3.c | 48 +++ > > > > > 9 files changed, 552 insertions(+) > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1a.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1b.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-2.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-1.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-2.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-3.c > > > > > > > > > > diff --git a/gcc/config/i386/i386-features.cc > > > > > b/gcc/config/i386/i386-features.cc > > > > > index 003b003e09c..7d8d260750d 100644 > > > > > --- a/gcc/config/i386/i386-features.cc > > > > > +++ b/gcc/config/i386/i386-features.cc > > > > > @@ -3288,6 +3288,314 @@ make_pass_remove_partial_avx_dependency > > > > > (gcc::context *ctxt) > > > > > return new pass_remove_partial_avx_dependency (ctxt); > > > > > } > > > > > > > > > > +static bool > > > > > +ix86_rrvl_gate () > > > > > +{ > > > > > + return (TARGET_SSE2 > > > > > + && optimize > > > > > + && optimize_function_for_speed_p (cfun)); > > > > > +} > > > > > + > > > > > +/* Generate a vector set, DEST = SRC, at entry of the nearest > > > > > dominator > > > > > + for basic block map BBS, which is in the fake loop that contains > > > > > the > > > > > + whole function, so that there is only a single vector set in the > > > > > + whole function. */ > > > > > + > > > > > +static void > > > > > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs) > > > > > +{ > > > > > + basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, > > > > > bbs); > > > > > + while (bb->loop_father->latch > > > > > + != EXIT_BLOCK_PTR_FOR_FN (cfun)) > > > > > + bb = get_immediate_dominator (CDI_DOMINATORS, > > > > > + bb->loop_father->header); > > > > > + > > > > > + rtx set = gen_rtx_SET (dest, src); > > > > > + > > > > > + rtx_insn *insn = BB_HEAD (bb); > > > > > + while (insn && !NONDEBUG_INSN_P (insn)) > > > > > + { > > > > > + if (insn == BB_END (bb)) > > > > > + { > > > > > + insn = NULL; > > > > > + break; > > > > > + } > > > > > + insn = NEXT_INSN (insn); > > > > > + } > > > > > + > > > > > + rtx_insn *set_insn; > > > > > + if (insn == BB_HEAD (bb)) > > > > > + set_insn = emit_insn_before (set, insn); > > > > > + else > > > > > + set_insn = emit_insn_after (set, > > > > > + insn ? PREV_INSN (insn) : BB_END > > > > > (bb)); > > > > > + df_insn_rescan (set_insn); > > > > > +} > > > > > + > > > > > +/* Return a machine mode suitable for vector SIZE. */ > > > > > + > > > > > +static machine_mode > > > > > +ix86_get_vector_load_mode (unsigned int size) > > > > > +{ > > > > > + machine_mode mode; > > > > > + if (size == 64) > > > > > + mode = V64QImode; > > > > > + else if (size == 32) > > > > > + mode = V32QImode; > > > > > + else > > > > > + mode = V16QImode; > > > > > + return mode; > > > > > +} > > > > > + > > > > > +/* At entry of the nearest common dominator for basic blocks with > > > > > vector > > > > > + CONST0_RTX and integer CONSTM1_RTX uses, generate a single widest > > > > > + vector set instruction for all CONST0_RTX and integer CONSTM1_RTX > > > > > + uses. > > > > > + > > > > > + NB: We want to generate only a single widest vector set to cover > > > > > the > > > > > + whole function. The LCM algorithm isn't appropriate here since it > > > > > + may place a vector set inside the loop. */ > > > > > + > > > > > +static unsigned int > > > > > +remove_redundant_vector_load (void) > > > > > +{ > > > > > + timevar_push (TV_MACH_DEP); > > > > > + > > > > > + bitmap_obstack_initialize (NULL); > > > > > + bitmap zero_bbs = BITMAP_ALLOC (NULL); > > > > > + bitmap m1_bbs = BITMAP_ALLOC (NULL); > > > > > + bitmap vector_insns = BITMAP_ALLOC (NULL); > > > Use auto_bitmap? > > > > Will fix it. > > > > > > > + > > > > > + basic_block bb; > > > > > + rtx_insn *insn; > > > > > + rtx set; > > > > > + unsigned HOST_WIDE_INT zero_count = 0; > > > > > + unsigned HOST_WIDE_INT m1_count = 0; > > > > > + unsigned int zero_size = 0; > > > > > + unsigned int m1_size = 0; > > > > > + > > > > > + df_set_flags (DF_DEFER_INSN_RESCAN); > > > > > + > > > > > + FOR_EACH_BB_FN (bb, cfun) > > > > > + { > > > > > + FOR_BB_INSNS (bb, insn) > > > > > + { > > > > > + if (!NONDEBUG_INSN_P (insn)) > > > > > + continue; > > > > > + > > > > > + set = single_set (insn); > > > > > + if (!set) > > > > > + continue; > > > > > + > > > > > + rtx dest = SET_DEST (set); > > > > > + machine_mode mode = GET_MODE (dest); > > > > > + /* Skip non-vector instruction. */ > > > > > + if (!VECTOR_MODE_P (mode)) > > > > > + continue; > > > > > + > > > > > + rtx src = SET_SRC (set); > > > > > + if (!REG_P (dest) > > > > > + || (src != CONST0_RTX (mode) > > > > > + && !(GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > > > + && src == CONSTM1_RTX (mode)))) > > > > > + { > > > > > + /* Record non-CONST0_RTX/CONSTM1_RTX vector > > > > > instruction. */ > > > vector_insns only records a single_set, but not all vector_insns which > > > could use constm1/zero. > > > > Will fix it. > > > > > > > + bitmap_set_bit (vector_insns, INSN_UID (insn)); > > > > > + continue; > > > > > + } > > > > > + > > > > > + if (src == CONST0_RTX (mode)) > > > > > + { > > > > > + /* Record the maximum vector size. */ > > > > > + if (zero_size < GET_MODE_SIZE (mode)) > > > > > + zero_size = GET_MODE_SIZE (mode); > > > > > + > > > > > + /* Record the basic block with CONST0_RTX. */ > > > > > + bitmap_set_bit (zero_bbs, bb->index); > > > > > + zero_count++; > > > > > + } > > > > > + else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT > > > > > + && src == CONSTM1_RTX (mode)) > > > > > + { > > > > > + /* Record the maximum vector size. */ > > > > > + if (m1_size < GET_MODE_SIZE (mode)) > > > > > + m1_size = GET_MODE_SIZE (mode); > > > > > + > > > > > + /* Record the basic block with CONSTM1_RTX. */ > > > > > + bitmap_set_bit (m1_bbs, bb->index); > > > > > + m1_count++; > > > > > + } > > > > > + } > > > > > + } > > > > > + > > > > > + if (zero_count > 1 || m1_count > 1) > > > > > + { > > > > > + machine_mode zero_mode, m1_mode; > > > > > + rtx vector_const0, vector_constm1; > > > > > + if (zero_count > 1) > > > > > + { > > > > > + zero_mode = ix86_get_vector_load_mode (zero_size); > > > > > + vector_const0 = gen_reg_rtx (zero_mode); > > > > > + } > > > > > + else > > > > > + { > > > > > + zero_mode = VOIDmode; > > > > > + vector_const0 = nullptr; > > > > > + } > > > > > + if (m1_count > 1) > > > > > + { > > > > > + m1_mode = ix86_get_vector_load_mode (m1_size); > > > > > + vector_constm1 = gen_reg_rtx (m1_mode); > > > > > + } > > > > > + else > > > > > + { > > > > > + m1_mode = VOIDmode; > > > > > + vector_constm1 = nullptr; > > > > > + } > > > > > + > > > > > + bool zero_replaced = false; > > > > > + bool m1_replaced = false; > > > > > + > > > > > + bitmap_iterator bi; > > > > > + unsigned id; > > > > > + EXECUTE_IF_SET_IN_BITMAP (vector_insns, 0, id, bi) > > > Could we just record those zero/m1 insn, and replace the src of those > > > insn with subreg (...), I think LRA can eliminate those redundant > > > moves? > > > > This is what my patch does: > But it iterates through vector_insns, using a def-ref chain to find > those insns. I think we can just record those single_set with src as > const_m1/zero, and replace src for them.
Will fix it. > > > > /* Check the single definition of CONST0_RTX and integer > > CONSTM1_RTX. */ > > rtx src = SET_SRC (set); > > rtx replace; > > if (vector_const0 && src == CONST0_RTX (mode)) > > { > > /* Replace REG with VECTOR_CONST0. */ > > if (SUBREG_P (reg) || mode == zero_mode) > > replace = vector_const0; > > else > > replace = gen_rtx_SUBREG (mode, vector_const0, 0); > > *DF_REF_REAL_LOC (ref) = replace; > > replaced = true; > > zero_replaced = true; > > } > > > > It changed the source to a subreg directly. > > > > > Also we also need to change ix86_modes_tieable_p to make sure those > > > inserted subreg can be handled by LRA and other passes? > > > > ix86_modes_tieable_p is OK: > > > > /* If MODE2 is only appropriate for an SSE register, then tie with > > any other mode acceptable to SSE registers. */ > > if (GET_MODE_SIZE (mode2) == 64 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > > return (GET_MODE_SIZE (mode1) == 64 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > > if (GET_MODE_SIZE (mode2) == 32 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > > return (GET_MODE_SIZE (mode1) == 32 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > > if (GET_MODE_SIZE (mode2) == 16 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > > return (GET_MODE_SIZE (mode1) == 16 > > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > > > It's ok only size of mode1 is equal to size of mode2. > But in the testcase, there are different size vectors(32-bytes, 16-bytes). > > So it would be better as, for mode2 >= 16 bytes, it can only be put > into SSE_REGS(except for TImode, but TImode still can be tied to > <=16bytes mode1 which can be put into SSE_REGS) , if mode1 can also be > put into SSE_REGS, then mode2 tie with mode1. > > /* If MODE2 is only appropriate for an SSE register, then tie with > any other mode acceptable to SSE registers. */ > - if (GET_MODE_SIZE (mode2) == 64 > - && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > - return (GET_MODE_SIZE (mode1) == 64 > - && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > - if (GET_MODE_SIZE (mode2) == 32 > - && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > - return (GET_MODE_SIZE (mode1) == 32 > - && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > - if (GET_MODE_SIZE (mode2) == 16 > + if (GET_MODE_SIZE (mode2) >= 16 > + && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2) > && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) > - return (GET_MODE_SIZE (mode1) == 16 > - && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1)); > + return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1); > This caused: FAIL: gcc.target/i386/pr111267.c scan-assembler-not movd FAIL: gcc.target/i386/pr111267.c scan-assembler-not movq FAIL: gcc.target/i386/pr82580.c scan-assembler-not \\mmovzb since GCC thinks it is cheap to get QI/HI/SI/DI from TI in XMM. I am testing: /* If MODE2 is only appropriate for an SSE register, then tie with any other mode acceptable to SSE registers, excluding (subreg:QI (reg:TI 99) 0)) (subreg:HI (reg:TI 99) 0)) (subreg:SI (reg:TI 99) 0)) (subreg:DI (reg:TI 99) 0)) to avoid unnecessary move from SSE register to integer register. */ if (GET_MODE_SIZE (mode2) >= 16 && (GET_MODE_SIZE (mode1) == GET_MODE_SIZE (mode2) || (!INTEGRAL_MODE_P (mode1) && GET_MODE_SIZE (mode1) <= GET_MODE_SIZE (mode2))) && ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode2)) return ix86_hard_regno_mode_ok (FIRST_SSE_REG, mode1); -- H.J.