On Wed, Aug 18, 2021 at 4:32 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Wed, 18 Aug 2021, Hongtao Liu wrote:
>
> > On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > This is an attempt to start moving the x86 backend to use
> > > > > standard pattern names for [mask_]gather_load and [mask_]scatter_store
> > > > > rather than using the builtin_{gather,scatter} target hooks.
> > > > >
> > > > > I've started with AVX2 gathers and given x86 only supports masked
> > > > > gather I only implemented mask_gather_load.  Note while for
> > > > > the builtin_gather case the vectorizer will provide an all-true
> > > > > mask operand for non-masked gathers this capability does not
> > > > > exist for the IFN path yet, so only testcases with actual masked
> > > > > gathers will work.
> > > > >
> > > > > If this looks reasonable on the backend side I'll see to first
> > > > > complete the vectorizer part, ripping out the target hook and
> > > > > arranging for the missing pieces.  Another one is the support
> > > > > for SImode indices with DFmode data which requires unpacking
> > > > > the index vector and actually recognizing the IFN.
> > > > >
> > > > > 2021-08-17  Richard Biener  <rguent...@suse.de>
> > > > >
> > > > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > > > >         Always use internal functions.
> > > > >         * config/i386/sse.md
> > > > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > > ---
> > > > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > > > >  gcc/tree-vect-data-refs.c |  4 +--
> > > > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > > index 3957c86c3df..40bec98d9f7 100644
> > > > > --- a/gcc/config/i386/sse.md
> > > > > +++ b/gcc/config/i386/sse.md
> > > > > @@ -23232,12 +23232,22 @@
> > > > >            (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > > >            (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > > >            (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > > +(define_mode_attr vec_gather_idxsi
> > > > > +                  [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > > +                   (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > > +                   (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > > +                   (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > > >
> > > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > > >                    [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > >                     (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > > >                     (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > > >                     (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > > +(define_mode_attr vec_gather_idxdi
> > > > > +                  [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > > +                   (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > > +                   (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > > +                   (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > > >
> > > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > > >                    [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > @@ -23245,6 +23255,29 @@
> > > > >                     (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > > >                     (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > > >
> > > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0
> > > > > "register_operand")
> > > > > +          (unspec:VEC_GATHER_MODE
> > > > > +            [(pc)
> > > > > +             (mem:<ssescalarmode>
> > > > > +               (match_par_dup 6
> > > > > +                 [(match_operand 1 "vsib_address_operand")
> > > > > +                  (match_operand:<VEC_GATHER_IDXSI>
> > > > > +                    2 "register_operand")
> > > > > +                  (match_operand:SI 4 "const1248_operand")
> > > > > +                  (match_operand:SI 3 "const0_operand")]))
> > > > > +             (mem:BLK (scratch))
> > > > > +             (match_operand:<sseintvecmode> 5 "register_operand")]
> > > > One problem of these is that when AVX512[VL] is enabled we get an
> > > > AVX512 mask mode here and while the internal function expansion check
> > > > succeeds (it never checks the mask operand!), RTL expansion fails
> > > > unexpectedly because of this mismatch.
> > > >
> > > > I suppose a more complicated define_mode_attr might do the trick or do I
> > > I don't think define_mode_attr supports conditional selection based on
> > > target feature.
> > > > need to add && !TARGET_AVX512F to these expanders?
> > > There'll be a duplicated definition of mask_loadv8sfv8si for avx512
> > > and no-avx512 versions.
> > Note gather_loadmn can be shared by avx512 and non-avx512 version, we
> > can handle mask mode in the preparation statements based on target
> > feature.
> >
> > > The best way is like maskloadmn to accept different mask modes in its
> > > name, but that may create too much complexity in the middle-end to
> > > choose between avx2 mask_gather_load and avx512 mask_gather_load.
> So when we have -mavx512f only then we can gather V8DF using a QImode
> mask with AVX512 gather but we have to use AVX2 for V4DF gather
> using a V4DImode mask.  With -mavx512vl we use a QImode mask for that
> case as well.
>
> Note the
>
>   operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
>
> is because the define_insns for the gathers somehow expect odd mask
> modes (the vectorizer generates an integer vector mode matching the
> layout of the result vector).  I suppose we could somehow fix that
> but I wanted to change as little as possible for now.
>
> The following seems to "work" in generating AVX512 variants for
> -mavx512f -mavx512vl -mprefer-vector-width=512 and AVX2 variants
> when I drop the -mavx512vl arg:
>
Interesting.
> (define_expand "mask_gather_load<mode><vec_gather_idxdi>"
>   [(match_operand:VEC_GATHER_MODE 0 "register_operand")
>    (match_operand 1 "vsib_address_operand")
>    (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
>    (match_operand:SI 3 "const0_operand")
>    (match_operand:SI 4 "const1248_operand")
>    (match_operand 5 "register_operand")]
>   "TARGET_AVX2 && TARGET_USE_GATHER"
> {
>   if (TARGET_AVX512VL)
>     {
>       rtx scratch = gen_reg_rtx (<MODE>mode);
>       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
>       emit_insn (gen_avx512vl_gatherdi<mode>
>                   (operands[0], scratch,
>                    operands[1], operands[2], operands[5], operands[4]));
>     }
>   else
>     {
>       operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode,
>                                         operands[5]);
>       rtx scratch = gen_reg_rtx (<MODE>mode);
>       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
>       emit_insn (gen_avx2_gatherdi<mode>
>                   (operands[0], scratch,
>                    operands[1], operands[2], operands[5], operands[4]));
>     }
>   DONE;
> })
>
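For reference, a testcase of roughly this shape (a guess, the actual
testcase is not quoted in the thread) exercises the DFmode-data with
SImode-index path; the GIMPLE and assembly quoted below would come from
a loop like it:

/* Hypothetical reconstruction of the testcase, not taken from the mail:
   gather doubles through int indices.  Compiled with -O3 and -mavx2 or
   -mavx512f/-mavx512vl this should be vectorized into .MASK_GATHER_LOAD
   calls with an all-true mask once the IFN path is used.  */
double a[1024];

void
foo (double *b, int *idx)
{
  for (int i = 0; i < 1024; ++i)
    a[i] = b[idx[i]];
}
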
> But I see quite horrible code for an always-true mask.  From GIMPLE
>
>   # ivtmp.17_34 = PHI <ivtmp.17_19(3), 0(2)>
>   vect__4.5_24 = MEM <vector(8) int> [(int *)idx_12(D) + ivtmp.17_34 * 1];
>   vect_patt_28.6_23 = [vec_unpack_lo_expr] vect__4.5_24;
>   vect_patt_28.6_22 = [vec_unpack_hi_expr] vect__4.5_24;
>   vect_patt_27.7_17 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_23, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
>   vect_patt_27.8_16 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_22, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
>   MEM <vector(4) double> [(double *)&a + ivtmp.17_34 * 2] = vect_patt_27.7_17;
>   MEM <vector(4) double> [(double *)&a + 32B + ivtmp.17_34 * 2] = vect_patt_27.8_16;
>   ivtmp.17_19 = ivtmp.17_34 + 32;
>   if (ivtmp.17_19 != 4096)
>
> we end up with
>
>         movl    $-1, %edx
>         xorl    %eax, %eax
>         vxorpd  %xmm1, %xmm1, %xmm1
>         kmovw   %edx, %k1
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vpmovsxdq       (%rsi,%rax), %ymm0
>         vmovdqu (%rsi,%rax), %ymm4
>         vmovapd %ymm1, %ymm3
>         kmovw   %k1, %k2
>         vmovapd %ymm1, %ymm2
>         kmovw   %k1, %k3
>         vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}
>         vextracti128    $0x1, %ymm4, %xmm0
>         vpmovsxdq       %xmm0, %ymm0
>         vgatherqpd      (%rdi,%ymm0,8), %ymm2{%k3}
>         vmovapd %ymm3, a(%rax,%rax)
>         vmovapd %ymm2, a+32(%rax,%rax)
>         addq    $32, %rax
>         cmpq    $4096, %rax
>         jne     .L2
>
> I wonder why we have those extra kmovw here.  I see there's
(define_insn "*avx512f_gathersi<VI48F:mode>"
  [(set (match_operand:VI48F 0 "register_operand" "=&v")
        (unspec:VI48F
          [(match_operand:VI48F 1 "register_operand" "0")
           (match_operand:<avx512fmaskmode> 7 "register_operand" "2")
           (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
             [(unspec:P
                [(match_operand:P 4 "vsib_address_operand" "Tv")
                 (match_operand:<VEC_GATHER_IDXSI> 3 "register_operand" "v")
                 (match_operand:SI 5 "const1248_operand" "n")]
                UNSPEC_VSIBADDR)])]
          UNSPEC_GATHER))
   (clobber (match_scratch:<avx512fmaskmode> 2 "=&Yk"))]

I think it's because of this clobber, and the clobber is there because the
mask register will be set to zero by this instruction.

Cut from the Intel SDM:
----cut first-------
VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword
......The entire mask register will be set to zero by this instruction unless
it triggers an exception.
----cut end------

Similar for the avx2 version.

> extra (pc) patterns for always-true masks(?) like in the AVX2
> case.  But in the end we have to use {%kN} instructions, no?
> We can always provide non-masked gather_load expanders if that's
> better here.
>
> Richard.

--
BR,
Hongtao
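A minimal intrinsics-level sketch of the mask-destroying behavior described
above (hypothetical code, the function name and setup are made up; it uses
the AVX2 form, where the mask lives in a vector register, but the AVX512
k-register form behaves the same way, which is why the assembly above
re-creates the mask with kmovw before every gather):

#include <immintrin.h>

/* Illustrative only: VGATHERQPD consumes its mask operand and leaves it
   zeroed on completion, so a single all-true mask cannot simply be reused
   across gathers at the hardware level.  The compiler therefore copies
   the mask before each gather, which is what the kmovw/vmovapd pairs in
   the quoted assembly correspond to.  */
__m256d
gather_two (const double *base, __m256i idx0, __m256i idx1)
{
  __m256d src = _mm256_setzero_pd ();
  /* All-true mask: the sign bit of every 64-bit lane is set.  */
  __m256d mask = _mm256_castsi256_pd (_mm256_set1_epi64x (-1));
  /* Each gather conceptually destroys the mask register it was given,
     so a fresh copy is materialized for each one.  */
  __m256d g0 = _mm256_mask_i64gather_pd (src, base, idx0, mask, 8);
  __m256d g1 = _mm256_mask_i64gather_pd (src, base, idx1, mask, 8);
  return _mm256_add_pd (g0, g1);
}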