On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 4:32 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Wed, 18 Aug 2021, Hongtao Liu wrote:
> >
> > > On Wed, Aug 18, 2021 at 11:24 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 17, 2021 at 10:43 PM Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Tue, Aug 17, 2021 at 3:29 PM Richard Biener via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > This is an attempt to start moving the x86 backend to use
> > > > > > standard pattern names for [mask_]gather_load and 
> > > > > > [mask_]scatter_store
> > > > > > rather than using the builtin_{gather,scatter} target hooks.
> > > > > >
> > > > > > I've started with AVX2 gathers, and since x86 only supports masked
> > > > > > gathers I have only implemented mask_gather_load.  Note that while
> > > > > > for the builtin_gather case the vectorizer will provide an all-true
> > > > > > mask operand for non-masked gathers, this capability does not yet
> > > > > > exist for the IFN path, so only testcases with actual masked
> > > > > > gathers will work.
> > > > > >
> > > > > > If this looks reasonable on the backend side I'll see to completing
> > > > > > the vectorizer part first, ripping out the target hook and
> > > > > > arranging for the missing pieces.  Another missing piece is support
> > > > > > for SImode indices with DFmode data, which requires unpacking
> > > > > > the index vector and actually recognizing the IFN.
> > > > > >
> > > > > > 2021-08-17  Richard Biener  <rguent...@suse.de>
> > > > > >
> > > > > >         * tree-vect-data-refs.c (vect_check_gather_scatter):
> > > > > >         Always use internal functions.
> > > > > >         * config/i386/sse.md
> > > > > >         (mask_gather_load<mode><vec_gather_idxsi>): New expander.
> > > > > >         (mask_gather_load<mode><vec_gather_idxdi>): Likewise.
> > > > > > ---
> > > > > >  gcc/config/i386/sse.md    | 56 +++++++++++++++++++++++++++++++++++++++
> > > > > >  gcc/tree-vect-data-refs.c |  4 +--
> > > > > >  2 files changed, 57 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > > > > index 3957c86c3df..40bec98d9f7 100644
> > > > > > --- a/gcc/config/i386/sse.md
> > > > > > +++ b/gcc/config/i386/sse.md
> > > > > > @@ -23232,12 +23232,22 @@
> > > > > >                        (V2DF "V4SI") (V4DF "V4SI") (V8DF "V8SI")
> > > > > >                        (V4SI "V4SI") (V8SI "V8SI") (V16SI "V16SI")
> > > > > >                        (V4SF "V4SI") (V8SF "V8SI") (V16SF "V16SI")])
> > > > > > +(define_mode_attr vec_gather_idxsi
> > > > > > +                     [(V2DI "v4si") (V4DI "v4si") (V8DI "v8si")
> > > > > > +                      (V2DF "v4si") (V4DF "v4si") (V8DF "v8si")
> > > > > > +                      (V4SI "v4si") (V8SI "v8si") (V16SI "v16si")
> > > > > > +                      (V4SF "v4si") (V8SF "v8si") (V16SF "v16si")])
> > > > > >
> > > > > >  (define_mode_attr VEC_GATHER_IDXDI
> > > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > > > > +(define_mode_attr vec_gather_idxdi
> > > > > > +                     [(V2DI "v2di") (V4DI "v4di") (V8DI "v8di")
> > > > > > +                      (V2DF "v2di") (V4DF "v4di") (V8DF "v8di")
> > > > > > +                      (V4SI "v2di") (V8SI "v4di") (V16SI "v8di")
> > > > > > +                      (V4SF "v2di") (V8SF "v4di") (V16SF "v8di")])
> > > > > >
> > > > > >  (define_mode_attr VEC_GATHER_SRCDI
> > > > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > > > > @@ -23245,6 +23255,29 @@
> > > > > >                        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> > > > > >                        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> > > > > >
> > > > > > +(define_expand "mask_gather_load<mode><vec_gather_idxsi>"
> > > > > > +  [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > > > > +                  (unspec:VEC_GATHER_MODE
> > > > > > +                    [(pc)
> > > > > > +                     (mem:<ssescalarmode>
> > > > > > +                       (match_par_dup 6
> > > > > > +                         [(match_operand 1 "vsib_address_operand")
> > > > > > +                          (match_operand:<VEC_GATHER_IDXSI>
> > > > > > +                             2 "register_operand")
> > > > > > +                          (match_operand:SI 4 "const1248_operand")
> > > > > > +                          (match_operand:SI 3 "const0_operand")]))
> > > > > > +                     (mem:BLK (scratch))
> > > > > > +                     (match_operand:<sseintvecmode> 5 "register_operand")]
> > > > >
> > > > > One problem with these is that when AVX512[VL] is enabled we get an
> > > > > AVX512 mask mode here, and while the internal function expansion
> > > > > check succeeds (it never checks the mask operand!), RTL expansion
> > > > > fails unexpectedly because of this mismatch.
> > > > >
> > > > > I suppose a more complicated define_mode_attr might do the trick or 
> > > > > do I
> > > > I don't think define_mode_attr supports conditional selection based on
> > > > target feature.
> > > > > need to add && !TARGET_AVX512F to these expanders?
> > > > There'll be a duplicated definition of mask_gather_loadv8sfv8si for the
> > > > avx512 and non-avx512 versions.
> > > Note gather_loadmn can be shared by the avx512 and non-avx512 versions;
> > > we can handle the mask mode in the preparation statements based on the
> > > target feature.
> > > > >
> > > > The best way would be, like maskloadmn, to accept different mask modes
> > > > in the pattern name, but that may create too much complexity in the
> > > > middle end when choosing between the avx2 and avx512 mask_gather_load.
> >
> > So when we have -mavx512f only, we can gather V8DF using a QImode
> > mask with the AVX512 gather, but we have to use AVX2 for a V4DF gather
> > using a V4DImode mask.  With -mavx512vl we use a QImode mask for that
> > case as well.
> >
> > Note the
> >
> >   operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode, operands[5]);
> >
> > is because the define_insns for the gathers somehow expect odd mask
> > modes (the vectorizer generates an integer vector mode matching the
> > layout of the result vector).  I suppose we could somehow fix that
> > but I wanted to change as little as possible for now.
> >
> > The following seems to "work" in generating AVX512 variants for
> > -mavx512f -mavx512vl -mprefer-vector-width=512 and AVX2 variants
> > when I drop the -mavx512vl arg:
> >
> Interesting.
> > (define_expand "mask_gather_load<mode><vec_gather_idxdi>"
> >   [(match_operand:VEC_GATHER_MODE 0 "register_operand")
> >    (match_operand 1 "vsib_address_operand")
> >    (match_operand:<VEC_GATHER_IDXDI> 2 "register_operand")
> >    (match_operand:SI 3 "const0_operand")
> >    (match_operand:SI 4 "const1248_operand")
> >    (match_operand 5 "register_operand")]
> >   "TARGET_AVX2 && TARGET_USE_GATHER"
> > {
> >   if (TARGET_AVX512VL)
> >     {
> >       rtx scratch = gen_reg_rtx (<MODE>mode);
> >       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> >       emit_insn (gen_avx512vl_gatherdi<mode>
> >         (operands[0], scratch,
> >          operands[1], operands[2], operands[5], operands[4]));
> >     }
> >   else
> >     {
> >       operands[5] = gen_lowpart_SUBREG (<VEC_GATHER_SRCDI>mode,
> >                                         operands[5]);
> >       rtx scratch = gen_reg_rtx (<MODE>mode);
> >       emit_move_insn (scratch, CONST0_RTX(<MODE>mode));
> >       emit_insn (gen_avx2_gatherdi<mode>
> >         (operands[0], scratch,
> >          operands[1], operands[2], operands[5], operands[4]));
> >     }
> >   DONE;
> > })
> >
> > But I see quite horrible code for an always-true mask.  From GIMPLE
> >
> >   # ivtmp.17_34 = PHI <ivtmp.17_19(3), 0(2)>
> >   vect__4.5_24 = MEM <vector(8) int> [(int *)idx_12(D) + ivtmp.17_34 * 1];
> >   vect_patt_28.6_23 = [vec_unpack_lo_expr] vect__4.5_24;
> >   vect_patt_28.6_22 = [vec_unpack_hi_expr] vect__4.5_24;
> >   vect_patt_27.7_17 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_23, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
> >   vect_patt_27.8_16 = .MASK_GATHER_LOAD (_18, vect_patt_28.6_22, 8, { 0.0, 0.0, 0.0, 0.0 }, { -1, -1, -1, -1 });
> >   MEM <vector(4) double> [(double *)&a + ivtmp.17_34 * 2] = vect_patt_27.7_17;
> >   MEM <vector(4) double> [(double *)&a + 32B + ivtmp.17_34 * 2] = vect_patt_27.8_16;
> >   ivtmp.17_19 = ivtmp.17_34 + 32;
> >   if (ivtmp.17_19 != 4096)
> >
> > we end up with
> >
> >         movl    $-1, %edx
> >         xorl    %eax, %eax
> >         vxorpd  %xmm1, %xmm1, %xmm1
> >         kmovw   %edx, %k1
> >         .p2align 4,,10
> >         .p2align 3
> > .L2:
> >         vpmovsxdq       (%rsi,%rax), %ymm0
> >         vmovdqu (%rsi,%rax), %ymm4
> >         vmovapd %ymm1, %ymm3
> >         kmovw   %k1, %k2
> >         vmovapd %ymm1, %ymm2
> >         kmovw   %k1, %k3
> >         vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}
> >         vextracti128    $0x1, %ymm4, %xmm0
> >         vpmovsxdq       %xmm0, %ymm0
> >         vgatherqpd      (%rdi,%ymm0,8), %ymm2{%k3}
> >         vmovapd %ymm3, a(%rax,%rax)
> >         vmovapd %ymm2, a+32(%rax,%rax)
> >         addq    $32, %rax
> >         cmpq    $4096, %rax
> >         jne     .L2
> >
> > I wonder why we have those extra kmovw here.  I see there's
> 
> (define_insn "*avx512f_gathersi<VI48F:mode>"
>   [(set (match_operand:VI48F 0 "register_operand" "=&v")
>         (unspec:VI48F
>           [(match_operand:VI48F 1 "register_operand" "0")
>            (match_operand:<avx512fmaskmode> 7 "register_operand" "2")
>            (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
>              [(unspec:P
>                 [(match_operand:P 4 "vsib_address_operand" "Tv")
>                  (match_operand:<VEC_GATHER_IDXSI> 3 "register_operand" "v")
>                  (match_operand:SI 5 "const1248_operand" "n")]
>                 UNSPEC_VSIBADDR)])]
>           UNSPEC_GATHER))
>    (clobber (match_scratch:<avx512fmaskmode> 2 "=&Yk"))]
> I think it's because of this clobber, and the clobber is here because
> the mask register will be set to zero by this instruction.
> Cut from the Intel SDM:
> ----cut first-------
> VGATHERDPS/VGATHERDPD—Gather Packed Single, Packed Double with Signed Dword
> ......The entire mask register will be set to zero by this instruction
> unless it triggers an exception.
> ----cut end------
> Similar for the avx2 version.

Oh, how (not) useful ...  I verified that the same code is generated
when using the old builtins path btw, so while it looks inefficient
it doesn't look like it will be a regression.  Maybe a
kxnorw %k2, %k2, %k2 would be more efficient than keeping a mask
register around just to copy the all-ones value from.  Not sure how
we'd arrange for that though.
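
For reference, something like the following is what I have in mind for
the all-true mask case (just a sketch, the register choice is made up
and I haven't thought about where the expander or a peephole would
emit it):

        kxnorw  %k2, %k2, %k2
        vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}

instead of the current

        movl    $-1, %edx
        kmovw   %edx, %k1
        ...
        kmovw   %k1, %k2
        vgatherqpd      (%rdi,%ymm0,8), %ymm3{%k2}

kxnor of a mask register with itself yields all-ones, so the mask can
be rematerialized right before each gather and it no longer matters
that the gather clobbers it.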

Richard.
