Re: [PATCH] move x86 to use gather/scatter internal functions

Richard Biener via Gcc-patches Wed, 18 Aug 2021 06:43:58 -0700

On Wed, 18 Aug 2021, Hongtao Liu wrote:

> On Wed, Aug 18, 2021 at 7:37 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Wed, Aug 18, 2021 at 7:30 PM Hongtao Liu <crazy...@gmail.com> wrote:
> > >
> > > On Wed, Aug 18, 2021 at 6:28 PM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > On Wed, 18 Aug 2021, Richard Biener wrote:
> > > >
> > > > >
> > > > > So in the end I seem to be able to combine AVX & AVX512 arriving
> > > > > at the following which passes basic testing.  I will now see to
> > > > > teach the vectorizer the required "promotion" to handle
> > > > > mask_gather_loadv4dfv4si and mask_gather_loadv4sfv4di.
> > > > >
> > > > > Meanwhile, do you see any hole in the below?  If not I'll
> > > > > do mask_scatter_store accordingly (that will be simpler since
> > > > > AVX doesn't have scatter).
> > > >
> > > > There seems to be one more complication ... we have
> > > >
> > > > (define_expand "avx2_gatherdi<mode>"
> > > >   [(parallel [(set (match_operand:VEC_GATHER_MODE 0 "register_operand")
> > > >                    (unspec:VEC_GATHER_MODE
> > > >                      [(match_operand:<VEC_GATHER_SRCDI> 1
> > > > "register_operand")
> > > >                       (mem:<ssescalarmode>
> > > >                         (match_par_dup 6
> > > >                           [(match_operand 2 "vsib_address_operand")
> > > >                            (match_operand:<VEC_GATHER_IDXDI>
> > > >                               3 "register_operand")
> > > >
> > > > but VEC_GATHER_IDXDI is
> > > >
> > > > (define_mode_attr VEC_GATHER_IDXDI
> > > >                       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> > > >                        (V2DF "V2DI") (V4DF "V4DI") (V8DF "V8DI")
> > > >                        (V4SI "V2DI") (V8SI "V4DI") (V16SI "V8DI")
> > > >                        (V4SF "V2DI") (V8SF "V4DI") (V16SF "V8DI")])
> > > >
> > > > I'd have expected (V4SF "V4DI") for example, or (V8SF "V8DI").
> > > VEX.128 version: For dword indices, the instruction will gather four
> > > single-precision floating-point values. For
> > > qword indices, the instruction will gather two values and zero the
> > > upper 64 bits of the destination.
> > > VEX.256 version: For dword indices, the instruction will gather eight
> > > single-precision floating-point values. For
> > > qword indices, the instruction will gather four values and zero the
> > > upper 128 bits of the destination.
> > >
> > > So, for expander name, it should be v2sfv2di and v4sfv4di for IDXDI
> > > under avx2, and v8sfv8di under avx512.
> > >
> > > ----cut pattern----
> > > (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>_2"
> > >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> > > (unspec:VEC_GATHER_MODE
> > >   [(pc)
> > >    (match_operator:<ssescalarmode> 6 "vsib_mem_operator"
> > >      [(unspec:P
> > > [(match_operand:P 2 "vsib_address_operand" "Tv")
> > > (match_operand:<VEC_GATHER_IDXDI> 3 "register_operand" "x")
> > > (match_operand:SI 5 "const1248_operand" "n")]
> > > UNSPEC_VSIBADDR)])
> > >    (mem:BLK (scratch))
> > >    (match_operand:<VEC_GATHER_SRCDI> 4 "register_operand" "1")]
> > >   UNSPEC_GATHER))
> > >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> > >   "TARGET_AVX2"
> > > {
> > >   if (<VEC_GATHER_MODE:MODE>mode != <VEC_GATHER_SRCDI>mode)
> > >     return "%M2v<sseintprefix>gatherq<ssemodesuffix>\t{%4, %6,
> > > %x0|%x0, %6, %4}";
> > > ----cut end---------------
> > > We are using the trick of the operand modifier %x0 to force print xmm.
> > (define_mode_attr VEC_GATHER_SRCDI
> >       [(V2DI "V2DI") (V4DI "V4DI") (V8DI "V8DI")
> >        (V2DF "V2DF") (V4DF "V4DF") (V8DF "V8DF")
> >        (V4SI "V4SI") (V8SI "V4SI") (V16SI "V8SI")
> >        (V4SF "V4SF") (V8SF "V4SF") (V16SF "V8SF")])
> >
> > (define_insn "*avx2_gathersi<VEC_GATHER_MODE:mode>"
> >   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> > (unspec:VEC_GATHER_MODE
> >   [(match_operand:VEC_GATHER_MODE 2 "register_operand" "0")
> >    (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
> >      [(unspec:P
> > [(match_operand:P 3 "vsib_address_operand" "Tv")
> > (match_operand:<VEC_GATHER_IDXSI> 4 "register_operand" "x")
> > (match_operand:SI 6 "const1248_operand" "n")]
> > UNSPEC_VSIBADDR)])
> >    (mem:BLK (scratch))
> >    (match_operand:VEC_GATHER_MODE 5 "register_operand" "1")]
> >   UNSPEC_GATHER))
> >    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
> >   "TARGET_AVX2"
> >   "%M3v<sseintprefix>gatherd<ssemodesuffix>\t{%1, %7, %0|%0, %7, %1}"
> >
> > Or only print operands[1] which has mode VEC_GATHER_SRCDI as index.
> 
> Typo should be operands[2] in the below one
> (define_insn "*avx2_gatherdi<VEC_GATHER_MODE:mode>"
>   [(set (match_operand:VEC_GATHER_MODE 0 "register_operand" "=&x")
> (unspec:VEC_GATHER_MODE
>   [(match_operand:<VEC_GATHER_SRCDI> 2 "register_operand" "0")
>    (match_operator:<ssescalarmode> 7 "vsib_mem_operator"
>      [(unspec:P
> [(match_operand:P 3 "vsib_address_operand" "Tv")
> (match_operand:<VEC_GATHER_IDXDI> 4 "register_operand" "x")
> (match_operand:SI 6 "const1248_operand" "n")]
> UNSPEC_VSIBADDR)])
>    (mem:BLK (scratch))
>    (match_operand:<VEC_GATHER_SRCDI> 5 "register_operand" "1")]
>   UNSPEC_GATHER))
>    (clobber (match_scratch:VEC_GATHER_MODE 1 "=&x"))]
>   "TARGET_AVX2"
>   "%M3v<sseintprefix>gatherq<ssemodesuffix>\t{%5, %7, %2|%2, %7, %5}"


It's somewhat of a mess.  When we expose

  mask_gather_loadv8sfv8di

for example then the vectorizer will with -mavx512f generate a
vector mask for v8sf which will be a AVX2 style mask.  But in
the end this will use gen_avx512f_gatherdiv16sf and require a
AVX512 style mask.

So I'm not sure whether mask_gather_load is a good fit for x86
for the instruction variants where the index and data vector
have different size.  For mask_gather_loadv8sfv8di we could
instead have mask_gather_load_lowv16sfv8di and for
mask_gather_loadv8dfv8si we'd have mask_gather_load_idxlov8dfv16si?

Or simply do not constrain the index/data vectors and document
mask_gather_load to load as many elements as min(data,index)
vector elements ... (curiously they are currently only losely
constrained - seemingly allowing a larger index vector at least).

For the builtins we have eye candy like

  def_builtin_pure (OPTION_MASK_ISA_AVX512F, 0, "__builtin_ia32_gatherdiv16sf",
                    V8SF_FTYPE_V8SF_PCVOID_V8DI_QI_INT,
                    IX86_BUILTIN_GATHER3DIV16SF);

note the DIV16SF name but the V8SF/V8DI types.  But there's also

  def_builtin_pure (OPTION_MASK_ISA_AVX512F, 0, 
"__builtin_ia32_gather3altdiv16sf ",
                    V16SF_FTYPE_V16SF_PCFLOAT_V8DI_HI_INT,
                    IX86_BUILTIN_GATHER3ALTDIV16SF);

which is what seems to be used by the vectorizer and that magically
knows that only the lower part of the data is populated.

The vectorizer seems to key the cases on

  if (known_eq (nunits, gather_off_nunits))
    modifier = NONE;
  else if (known_eq (nunits * 2, gather_off_nunits))
    {
      modifier = WIDEN;
...
    }
  else if (known_eq (nunits, gather_off_nunits * 2))
    {
      modifier = NARROW;

in vect_build_gather_load_calls.  Note that for the IFN case we
rely on pattern recognition to produce a "scalar" gather IFN call
and we do not recover the offset vector type from the machine
description but from the actual vector type of the vector def.

Richard, do you have any advice here?

Thanks,
Richard.

Re: [PATCH] move x86 to use gather/scatter internal functions

Reply via email to