On Tue, Oct 9, 2018 at 3:28 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Mon, 8 Oct 2018, Richard Biener wrote:
>
> > On Fri, 5 Oct 2018, Uros Bizjak wrote:
> >
> > > On Thu, Oct 4, 2018 at 2:05 PM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > This tries to apply the same trick to sminmax reduction patterns
> > > > as for the reduc_plus_scal ones, namely reduce %zmm -> %ymm -> %xmm
> > > > first.  On a microbenchmark this improves performance on Zen
> > > > by ~30% for AVX2 and on Skylake-SP by ~10% for AVX512 (for AVX2
> > > > there's no measurable difference).
> > > >
> > > > I guess I'm mostly looking for feedback on the approach I took
> > > > in not rewriting ix86_expand_reduc but instead "recursing" on the
> > > > expanders, as well as the need to define recursion stops for
> > > > SSE modes previously not covered.
> > > >
> > > > I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu
> > > > later.
> > > >
> > > > Any comments so far?  Writing .md patterns is new for me ;)
> > >
> > > LGTM for the implementation.
> >
> > So this applies the method to all AVX2/AVX512 reduc_*_scal_* patterns
> > we have.  I've inspected the assembly and it looks as expected.
> >
> > Note I tried relying on the vectorizer to perform this, thus deleting
> > the patterns.  Trying that for the reduc_plus_* ones reveals that
> > the final (SSE width) reduction step looks different and
> > unconditionally uses psrldq, as the vectorizer expands the final
> > reduction step using whole-vector shifts.  Well, it actually
> > generates permutes like
> >
> >   vect_res_6.10_22 = VEC_PERM_EXPR <_21, { 0.0, 0.0, 0.0, 0.0 }, { 2, 3, 4, 5 }>;
> >
> > but those end up using the vec_shr_<mode> expander, which puns
> > everything to V1TImode.  Removing the vec_shr_<mode> expander
> > also doesn't get us the desired movhlps instruction for the
> > above (but a vshufps).  I'm not sure where to best fix that
> > (I think with good vec_perm expanders we shouldn't need the
> > vec_shr one - at least we'd keep the float/int domain), so I
> > kept all the reduc_* expanders we have now.  But I'll note that
> > those we do not have (luckily all non-FP) use those
> > "inefficient"(?) instructions.  Like on Zen, psrldq has a reciprocal
> > throughput of 1 while movhlps has one of 0.5, so using movhlps
> > would help loops with two reductions; on Skylake the opposite
> > seems to be the case...
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Went fine after fixing ...
>
> > @@ -2585,9 +2556,14 @@ (define_expand "reduc_<code>_scal_<mode>
> >      (match_operand:VI_256 1 "register_operand"))]
> >    "TARGET_AVX2"
> >  {
> > -  rtx tmp = gen_reg_rtx (<MODE>mode);
> > -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> > -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> > +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> > +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> > +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> > +  emit_insn (gen_<code><ssehalfvecmodelower>3
> > +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> > +  rtx tmp3 = gen_reg_rtx (<MODE>mode);
>                              ^^^
> > +  ix86_expand_reduc (gen_<code><ssehalfvecmodelower>3, tmp3, tmp2);
> > +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp3,
>                                 ^^^
>
> to use the proper mode (fixed patch re-attached below).
>
> OK?
>
> Richard.
>
> 2018-10-08  Richard Biener  <rguent...@suse.de>
>
> 	* config/i386/sse.md (reduc_plus_scal_v8df, reduc_plus_scal_v4df,
> 	reduc_plus_scal_v2df, reduc_plus_scal_v16sf, reduc_plus_scal_v8sf,
> 	reduc_plus_scal_v4sf): Merge into pattern reducing to half width
> 	and recursing and pattern terminating the recursion on SSE
> 	vector width using ix86_expand_reduc.
> 	(reduc_sminmax_scal_<mode>): Split into part reducing to half
> 	width and recursing and SSE2 vector variant doing the final
> 	reduction with ix86_expand_reduc.
> 	(reduc_uminmax_scal_<mode>): Likewise for the AVX512 variants
> 	with terminating the recursion at AVX level, splitting that
> 	to SSE there.
OK.

Thanks,
Uros.

> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 692959b1666..9fc5819a863 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2457,98 +2457,65 @@
>     (set_attr "prefix_rep" "1,*")
>     (set_attr "mode" "V4SF")])
>
> -(define_expand "reduc_plus_scal_v8df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V8DF 1 "register_operand")]
> -  "TARGET_AVX512F"
> -{
> -  rtx tmp = gen_reg_rtx (V8DFmode);
> -  ix86_expand_reduc (gen_addv8df3, tmp, operands[1]);
> -  emit_insn (gen_vec_extractv8dfdf (operands[0], tmp, const0_rtx));
> -  DONE;
> -})
> +(define_mode_iterator REDUC_SSE_PLUS_MODE
> +  [(V2DF "TARGET_SSE") (V4SF "TARGET_SSE")])
>
> -(define_expand "reduc_plus_scal_v4df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V4DF 1 "register_operand")]
> -  "TARGET_AVX"
> +(define_expand "reduc_plus_scal_<mode>"
> +  [(plus:REDUC_SSE_PLUS_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_PLUS_MODE 1 "register_operand"))]
> +  ""
>  {
> -  rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
> -  rtx tmp2 = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
> -  rtx tmp3 = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
> -  emit_insn (gen_adddf3 (operands[0],
> -			 gen_lowpart (DFmode, tmp2),
> -			 gen_lowpart (DFmode, tmp3)));
> -  DONE;
> -})
> -
> -(define_expand "reduc_plus_scal_v2df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V2DF 1 "register_operand")]
> -  "TARGET_SSE2"
> -{
> -  rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
> -  emit_insn (gen_adddf3 (operands[0],
> -			 gen_lowpart (DFmode, tmp),
> -			 gen_lowpart (DFmode, operands[1])));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_add<mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +							const0_rtx));
>    DONE;
>  })
>
> -(define_expand "reduc_plus_scal_v16sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V16SF 1 "register_operand")]
> -  "TARGET_AVX512F"
> -{
> -  rtx tmp = gen_reg_rtx (V16SFmode);
> -  ix86_expand_reduc (gen_addv16sf3, tmp, operands[1]);
> -  emit_insn (gen_vec_extractv16sfsf (operands[0], tmp, const0_rtx));
> +(define_mode_iterator REDUC_PLUS_MODE
> +  [(V4DF "TARGET_AVX") (V8SF "TARGET_AVX")
> +   (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> +
> +(define_expand "reduc_plus_scal_<mode>"
> +  [(plus:REDUC_PLUS_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_PLUS_MODE 1 "register_operand"))]
> +  ""
> +{
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_add<ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_plus_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> -(define_expand "reduc_plus_scal_v8sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V8SF 1 "register_operand")]
> -  "TARGET_AVX"
> -{
> -  rtx tmp = gen_reg_rtx (V8SFmode);
> -  rtx tmp2 = gen_reg_rtx (V8SFmode);
> -  rtx vec_res = gen_reg_rtx (V8SFmode);
> -  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
> -  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
> -  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
> -  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
> -  DONE;
> -})
> +;; Modes handled by reduc_sm{in,ax}* patterns.
> +(define_mode_iterator REDUC_SSE_SMINMAX_MODE
> +  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
> +   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
> +   (V16QI "TARGET_SSE")])
>
> -(define_expand "reduc_plus_scal_v4sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V4SF 1 "register_operand")]
> -  "TARGET_SSE"
> +(define_expand "reduc_<code>_scal_<mode>"
> +  [(smaxmin:REDUC_SSE_SMINMAX_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
> +  ""
>  {
> -  rtx vec_res = gen_reg_rtx (V4SFmode);
> -  if (TARGET_SSE3)
> -    {
> -      rtx tmp = gen_reg_rtx (V4SFmode);
> -      emit_insn (gen_sse3_haddv4sf3 (tmp, operands[1], operands[1]));
> -      emit_insn (gen_sse3_haddv4sf3 (vec_res, tmp, tmp));
> -    }
> -  else
> -    ix86_expand_reduc (gen_addv4sf3, vec_res, operands[1]);
> -  emit_insn (gen_vec_extractv4sfsf (operands[0], vec_res, const0_rtx));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +							const0_rtx));
>    DONE;
>  })
>
> -;; Modes handled by reduc_sm{in,ax}* patterns.
>  (define_mode_iterator REDUC_SMINMAX_MODE
>    [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
>     (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
>     (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
> -   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
> +   (V64QI "TARGET_AVX512BW")
>     (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
>     (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
>     (V8DF "TARGET_AVX512F")])
> @@ -2559,10 +2526,12 @@
>      (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
>    ""
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> @@ -2572,10 +2541,12 @@
>      (match_operand:VI_AVX512BW 1 "register_operand"))]
>    "TARGET_AVX512F"
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> @@ -2585,10 +2556,15 @@
>      (match_operand:VI_256 1 "register_operand"))]
>    "TARGET_AVX2"
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  rtx tmp3 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  ix86_expand_reduc (gen_<code><ssehalfvecmodelower>3, tmp3, tmp2);
> +  emit_insn (gen_vec_extract<ssehalfvecmodelower><ssescalarmodelower>
> +	      (operands[0], tmp3, const0_rtx));
>    DONE;
>  })
>
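For context, here is a minimal sketch (hypothetical; not the microbenchmark measured in this thread) of the kind of reduction loops whose vectorized epilogues go through the reduc_plus_scal_* and reduc_<code>_scal_* expanders above.  With something like -O3 -ffast-math -mavx2 (or -mavx512f at a wider vectorization factor), the final reduction step should now first narrow %ymm/%zmm to %xmm with one vextract plus one full-width op before ix86_expand_reduc finishes within the SSE register:

/* Hypothetical testcases exercising the patched expanders; compile
   with e.g. -O3 -ffast-math -mavx2.  */

float
sum_reduc (const float *a, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i];			/* plus reduction epilogue.  */
  return s;
}

float
max_reduc (const float *a, int n)
{
  float m = a[0];
  for (int i = 1; i < n; i++)
    m = m > a[i] ? m : a[i];	/* smax reduction epilogue.  */
  return m;
}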