On Tue, Oct 9, 2018 at 3:28 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Mon, 8 Oct 2018, Richard Biener wrote:
>
> > On Fri, 5 Oct 2018, Uros Bizjak wrote:
> >
> > > On Thu, Oct 4, 2018 at 2:05 PM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > This tries to apply the same trick to sminmax reduction patterns
> > > > as for the reduc_plus_scal ones, namely reduce %zmm -> %ymm -> %xmm
> > > > first.  On a microbenchmark this improves performance on Zen
> > > > by ~30% for AVX2 and on Skylake-SP by ~10% for AVX512 (for AVX2
> > > > there's no measurable difference).
> > > >
> > > > I guess I'm mostly looking for feedback on the approach I took
> > > > in not rewriting ix86_expand_reduc but instead "recursing" on the
> > > > expanders, as well as the need to define recursion stops for
> > > > SSE modes previously not covered.
> > > >
> > > > I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu
> > > > later.
> > > >
> > > > Any comments so far?  Writing .md patterns is new for me ;)
> > >
> > > LGTM for the implementation.
> >
> > So this applies the method to all AVX2/AVX512 reduc_*_scal_* patterns
> > we have.  I've inspected the assembly and it looks as expected.
> >
> > Note I tried relying on the vectorizer to perform this, thus deleting
> > the patterns.  Trying that for the reduc_plus_* ones reveals that
> > the final (SSE width) reduction step looks different and
> > unconditionally uses psrldq, as the vectorizer expands the final
> > reduction step using whole-vector shifts.  Well, it actually
> > generates permutes like
> >
> >   vect_res_6.10_22 = VEC_PERM_EXPR <_21, { 0.0, 0.0, 0.0, 0.0 }, { 2, 3, 4, 5 }>;
> >
> > but those end up using the vec_shr_<mode> expander, which puns
> > everything to V1TImode.  Removing the vec_shr_<mode> expander
> > also doesn't get us the desired movhlps instruction for the
> > above (but a vshufps).  I'm not sure where to best fix that
> > (I think with good vec_perm expanders we shouldn't need the
> > vec_shr one - at least we'd keep the float/int domain), so I
> > kept all the reduc_* expanders we have now.  But I'll note that
> > those we do not have (luckily all non-FP) use those
> > "inefficient"(?) instructions.  Like on Zen, psrldq has a reciprocal
> > throughput of 1 while movhlps has one of 0.5, so using movhlps
> > would help loops with two reductions; on Skylake the opposite
> > seems to be the case...
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Went fine after fixing ...
>
> > @@ -2585,9 +2556,14 @@ (define_expand "reduc_<code>_scal_<mode>
> >      (match_operand:VI_256 1 "register_operand"))]
> >    "TARGET_AVX2"
> >  {
> > -  rtx tmp = gen_reg_rtx (<MODE>mode);
> > -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> > -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> > +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> > +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> > +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> > +  emit_insn (gen_<code><ssehalfvecmodelower>3
> > +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> > +  rtx tmp3 = gen_reg_rtx (<MODE>mode);
>                              ^^^
> > +  ix86_expand_reduc (gen_<code><ssehalfvecmodelower>3, tmp3, tmp2);
> > +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp3,
>                                 ^^^
>
> to use the proper mode (fixed patch re-attached below).
>
> OK?
>
> Richard.
>
> 2018-10-08  Richard Biener  <rguent...@suse.de>
>
> 	* config/i386/sse.md (reduc_plus_scal_v8df, reduc_plus_scal_v4df,
> 	reduc_plus_scal_v2df, reduc_plus_scal_v16sf, reduc_plus_scal_v8sf,
> 	reduc_plus_scal_v4sf): Merge into pattern reducing to half width
> 	and recursing and pattern terminating the recursion on SSE
> 	vector width using ix86_expand_reduc.
> 	(reduc_sminmax_scal_<mode>): Split into part reducing to half
> 	width and recursing and SSE2 vector variant doing the final
> 	reduction with ix86_expand_reduc.
> 	(reduc_uminmax_scal_<mode>): Likewise for the AVX512 variants
> 	with terminating the recursion at AVX level, splitting that
> 	to SSE there.
OK.

Thanks,
Uros.

> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 692959b1666..9fc5819a863 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2457,98 +2457,65 @@
>     (set_attr "prefix_rep" "1,*")
>     (set_attr "mode" "V4SF")])
>
> -(define_expand "reduc_plus_scal_v8df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V8DF 1 "register_operand")]
> -  "TARGET_AVX512F"
> -{
> -  rtx tmp = gen_reg_rtx (V8DFmode);
> -  ix86_expand_reduc (gen_addv8df3, tmp, operands[1]);
> -  emit_insn (gen_vec_extractv8dfdf (operands[0], tmp, const0_rtx));
> -  DONE;
> -})
> +(define_mode_iterator REDUC_SSE_PLUS_MODE
> +  [(V2DF "TARGET_SSE") (V4SF "TARGET_SSE")])
>
> -(define_expand "reduc_plus_scal_v4df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V4DF 1 "register_operand")]
> -  "TARGET_AVX"
> +(define_expand "reduc_plus_scal_<mode>"
> +  [(plus:REDUC_SSE_PLUS_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_PLUS_MODE 1 "register_operand"))]
> +  ""
>  {
> -  rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
> -  rtx tmp2 = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
> -  rtx tmp3 = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
> -  emit_insn (gen_adddf3 (operands[0],
> -			 gen_lowpart (DFmode, tmp2),
> -			 gen_lowpart (DFmode, tmp3)));
> -  DONE;
> -})
> -
> -(define_expand "reduc_plus_scal_v2df"
> -  [(match_operand:DF 0 "register_operand")
> -   (match_operand:V2DF 1 "register_operand")]
> -  "TARGET_SSE2"
> -{
> -  rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
> -  emit_insn (gen_adddf3 (operands[0],
> -			 gen_lowpart (DFmode, tmp),
> -			 gen_lowpart (DFmode, operands[1])));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_add<mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +							const0_rtx));
>    DONE;
>  })
>
> -(define_expand "reduc_plus_scal_v16sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V16SF 1 "register_operand")]
> -  "TARGET_AVX512F"
> -{
> -  rtx tmp = gen_reg_rtx (V16SFmode);
> -  ix86_expand_reduc (gen_addv16sf3, tmp, operands[1]);
> -  emit_insn (gen_vec_extractv16sfsf (operands[0], tmp, const0_rtx));
> +(define_mode_iterator REDUC_PLUS_MODE
> +  [(V4DF "TARGET_AVX") (V8SF "TARGET_AVX")
> +   (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
> +
> +(define_expand "reduc_plus_scal_<mode>"
> +  [(plus:REDUC_PLUS_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_PLUS_MODE 1 "register_operand"))]
> +  ""
> +{
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_add<ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_plus_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> -(define_expand "reduc_plus_scal_v8sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V8SF 1 "register_operand")]
> -  "TARGET_AVX"
> -{
> -  rtx tmp = gen_reg_rtx (V8SFmode);
> -  rtx tmp2 = gen_reg_rtx (V8SFmode);
> -  rtx vec_res = gen_reg_rtx (V8SFmode);
> -  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
> -  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
> -  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
> -  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
> -  DONE;
> -})
> +;; Modes handled by reduc_sm{in,ax}* patterns.
> +(define_mode_iterator REDUC_SSE_SMINMAX_MODE
> +  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
> +   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
> +   (V16QI "TARGET_SSE")])
>
> -(define_expand "reduc_plus_scal_v4sf"
> -  [(match_operand:SF 0 "register_operand")
> -   (match_operand:V4SF 1 "register_operand")]
> -  "TARGET_SSE"
> +(define_expand "reduc_<code>_scal_<mode>"
> +  [(smaxmin:REDUC_SSE_SMINMAX_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
> +  ""
>  {
> -  rtx vec_res = gen_reg_rtx (V4SFmode);
> -  if (TARGET_SSE3)
> -    {
> -      rtx tmp = gen_reg_rtx (V4SFmode);
> -      emit_insn (gen_sse3_haddv4sf3 (tmp, operands[1], operands[1]));
> -      emit_insn (gen_sse3_haddv4sf3 (vec_res, tmp, tmp));
> -    }
> -  else
> -    ix86_expand_reduc (gen_addv4sf3, vec_res, operands[1]);
> -  emit_insn (gen_vec_extractv4sfsf (operands[0], vec_res, const0_rtx));
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +							const0_rtx));
>    DONE;
>  })
>
> -;; Modes handled by reduc_sm{in,ax}* patterns.
>  (define_mode_iterator REDUC_SMINMAX_MODE
>    [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
>     (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
>     (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
> -   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
> +   (V64QI "TARGET_AVX512BW")
>     (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
>     (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
>     (V8DF "TARGET_AVX512F")])
> @@ -2559,10 +2526,12 @@
>      (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
>    ""
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> @@ -2572,10 +2541,12 @@
>      (match_operand:VI_AVX512BW 1 "register_operand"))]
>    "TARGET_AVX512F"
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>
> @@ -2585,10 +2556,15 @@
>      (match_operand:VI_256 1 "register_operand"))]
>    "TARGET_AVX2"
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	      (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  rtx tmp3 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  ix86_expand_reduc (gen_<code><ssehalfvecmodelower>3, tmp3, tmp2);
> +  emit_insn (gen_vec_extract<ssehalfvecmodelower><ssescalarmodelower>
> +	      (operands[0], tmp3, const0_rtx));
>    DONE;
>  })
>
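For context, here is a minimal sketch (hypothetical; not the microbenchmark measured in this thread) of the kind of reduction loops whose vectorized epilogues go through the reduc_plus_scal_* and reduc_<code>_scal_* expanders above.  With something like -O3 -ffast-math -mavx2 (or -mavx512f at a wider vectorization factor), the final reduction step should now first narrow %ymm/%zmm to %xmm with one vextract plus one full-width op before ix86_expand_reduc finishes within the SSE register:

/* Hypothetical testcases exercising the patched expanders; compile
   with e.g. -O3 -ffast-math -mavx2.  */

float
sum_reduc (const float *a, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i];			/* plus reduction epilogue.  */
  return s;
}

float
max_reduc (const float *a, int n)
{
  float m = a[0];
  for (int i = 1; i < n; i++)
    m = m > a[i] ? m : a[i];	/* smax reduction epilogue.  */
  return m;
}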