> On Mon, 1 Oct 2018, Richard Biener wrote:
> 
> > I notice that for Zen we create
> > 
> >    0.00 │ vhaddp %ymm3,%ymm3,%ymm3
> >    1.41 │ vperm2 $0x1,%ymm3,%ymm3,%ymm1
> >    1.45 │ vaddpd %ymm1,%ymm2,%ymm2
> > 
> > from reduc_plus_scal_v4df, which uses a cross-lane permute (vperm2f128)
> > even though the upper half of the result is unused in the end
> > (we only use the double-precision element zero).  Much better would
> > be to use vextractf128, which is well-pipelined and has good throughput
> > (though using vhaddp is in itself quite bad for Zen; I didn't try
> > benchmarking it against open-coding that yet, aka disabling the
> > expander).
> 
> So here's an actual patch that I benchmarked and compared against
> the sequence generated by ICC when tuning for core-avx.  ICC seems
> to avoid haddpd for both V4DF and V2DF in favor of using
> vunpckhpd plus an add, so this is what I do now.  Zen is also happy
> with that, in addition to the lower overall latency and higher
> throughput "on paper" (consulting Agner's tables).
> 
> The disadvantage for the V2DF case is that it now needs two
> registers instead of just one; the advantage is that we can
> enable it for SSE2, where we previously got psrldq + addpd
> and now get unpckhpd + addpd (psrldq has similar latency but
> possibly executes in the wrong, non-FP "domain"?).
> 
> It speeds up 482.sphinx3 on Zen with -mprefer-avx256 by ~7% (with
> -mprefer-avx128 the difference is in the noise).  The major issue
> with the old sequence seems to be the throughput of hadd and
> vperm2f128 with the two parallel reductions 482.sphinx3 performs,
> resulting in even worse latency, while with the new sequence two
> reduction sequences can be carried out fully in parallel (on Zen).
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> OK for trunk?
> 
> The v8sf/v4sf reduction sequences and ix86_expand_reduc would
> probably benefit from similar treatment (ix86_expand_reduc by
> first reducing %zmm -> %ymm -> %xmm).
> 
> Thanks,
> Richard.
> 
> 2018-10-02  Richard Biener  <rguent...@suse.de>
> 
> 	* config/i386/sse.md (reduc_plus_scal_v4df): Avoid the use
> 	of haddv4df, first reduce to SSE width and exploit the fact
> 	that we only need element zero with the reduction result.
> 	(reduc_plus_scal_v2df): Likewise.
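For illustration, the old and new reduc_plus_scal_v4df expansions compute
roughly the following (a C intrinsics sketch, not part of the patch; the
function names are made up, and the comments assume the compiler picks the
obvious encodings):

#include <immintrin.h>

/* Old expansion: vhaddpd, cross-lane vperm2f128, vaddpd.  */
static double
reduc_plus_v4df_old (__m256d v)                   /* v = {a, b, c, d}     */
{
  __m256d t  = _mm256_hadd_pd (v, v);             /* {a+b, a+b, c+d, c+d} */
  __m256d t2 = _mm256_permute2f128_pd (t, t, 1);  /* swap 128-bit lanes   */
  __m256d r  = _mm256_add_pd (t, t2);             /* all lanes a+b+c+d    */
  return _mm256_cvtsd_f64 (r);                    /* element zero         */
}

/* New expansion: reduce to SSE width first, then vunpckhpd + vaddsd.  */
static double
reduc_plus_v4df_new (__m256d v)
{
  __m128d hi  = _mm256_extractf128_pd (v, 1);     /* vextractf128: {c, d} */
  __m128d lo  = _mm256_castpd256_pd128 (v);       /* no insn: {a, b}      */
  __m128d sum = _mm_add_pd (lo, hi);              /* {a+c, b+d}           */
  __m128d swp = _mm_unpackhi_pd (sum, sum);       /* {b+d, b+d}           */
  return _mm_cvtsd_f64 (_mm_add_sd (sum, swp));   /* a+c+b+d              */
}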
OK, thanks!
Honza

> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md	(revision 264758)
> +++ gcc/config/i386/sse.md	(working copy)
> @@ -2473,24 +2473,28 @@ (define_expand "reduc_plus_scal_v4df"
>     (match_operand:V4DF 1 "register_operand")]
>    "TARGET_AVX"
>  {
> -  rtx tmp = gen_reg_rtx (V4DFmode);
> -  rtx tmp2 = gen_reg_rtx (V4DFmode);
> -  rtx vec_res = gen_reg_rtx (V4DFmode);
> -  emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
> -  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
> -  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
> +  rtx tmp = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
> +  rtx tmp3 = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
> +  emit_insn (gen_adddf3 (operands[0],
> +			 gen_lowpart (DFmode, tmp2),
> +			 gen_lowpart (DFmode, tmp3)));
>    DONE;
>  })
>  
>  (define_expand "reduc_plus_scal_v2df"
>    [(match_operand:DF 0 "register_operand")
>     (match_operand:V2DF 1 "register_operand")]
> -  "TARGET_SSE3"
> +  "TARGET_SSE2"
>  {
>    rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_vec_extractv2dfdf (operands[0], tmp, const0_rtx));
> +  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
> +  emit_insn (gen_adddf3 (operands[0],
> +			 gen_lowpart (DFmode, tmp),
> +			 gen_lowpart (DFmode, operands[1])));
>    DONE;
>  })
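The approved TARGET_SSE2 reduc_plus_scal_v2df expansion similarly boils
down to the following (again an illustrative C sketch with a made-up
function name, not the patch itself):

#include <emmintrin.h>

/* unpckhpd brings the high element down, then a scalar add.  */
static double
reduc_plus_v2df (__m128d v)                   /* v = {a, b} */
{
  __m128d hi = _mm_unpackhi_pd (v, v);        /* {b, b}     */
  return _mm_cvtsd_f64 (_mm_add_sd (v, hi));  /* a + b      */
}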