On Mon, 1 Oct 2018, Richard Biener wrote:

> I notice that for Zen we create
>
>   0.00 │  vhaddp %ymm3,%ymm3,%ymm3
>   1.41 │  vperm2 $0x1,%ymm3,%ymm3,%ymm1
>   1.45 │  vaddpd %ymm1,%ymm2,%ymm2
>
> from reduc_plus_scal_v4df which uses a cross-lane permute vperm2f128
> even though the upper half of the result is unused in the end
> (we only use the single-precision element zero).  Much better would
> be to use vextractf128 which is well-pipelined and has good throughput
> (though using vhaddp in itself is quite bad for Zen I didn't try
> benchmarking it against open-coding that yet, aka disabling the
> expander).
So here's an actual patch that I benchmarked and compared against the
sequence generated by ICC when tuning for core-avx.  ICC seems to avoid
haddpd for both V4DF and V2DF in favor of using vunpckhpd plus an add,
so this is what I do now.  Zen is also happy with that, in addition to
the lower overall latency and higher throughput "on paper" (consulting
Agner's tables).

The disadvantage for the V2DF case is that it now needs two registers
compared to just one; the advantage is that we can enable it for SSE2,
where we previously got psrldq + addpd and now get unpckhpd + addpd
(psrldq has similar latency but eventually the wrong non-FP "domain"?).

It speeds up 482.sphinx3 on Zen with -mprefer-avx256 by ~7% (with
-mprefer-avx128 the difference is in the noise).  The major issue with
the old sequence seems to be the throughput of hadd and vperm2f128
combined with the two parallel reductions 482.sphinx3 performs,
resulting in even worse latency, while with the new sequence two
reduction sequences can be fully carried out in parallel (on Zen).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.  OK for trunk?

The v8sf/v4sf reduction sequences and ix86_expand_reduc probably benefit
from similar treatment (ix86_expand_reduc from first reducing
%zmm -> %ymm -> %xmm).

Thanks,
Richard.

2018-10-02  Richard Biener  <rguent...@suse.de>

	* config/i386/sse.md (reduc_plus_scal_v4df): Avoid the use
	of haddv4df, first reduce to SSE width and exploit the fact
	that we only need element zero with the reduction result.
	(reduc_plus_scal_v2df): Likewise.
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 264758)
+++ gcc/config/i386/sse.md	(working copy)
@@ -2473,24 +2473,28 @@ (define_expand "reduc_plus_scal_v4df"
   [(match_operand:DF 0 "register_operand")
    (match_operand:V4DF 1 "register_operand")]
   "TARGET_AVX"
 {
-  rtx tmp = gen_reg_rtx (V4DFmode);
-  rtx tmp2 = gen_reg_rtx (V4DFmode);
-  rtx vec_res = gen_reg_rtx (V4DFmode);
-  emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
-  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
+  rtx tmp = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
+  rtx tmp2 = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
+  rtx tmp3 = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
+  emit_insn (gen_adddf3 (operands[0],
+			 gen_lowpart (DFmode, tmp2),
+			 gen_lowpart (DFmode, tmp3)));
   DONE;
 })
 
 (define_expand "reduc_plus_scal_v2df"
   [(match_operand:DF 0 "register_operand")
    (match_operand:V2DF 1 "register_operand")]
-  "TARGET_SSE3"
+  "TARGET_SSE2"
 {
   rtx tmp = gen_reg_rtx (V2DFmode);
-  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_vec_extractv2dfdf (operands[0], tmp, const0_rtx));
+  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
+  emit_insn (gen_adddf3 (operands[0],
+			 gen_lowpart (DFmode, tmp),
+			 gen_lowpart (DFmode, operands[1])));
   DONE;
 })