> On Mon, 1 Oct 2018, Richard Biener wrote:
> 
> > 
> > I notice that for Zen we create
> > 
> >   0.00 │       vhaddp %ymm3,%ymm3,%ymm3
> >   1.41 │       vperm2 $0x1,%ymm3,%ymm3,%ymm1
> >   1.45 │       vaddpd %ymm1,%ymm2,%ymm2
> > 
> > from reduc_plus_scal_v4df which uses a cross-lane permute vperm2f128
> > even though the upper half of the result is unused in the end
> > (we only use the single-precision element zero).  Much better would
> > be to use vextractf128 which is well-pipelined and has good throughput
> > (though using vhaddp in itself is quite bad for Zen I didn't try
> > benchmarking it against open-coding that yet, aka disabling the
> > expander).
> 
> So here's an actual patch that I benchmarked and compared against
> the sequence generated by ICC when tuning for core-avx.  ICC seems
> to avoid haddpd for both V4DF and V2DF in favor of using
> vunpckhpd plus an add so this is what I do now.  Zen is also happy
> with that in addition to the lower overall latency and higher
> throughput "on the paper" (consulting Agners tables).
> 
> The disadvantage for the V2DF case is that it now needs two
> registers compared to just one, the advantage is that we can
> enable it for SSE2 where we previously got psrldq + addpd
> and now unpckhpd + addpd (psrldq has similar latency but eventually the
> wrong non-FP "domain"?).
> 
> It speeds up 482.sphinx3 on Zen with -mprefer-avx256 by ~7% (with
> -mprefer-avx128 the difference is in the noise).  The major issue
> with the old sequence seems to be throughput of hadd and vperm2f128
> with the two parallel reductions 482.sphinx3 performs resulting in
> even worse latency while with the new sequence two reduction sequences
> can be fully carried out in parallel (on Zen).
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> OK for trunk?
> 
> The v8sf/v4sf reduction sequences and ix86_expand_reduc probably
> benefit from similar treatment (ix86_expand_reduc from first reducing
> %zmm -> %ymm -> %xmm).
> 
> Thanks,
> Richard.
> 
> 2018-10-02  Richard Biener  <rguent...@suse.de>
> 
>       * config/i386/sse.md (reduc_plus_scal_v4df): Avoid the use
>       of haddv4df, first reduce to SSE width and exploit the fact
>       that we only need element zero with the reduction result.
>       (reduc_plus_scal_v2df): Likewise.

OK,
thanks!
Honza
> 
> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md    (revision 264758)
> +++ gcc/config/i386/sse.md    (working copy)
> @@ -2473,24 +2473,28 @@ (define_expand "reduc_plus_scal_v4df"
>     (match_operand:V4DF 1 "register_operand")]
>    "TARGET_AVX"
>  {
> -  rtx tmp = gen_reg_rtx (V4DFmode);
> -  rtx tmp2 = gen_reg_rtx (V4DFmode);
> -  rtx vec_res = gen_reg_rtx (V4DFmode);
> -  emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
> -  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
> -  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
> +  rtx tmp = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
> +  rtx tmp3 = gen_reg_rtx (V2DFmode);
> +  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
> +  emit_insn (gen_adddf3 (operands[0],
> +                      gen_lowpart (DFmode, tmp2),
> +                      gen_lowpart (DFmode, tmp3)));
>    DONE;
>  })
>  
>  (define_expand "reduc_plus_scal_v2df"
>    [(match_operand:DF 0 "register_operand")
>     (match_operand:V2DF 1 "register_operand")]
> -  "TARGET_SSE3"
> +  "TARGET_SSE2"
>  {
>    rtx tmp = gen_reg_rtx (V2DFmode);
> -  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
> -  emit_insn (gen_vec_extractv2dfdf (operands[0], tmp, const0_rtx));
> +  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
> +  emit_insn (gen_adddf3 (operands[0],
> +                      gen_lowpart (DFmode, tmp),
> +                      gen_lowpart (DFmode, operands[1])));
>    DONE;
>  })
>  

Reply via email to