On Mon, 1 Oct 2018, Richard Biener wrote:

> 
> I notice that for Zen we create
> 
>   0.00 │       vhaddp %ymm3,%ymm3,%ymm3
>   1.41 │       vperm2 $0x1,%ymm3,%ymm3,%ymm1
>   1.45 │       vaddpd %ymm1,%ymm2,%ymm2
> 
> from reduc_plus_scal_v4df, which uses the cross-lane permute vperm2f128
> even though the upper half of the result is unused in the end
> (we only use the double-precision element zero).  Much better would
> be to use vextractf128, which is well-pipelined and has good throughput
> (though using vhaddp is itself quite bad for Zen; I haven't yet
> benchmarked it against open-coding the reduction, aka disabling the
> expander).
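
(For reference, the old expansion corresponds roughly to the
following intrinsics; the mapping is my own sketch, not the literal
RTL the expander emitted:)

  #include <immintrin.h>

  double
  reduc_v4df_old (__m256d x)
  {
    __m256d t = _mm256_hadd_pd (x, x);              /* vhaddpd */
    __m256d u = _mm256_permute2f128_pd (t, t, 1);   /* vperm2f128, cross-lane */
    return _mm256_cvtsd_f64 (_mm256_add_pd (t, u)); /* vaddpd, take element 0 */
  }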

So here's an actual patch that I benchmarked and compared against
the sequence generated by ICC when tuning for core-avx.  ICC seems
to avoid haddpd for both V4DF and V2DF in favor of vunpckhpd plus
an add, so this is what I do now.  Zen is also happy with that, in
addition to the lower overall latency and higher throughput on
paper (consulting Agner's instruction tables).

The disadvantage for the V2DF case is that it now needs two
registers instead of just one; the advantage is that we can
enable it for SSE2, where we previously got psrldq + addpd and
now get unpckhpd + addpd (psrldq has similar latency but possibly
the wrong, non-FP "domain"?).

It speeds up 482.sphinx3 on Zen by ~7% with -mprefer-avx256 (with
-mprefer-avx128 the difference is in the noise).  The major issue
with the old sequence seems to be the throughput of hadd and
vperm2f128: with the two parallel reductions 482.sphinx3 performs,
that turns into even worse latency, while with the new sequence the
two reduction chains can be carried out fully in parallel (on Zen).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

OK for trunk?

The v8sf/v4sf reduction sequences and ix86_expand_reduc would
probably benefit from similar treatment (ix86_expand_reduc from
first reducing %zmm -> %ymm -> %xmm); see the sketch below.
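
For v4sf a similar sequence could look like this (hypothetical,
not part of this patch, and not benchmarked):

  #include <immintrin.h>

  float
  reduc_v4sf_idea (__m128 x)
  {
    __m128 swp = _mm_movehl_ps (x, x);             /* movhlps */
    __m128 sum = _mm_add_ps (x, swp);              /* addps */
    swp = _mm_shuffle_ps (sum, sum, 1);            /* element 1 to slot 0 */
    return _mm_cvtss_f32 (_mm_add_ss (sum, swp));  /* addss */
  }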

Thanks,
Richard.

2018-10-02  Richard Biener  <rguent...@suse.de>

        * config/i386/sse.md (reduc_plus_scal_v4df): Avoid the use
        of haddv4df; first reduce to SSE width and exploit the fact
        that we only need element zero of the reduction result.
        (reduc_plus_scal_v2df): Likewise.

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md      (revision 264758)
+++ gcc/config/i386/sse.md      (working copy)
@@ -2473,24 +2473,28 @@ (define_expand "reduc_plus_scal_v4df"
    (match_operand:V4DF 1 "register_operand")]
   "TARGET_AVX"
 {
-  rtx tmp = gen_reg_rtx (V4DFmode);
-  rtx tmp2 = gen_reg_rtx (V4DFmode);
-  rtx vec_res = gen_reg_rtx (V4DFmode);
-  emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
-  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
+  rtx tmp = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
+  rtx tmp2 = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
+  rtx tmp3 = gen_reg_rtx (V2DFmode);
+  emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
+  emit_insn (gen_adddf3 (operands[0],
+                        gen_lowpart (DFmode, tmp2),
+                        gen_lowpart (DFmode, tmp3)));
   DONE;
 })
 
 (define_expand "reduc_plus_scal_v2df"
   [(match_operand:DF 0 "register_operand")
    (match_operand:V2DF 1 "register_operand")]
-  "TARGET_SSE3"
+  "TARGET_SSE2"
 {
   rtx tmp = gen_reg_rtx (V2DFmode);
-  emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_vec_extractv2dfdf (operands[0], tmp, const0_rtx));
+  emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
+  emit_insn (gen_adddf3 (operands[0],
+                        gen_lowpart (DFmode, tmp),
+                        gen_lowpart (DFmode, operands[1])));
   DONE;
 })
 
