RFC: x87 reduc_plus_scal_* AVX (and AVX512?) expanders

Richard Biener Mon, 01 Oct 2018 07:47:47 -0700

I notice that for Zen we create

  0.00 │       vhaddp %ymm3,%ymm3,%ymm3
  1.41 │       vperm2 $0x1,%ymm3,%ymm3,%ymm1
  1.45 │       vaddpd %ymm1,%ymm2,%ymm2


from reduc_plus_scal_v4df which uses a cross-lane permute vperm2f128
even though the upper half of the result is unused in the end
(we only use the single-precision element zero).  Much better would
be to use vextractf128 which is well-pipelined and has good throughput
(though using vhaddp in itself is quite bad for Zen I didn't try
benchmarking it against open-coding that yet, aka disabling the
expander).  I can generate

        vhaddpd %ymm3, %ymm3, %ymm3
        vextractf128    $0x1, %ymm3, %xmm1
        vaddpd  %xmm1, %xmm3, %xmm3

with

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md      (revision 264758)
+++ gcc/config/i386/sse.md      (working copy)
@@ -2474,12 +2474,12 @@ (define_expand "reduc_plus_scal_v4df"
   "TARGET_AVX"
 {
   rtx tmp = gen_reg_rtx (V4DFmode);
-  rtx tmp2 = gen_reg_rtx (V4DFmode);
-  rtx vec_res = gen_reg_rtx (V4DFmode);
+  rtx tmp2 = gen_reg_rtx (V2DFmode);
+  rtx vec_res = gen_reg_rtx (V2DFmode);
   emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
-  emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
-  emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
-  emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
+  emit_insn (gen_vec_extract_hi_v4df (tmp2, tmp));
+  emit_insn (gen_addv2df3 (vec_res, gen_lowpart (V2DFmode, tmp), tmp2));
+  emit_insn (gen_vec_extractv2dfdf (operands[0], vec_res, const0_rtx));
   DONE;
 })
 

easily though even using scalar operations for the add would be possible.

reduc_plus_scal_v8df uses ix86_expand_reduc which seems to use
full-width instructions throughout.  I recently changed the vectorizer
to open-code tem%ymm = lowpart(%zmm) + highpart(%zmm);
tem%xmm = lowpart(tem%ymm) + highpart(tem%ymm); reduce(tem%xmm) which
is better for all cores so I wonder if ix86_expand_reduc should follow
that scheme.  As said in a related PR the backend is in full control
of the final reduction sequence used when defining reduc_plus_scal_<mode>
which IMHO is good and we likely should have tuning specific patterns
in case there isn't a one-fits-all one.

Thanks,
Richard.

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)

RFC: x87 reduc_plus_scal_* AVX (and AVX512?) expanders

Reply via email to