I notice that for Zen we create
0.00 │ vhaddp %ymm3,%ymm3,%ymm3
1.41 │ vperm2 $0x1,%ymm3,%ymm3,%ymm1
1.45 │ vaddpd %ymm1,%ymm2,%ymm2
from reduc_plus_scal_v4df, which uses the cross-lane permute vperm2f128
even though the upper half of the result is unused in the end
(we only use the double-precision element zero). Much better would
be to use vextractf128, which is well-pipelined and has good throughput.
(Using vhaddpd is itself quite bad for Zen, but I haven't yet tried
benchmarking it against open-coding the reduction, i.e. disabling the
expander.) I can generate
vhaddpd %ymm3, %ymm3, %ymm3
vextractf128 $0x1, %ymm3, %xmm1
vaddpd %xmm1, %xmm3, %xmm3
with
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md (revision 264758)
+++ gcc/config/i386/sse.md (working copy)
@@ -2474,12 +2474,12 @@ (define_expand "reduc_plus_scal_v4df"
"TARGET_AVX"
{
rtx tmp = gen_reg_rtx (V4DFmode);
- rtx tmp2 = gen_reg_rtx (V4DFmode);
- rtx vec_res = gen_reg_rtx (V4DFmode);
+ rtx tmp2 = gen_reg_rtx (V2DFmode);
+ rtx vec_res = gen_reg_rtx (V2DFmode);
emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
- emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
- emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
- emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
+ emit_insn (gen_vec_extract_hi_v4df (tmp2, tmp));
+ emit_insn (gen_addv2df3 (vec_res, gen_lowpart (V2DFmode, tmp), tmp2));
+ emit_insn (gen_vec_extractv2dfdf (operands[0], vec_res, const0_rtx));
DONE;
})
easily, though even using scalar operations for the add would be possible.
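
For illustration, the extract-based sequence can be sketched with AVX
intrinsics in C. This is only a model of the instruction sequence, not the
expander code: the function names are hypothetical, the target attribute
assumes GCC or Clang on x86-64, and the compiler is free to pick different
instructions than the comments suggest.

```c
#include <immintrin.h>

/* Hypothetical helper: sum the four doubles at P using a horizontal add,
   a 128-bit extract of the upper half and a 128-bit vector add.
   Assumes GCC/Clang (target attribute) and an AVX-capable CPU.  */
__attribute__ ((target ("avx")))
double
reduce_plus_v4df (const double *p)
{
  __m256d v = _mm256_loadu_pd (p);            /* { a, b, c, d } */
  __m256d h = _mm256_hadd_pd (v, v);          /* { a+b, a+b, c+d, c+d } */
  __m128d hi = _mm256_extractf128_pd (h, 1);  /* upper half: { c+d, c+d } */
  __m128d lo = _mm256_castpd256_pd128 (h);    /* lower half: { a+b, a+b } */
  __m128d sum = _mm_add_pd (lo, hi);          /* packed 128-bit add */
  return _mm_cvtsd_f64 (sum);                 /* element zero */
}

/* Same idea, but the final add is scalar: only element zero is summed.  */
__attribute__ ((target ("avx")))
double
reduce_plus_v4df_scalar_add (const double *p)
{
  __m256d v = _mm256_loadu_pd (p);
  __m256d h = _mm256_hadd_pd (v, v);
  __m128d hi = _mm256_extractf128_pd (h, 1);
  __m128d lo = _mm256_castpd256_pd128 (h);
  return _mm_cvtsd_f64 (_mm_add_sd (lo, hi)); /* scalar add of element zero */
}
```

Both variants only ever touch the low 128 bits after the extract, which is
the point of avoiding the cross-lane permute.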
reduc_plus_scal_v8df uses ix86_expand_reduc, which seems to use
full-width instructions throughout. I recently changed the vectorizer
to open-code
  tem%ymm = lowpart(%zmm) + highpart(%zmm);
  tem%xmm = lowpart(tem%ymm) + highpart(tem%ymm);
  reduce(tem%xmm)
which is better for all cores, so I wonder if ix86_expand_reduc should
follow that scheme. As said in a related PR, the backend is in full
control of the final reduction sequence used when defining
reduc_plus_scal_<mode>, which IMHO is good, and we likely should have
tuning-specific patterns in case there isn't a one-size-fits-all one.
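
The halving scheme for eight doubles can be modelled in plain C as below.
This is a portable sketch of the data flow only, with hypothetical function
names; it does not use vector registers and is not the vectorizer's or
ix86_expand_reduc's actual code.

```c
#include <stddef.h>

/* Add the upper half of v[0..n) onto the lower half, element-wise.
   Models "lowpart (x) + highpart (x)" at half the vector width.  */
static void
fold_halves (double *v, size_t n)
{
  for (size_t i = 0; i < n / 2; i++)
    v[i] += v[i + n / 2];
}

/* Reduce eight doubles by halving the width at each step:
   512 bits -> 256 bits -> 128 bits -> one element.  */
double
reduce_plus_v8df_model (const double *p)
{
  double t[8];
  for (size_t i = 0; i < 8; i++)
    t[i] = p[i];
  fold_halves (t, 8);  /* tem%ymm = lowpart (%zmm) + highpart (%zmm) */
  fold_halves (t, 4);  /* tem%xmm = lowpart (tem%ymm) + highpart (tem%ymm) */
  fold_halves (t, 2);  /* reduce (tem%xmm) */
  return t[0];
}
```

Each step operates on half as many elements as the previous one, so no
full-width operation is needed after the first add.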
Thanks,
Richard.
--
Richard Biener <[email protected]>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB
21284 (AG Nuernberg)