I notice that for Zen we create 0.00 │ vhaddp %ymm3,%ymm3,%ymm3 1.41 │ vperm2 $0x1,%ymm3,%ymm3,%ymm1 1.45 │ vaddpd %ymm1,%ymm2,%ymm2
from reduc_plus_scal_v4df which uses a cross-lane permute vperm2f128 even though the upper half of the result is unused in the end (we only use the single-precision element zero). Much better would be to use vextractf128 which is well-pipelined and has good throughput (though using vhaddp in itself is quite bad for Zen I didn't try benchmarking it against open-coding that yet, aka disabling the expander). I can generate vhaddpd %ymm3, %ymm3, %ymm3 vextractf128 $0x1, %ymm3, %xmm1 vaddpd %xmm1, %xmm3, %xmm3 with Index: gcc/config/i386/sse.md =================================================================== --- gcc/config/i386/sse.md (revision 264758) +++ gcc/config/i386/sse.md (working copy) @@ -2474,12 +2474,12 @@ (define_expand "reduc_plus_scal_v4df" "TARGET_AVX" { rtx tmp = gen_reg_rtx (V4DFmode); - rtx tmp2 = gen_reg_rtx (V4DFmode); - rtx vec_res = gen_reg_rtx (V4DFmode); + rtx tmp2 = gen_reg_rtx (V2DFmode); + rtx vec_res = gen_reg_rtx (V2DFmode); emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1])); - emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1))); - emit_insn (gen_addv4df3 (vec_res, tmp, tmp2)); - emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx)); + emit_insn (gen_vec_extract_hi_v4df (tmp2, tmp)); + emit_insn (gen_addv2df3 (vec_res, gen_lowpart (V2DFmode, tmp), tmp2)); + emit_insn (gen_vec_extractv2dfdf (operands[0], vec_res, const0_rtx)); DONE; }) easily though even using scalar operations for the add would be possible. reduc_plus_scal_v8df uses ix86_expand_reduc which seems to use full-width instructions throughout. I recently changed the vectorizer to open-code tem%ymm = lowpart(%zmm) + highpart(%zmm); tem%xmm = lowpart(tem%ymm) + highpart(tem%ymm); reduce(tem%xmm) which is better for all cores so I wonder if ix86_expand_reduc should follow that scheme. As said in a related PR the backend is in full control of the final reduction sequence used when defining reduc_plus_scal_<mode> which IMHO is good and we likely should have tuning specific patterns in case there isn't a one-fits-all one. Thanks, Richard. -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)