https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117562
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
  34.37%  419183  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
  17.28%  212804  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] utt_decode_block.constprop.0
It's the inlined copy of vector_gautbl_eval_logs3 where we now seem to
hit vectorized code for
  for (i = 0; i < veclen; i++) {
    diff1 = x[i] - m1[i];
    dval1 -= diff1 * diff1 * v1[i];
    diff2 = x[i] - m2[i];
    dval2 -= diff2 * diff2 * v2[i];
  }
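Pulled out as a standalone sketch, the kernel looks like the following; the signature, types and the double-precision accumulation are assumptions based on the snippet and the vcvtps2pd/vfnmadd sequences below, not the exact sphinx3 sources:

```c
/* Hedged sketch of the vector_gautbl_eval_logs3 loop kernel.  Floats
   are widened to double and accumulated with fused negated
   multiply-adds, matching the vcvtps2pd/vfnmadd instructions seen in
   the profile; names are illustrative.  */
static void
kernel (int veclen, const float *x,
        const float *m1, const float *v1,
        const float *m2, const float *v2,
        double *dval1p, double *dval2p)
{
  double dval1 = *dval1p, dval2 = *dval2p;
  for (int i = 0; i < veclen; i++)
    {
      double diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];   /* candidate for vfnmadd */
      double diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }
  *dval1p = dval1;
  *dval2p = dval2;
}
```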
We never hit the zmm version, spend few cycles in the ymm version, and
now very many in the xmm version:
   772 │ vmovaps      -0xc0(%rbp),%xmm6
     4 │ vmulpd       %xmm13,%xmm13,%xmm13
       │ vmulpd       %xmm1,%xmm1,%xmm1
102664 │ vfnmadd132pd %xmm14,%xmm3,%xmm13
     1 │ vcvtps2pd    %xmm6,%xmm3
       │ diff2 = x[i] - m2[i];
     3 │ vmovaps      -0xd0(%rbp),%xmm6
       │ dval1 -= diff1 * diff1 * v1[i];
 95472 │ vfnmadd132pd %xmm1,%xmm13,%xmm3
We also end up applying basic-block vectorization to the scalar loop
via the stores after it:
  if (dval1 < gautbl->distfloor)
    dval1 = gautbl->distfloor;
  if (dval2 < gautbl->distfloor)
    dval2 = gautbl->distfloor;
  score[r] = (int32)(f * dval1);
  score[r+1] = (int32)(f * dval2);
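For reference, a hedged intrinsics sketch of what basic-block (SLP) vectorization makes of that epilogue: the two conditional floors become one vector max and the two adjacent int32 stores become one paired 64-bit store. This assumes an x86-64/SSE2 target; the names and exact instruction selection are my illustration, not GCC's output:

```c
#include <emmintrin.h>

/* Illustrative SLP form of the epilogue (an assumption, not GCC's
   actual code): both lanes floored, scaled and truncated at once.  */
static void
epilogue_slp (double dval1, double dval2, double distfloor,
              double f, int *score)
{
  __m128d d = _mm_setr_pd (dval1, dval2);
  d = _mm_max_pd (d, _mm_set1_pd (distfloor)); /* dvalN = max(dvalN, floor) */
  d = _mm_mul_pd (d, _mm_set1_pd (f));         /* f * dvalN */
  __m128i i = _mm_cvttpd_epi32 (d);            /* truncate to two int32s */
  _mm_storel_epi64 ((__m128i *) score, i);     /* score[r], score[r+1] */
}
```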
Interestingly, without LTO this specific instance doesn't behave that
badly but is instead faster with the three epilogues:
  12.63%  132184  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3
  11.55%  119447  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] vector_gautbl_eval_logs3
In turn, a similar case shows up elsewhere:
  27.63%  285537  sphinx_livepret  sphinx_livepretend_peak.amd64-m64-gcc42-nn  [.] mgau_eval
  15.19%  158646  sphinx_livepret  sphinx_livepretend_base.amd64-m64-gcc42-nn  [.] mgau_eval
That's basically the same loop kernel.  What we can see here is:
   130 │ vmovhps      %xmm4,0x60(%rsp)
     5 │ vmovhps      %xmm2,0x70(%rsp)
       │ vmovaps      0x70(%rsp),%xmm0
 10552 │ vcvtps2pd    %xmm4,%xmm6
  1204 │ vcvtps2pd    %xmm2,%xmm3
       │ vcvtps2pd    %xmm0,%xmm2
  6872 │ vmovaps      0x60(%rsp),%xmm0
  1598 │ vmulpd       %xmm3,%xmm3,%xmm3
     5 │ vmulpd       %xmm2,%xmm2,%xmm2
 15119 │ vfnmadd132pd %xmm6,%xmm1,%xmm3
    38 │ vcvtps2pd    %xmm0,%xmm1
  5088 │ vfnmadd132pd %xmm2,%xmm3,%xmm1
 33818 │ vunpckhpd    %xmm1,%xmm1,%xmm2
 39938 │ vaddpd       %xmm1,%xmm2,%xmm2
 37932 │ test         $0x3,%cl
There is an odd spill/reload at 0x60(%rsp) whose origin I can't see.
Could it be that V2SFmode is spilled as V2SFmode but reloaded as V4SFmode?!
#(insn:TI 1161 1175 1578 15 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
#                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
#            (parallel [
#                    (const_int 6 [0x6])
#                    (const_int 7 [0x7])
#                    (const_int 2 [0x2])
#                    (const_int 3 [0x3])
#                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
#     (expr_list:REG_DEAD (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566])
#        (nil)))
        vmovhps %xmm2, 112(%rsp)        # 1161 [c=4 l=8]  sse_movhlps/4
#(insn 1578 1161 1173 15 (set (reg:V4SF 20 xmm0 [853])
#        (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])) "cont_mgau.c":157:9 2354 {movv4sf_internal}
#     (nil))
        vmovaps 112(%rsp), %xmm0        # 1578 [c=10 l=8]  movv4sf_internal/3
Huh. It looks like this is from a V4SF -> 2xV2DF extension via
vec_unpack_{hi,lo}_expr.
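For comparison, here is the lowering one would expect for such an unpack, sketched with intrinsics (my assumption about the intended code on an x86-64/SSE2 target, not GCC's actual expander output): vcvtps2pd on the low half and a movhlps plus vcvtps2pd for the high half, with no round trip through the stack:

```c
#include <emmintrin.h>

/* Expected register-only V4SF -> 2xV2DF unpack (illustrative sketch):
   the low two floats convert directly, the high two are first moved
   down with movhlps and then converted.  */
static void
unpack_ps_to_pd (__m128 v, __m128d *lo, __m128d *hi)
{
  *lo = _mm_cvtps_pd (v);                    /* converts elements 0,1 */
  *hi = _mm_cvtps_pd (_mm_movehl_ps (v, v)); /* converts elements 2,3 */
}
```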
Originally this is
(insn 1161 1160 1162 58 (set (reg:V4SF 853)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 853)
                (reg:V4SF 566 [ vect__811.229 ]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (expr_list:REG_DEAD (reg:V4SF 566 [ vect__811.229 ])
        (nil)))
but LRA chooses the memory alternative for the destination:
    Choosing alt 4 in insn 1161:  (0) m  (1) 0  (2) v {sse_movhlps}
       Considering alt=0 of insn 1162:  (0) =v  (1) v
            overall=6,losers=1,rld_nregs=1
    Choosing alt 0 in insn 1162:  (0) =v  (1) v {sse2_cvtps2pd}
      Creating newreg=998 from oldreg=853, assigning class ALL_SSE_REGS to r998
 1162: r568:V2DF=float_extend(vec_select(r998:V4SF,parallel))
    Inserting insn reload before:
 1578: r998:V4SF=r853:V4SF
This is the following pattern:
(define_insn "sse_movhlps"
  [(set (match_operand:V4SF 0 "nonimmediate_operand"     "=x,v,x,v,m")
        (vec_select:V4SF
          (vec_concat:V8SF
            (match_operand:V4SF 1 "nonimmediate_operand" " 0,v,0,v,0")
            (match_operand:V4SF 2 "nonimmediate_operand" " x,v,o,o,v"))
          (parallel [(const_int 6)
                     (const_int 7)
                     (const_int 2)
                     (const_int 3)])))]
  "TARGET_SSE && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   movhlps\t{%2, %0|%0, %2}
   vmovhlps\t{%2, %1, %0|%0, %1, %2}
   movlps\t{%H2, %0|%0, %H2}
   vmovlps\t{%H2, %1, %0|%0, %1, %H2}
   %vmovhps\t{%2, %0|%q0, %2}"
  [(set_attr "isa" "noavx,avx,noavx,avx,*")
   (set_attr "type" "ssemov2")
   (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex,maybe_vex")
   (set_attr "mode" "V4SF,V4SF,V2SF,V2SF,V2SF")])
Indeed the "mode" attr says V2SF for the memory (store) alternative,
but this store is of a V4SFmode value.  Also LRA doesn't seem to
understand that match_operand 1 should be the same memory:
(insn 1161 1160 1578 59 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
        (vec_select:V4SF (vec_concat:V8SF (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 112 [0x70])) [11 %sfp+-16 S16 A128])
                (reg:V4SF 22 xmm2 [orig:566 vect__811.229 ] [566]))
            (parallel [
                    (const_int 6 [0x6])
                    (const_int 7 [0x7])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))) "cont_mgau.c":157:9 5181 {sse_movhlps}
     (nil))
It's also odd that I don't see the spill to (mem/c:V4SF (plus:DI
(reg/f:DI 7 sp) (const_int 112 [0x70]))) that LRA would need to
generate for the input operand.
Clearly something is odd here, and clearly this alternative is bad.