https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-01-31
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |peter at cordes dot ca
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Bisecting on a different Haswell machine:

r266526:    5.35user 0.00system 0:05.36elapsed 99%CPU
trunk head: 5.80user 0.00system 0:05.81elapsed 99%CPU

output also differs:

   STEP  LP  KIN.E    POT.E    TOT.E    DIFFUS   PX       PY       PZ
   ----  --  -------  -------  -------  -------- -------- -------- --------
  LENGTH =    25804/   163840
 -    1  L   0.0000  -3.0509  -3.0509   0.0000  -0.8E-15 -0.5E-15  0.1E-14
 +    1  L   0.0000  -3.0509  -3.0509   0.0000   0.3E-15  0.8E-15  0.0E+00

on current trunk verification says it PASSES; past logs indicate it passed
there as well.

r266587: 5.90user 0.00system 0:05.91elapsed 99%CPU
         (same output as r266526)
r266557: 5.90user 0.01system 0:05.91elapsed 99%CPU
r266537: 5.33user 0.00system 0:05.33elapsed 100%CPU
r266548: 5.88user 0.01system 0:05.89elapsed 100%CPU
r266545: 5.34user 0.00system 0:05.34elapsed
r266546: same f951
r266547: same f951

So it is r266548, the fix for PR88189:

        PR target/88189
        * config/i386/i386.c (ix86_expand_sse_movcc): Handle DFmode and
        SFmode using sse4_1_blendvs[sd] with TARGET_SSE4_1.  Formatting fixes.
        * config/i386/sse.md (sse4_1_blendv<ssemodesuffix>): New pattern.

We see extra vblendvpd used for if-conversion in non-vectorized paths in
mforce:

      DO i = 1 , MOLsa
         DO nll = MRKr1(i) , MRKr2(i)
            j = LISt(nll)
            xij = X0(1,i) - X0(1,j)
            IF ( xij.GT.+HALf ) xij = xij - PBCx
            IF ( xij.LT.-HALf ) xij = xij + PBCx
            yij = X0(2,i) - X0(2,j)
            IF ( yij.GT.+HALf ) yij = yij - PBCy
            IF ( yij.LT.-HALf ) yij = yij + PBCy
            zij = X0(3,i) - X0(3,j)
            IF ( zij.GT.+HALf ) zij = zij - PBCz
            IF ( zij.LT.-HALf ) zij = zij + PBCz
            ...
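To make the transformation concrete, here is a rough C model of how one of
these guarded updates is lowered before and after the change.  This is my
sketch with SSE4.1 intrinsics, not taken from the testcase; the function
names are made up, and it assumes plain doubles and a build with -msse4.1.

#include <immintrin.h>

/* Branchy lowering (pre-r266548): vcomisd + jbe; the subtraction only
   executes when the (rarely taken) branch fires, so it stays off the
   critical path when the branch predicts well.  */
static double
wrap_branchy (double xij, double half, double pbc)
{
  if (xij > +half)
    xij -= pbc;
  if (xij < -half)
    xij += pbc;
  return xij;
}

/* If-converted lowering (post-r266548), modeled with intrinsics: both arms
   are computed unconditionally and each blend result feeds the next
   compare, forming a serial data-dependence chain of vcmpltsd/vblendvpd.  */
static double
wrap_blendv (double xij, double half, double pbc)
{
  __m128d x  = _mm_set_sd (xij);
  __m128d p  = _mm_set_sd (pbc);
  __m128d hi = _mm_set_sd (+half);
  __m128d lo = _mm_set_sd (-half);

  __m128d gt = _mm_cmpgt_sd (x, hi);              /* xij > +half ?  */
  x = _mm_blendv_pd (x, _mm_sub_sd (x, p), gt);   /* xij -= pbc     */

  __m128d lt = _mm_cmplt_sd (x, lo);              /* xij < -half ?  */
  x = _mm_blendv_pd (x, _mm_add_sd (x, p), lt);   /* xij += pbc     */

  return _mm_cvtsd_f64 (x);
}

In the assembly diff below, the old, branchy code is on the left and the new,
if-converted code on the right: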
.L241:                                           .L241:
        movslq liscom_-4(,%rdx,4), %rcx                  movslq liscom_-4(,%rdx,4), %rcx
        leaq (%rcx,%rcx,2), %rax                         leaq (%rcx,%rcx,2), %rax
        vsubsd lcs_+48(,%rax,8), %xmm9, %xmm3  |         vsubsd lcs_+48(,%rax,8), %xmm11, %xmm6
        vcomisd %xmm4, %xmm3                   |         vsubsd %xmm13, %xmm6, %xmm5
        jbe .L226                              |         vcmpltsd %xmm6, %xmm0, %xmm4
        vsubsd %xmm11, %xmm3, %xmm3            |         vblendvpd %xmm4, %xmm5, %xmm6, %xmm7
.L226:                                         |         vsubsd lcs_+56(,%rax,8), %xmm10, %xmm5
        vcomisd %xmm3, %xmm5                   |         vaddsd %xmm13, %xmm7, %xmm8
        jbe .L228                              |         vcmpltsd %xmm1, %xmm7, %xmm6
        vaddsd %xmm11, %xmm3, %xmm3            |         vsubsd %xmm2, %xmm5, %xmm4
.L228:                                         |         vblendvpd %xmm6, %xmm8, %xmm7, %xmm6
        vsubsd lcs_+56(,%rax,8), %xmm8, %xmm2  |         vcmpltsd %xmm5, %xmm0, %xmm7
        vcomisd %xmm4, %xmm2                   |         vblendvpd %xmm7, %xmm4, %xmm5, %xmm8
        jbe .L230                              |         vcmpltsd %xmm1, %xmm8, %xmm4
        vsubsd 264(%rsp), %xmm2, %xmm2         |         vaddsd %xmm2, %xmm8, %xmm5
.L230:                                         |         vblendvpd %xmm4, %xmm5, %xmm8, %xmm5
        vcomisd %xmm2, %xmm5                   |         vsubsd lcs_+64(,%rax,8), %xmm9, %xmm4
        jbe .L232                              |         vsubsd %xmm3, %xmm4, %xmm7
        vaddsd 264(%rsp), %xmm2, %xmm2         |         vcmpltsd %xmm4, %xmm0, %xmm8
.L232:                                         |         vblendvpd %xmm8, %xmm7, %xmm4, %xmm4
        vsubsd lcs_+64(,%rax,8), %xmm7, %xmm0  |         vaddsd %xmm3, %xmm4, %xmm7
        vcomisd %xmm4, %xmm0                   |         vcmpltsd %xmm1, %xmm4, %xmm8
        jbe .L234                              |         vblendvpd %xmm8, %xmm7, %xmm4, %xmm4
        vsubsd 256(%rsp), %xmm0, %xmm0         |         vmulsd 256(%rsp), %xmm4, %xmm7
.L234:                                         |         vmulsd 272(%rsp), %xmm5, %xmm8
        vcomisd %xmm0, %xmm5                   |         vfmadd231sd %xmm5, %xmm14, %xmm7
        jbe .L236                              |         vfmadd231sd %xmm6, %xmm15, %xmm8
        vaddsd 256(%rsp), %xmm0, %xmm0         |         vfmadd231sd 264(%rsp), %xmm4, %xmm8
.L236:                                         |         vmulsd %xmm5, %xmm7, %xmm5
        vmulsd 272(%rsp), %xmm0, %xmm1         |         vfmadd231sd %xmm6, %xmm8, %xmm5
        vmulsd %xmm2, %xmm14, %xmm6            |         vmulsd %xmm4, %xmm4, %xmm6
        vfmadd231sd %xmm2, %xmm12, %xmm1       |         vfmadd231sd 280(%rsp), %xmm6, %xmm5
        vfmadd231sd %xmm3, %xmm13, %xmm6       |         vcomisd %xmm5, %xmm12
        vfmadd231sd 280(%rsp), %xmm0, %xmm6    <
        vmulsd %xmm2, %xmm1, %xmm2             <
        vfmadd231sd %xmm3, %xmm6, %xmm2        <
        vmulsd %xmm0, %xmm0, %xmm3             <
        vfmadd231sd %xmm3, %xmm15, %xmm2       <
        vcomisd %xmm2, %xmm10                  <
        jbe .L238                                        jbe .L238

The new code looks better, but my guess is that the branches are
well-predicted and the actual arithmetic has no bad data dependences, while
the if-converted code is full of them.  According to Agner's instruction
tables, blendvpd is also 2 uops and constrained to a single port, with only
one executed every two cycles and a latency of two cycles, compared to
blendpd, which is one uop, can issue on three ports, and has one-cycle
latency.  So this many blendvpd in rapid succession are not a good idea.
I wasn't able to actually perf this; somehow it doesn't like me today.
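A minimal microbenchmark sketch of the latency point (mine, not measured on
the machines above; it assumes a Haswell-class core, GCC with -O2 -msse4.1,
and POSIX clock_gettime): the blendv chain is bound by the cmpsd/blendvpd
latency because each iteration depends on the previous blend result, while
the branchy version, with a never-taken and hence perfectly predicted
branch, keeps the conditional subtraction off the critical path.

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Serial chain of compare + blendv: each input is the previous blend
   result, so the loop runs at the latency of the chain.  */
static double
chain_blendv (double x, double half, double pbc)
{
  __m128d v = _mm_set_sd (x), h = _mm_set_sd (half), p = _mm_set_sd (pbc);
  for (long i = 0; i < ITERS; i++)
    {
      __m128d m = _mm_cmpgt_sd (v, h);                 /* always false here */
      v = _mm_blendv_pd (v, _mm_sub_sd (v, p), m);
      v = _mm_add_sd (v, _mm_set_sd (1e-9));           /* keep chain alive  */
    }
  return _mm_cvtsd_f64 (v);
}

/* Same update written with a branch; the comparison never fires, so the
   branch predicts perfectly and the subtraction is never executed.  */
static double
chain_branch (double x, double half, double pbc)
{
  for (long i = 0; i < ITERS; i++)
    {
      if (x > half)
        x -= pbc;
      x += 1e-9;
    }
  return x;
}

static double
secs (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main (void)
{
  double t0 = secs ();
  double a = chain_blendv (0.0, 0.5, 1.0);
  double t1 = secs ();
  double b = chain_branch (0.0, 0.5, 1.0);
  double t2 = secs ();
  printf ("blendv chain: %.2fs (%.17g)\nbranch chain: %.2fs (%.17g)\n",
          t1 - t0, a, t2 - t1, b);
  return 0;
}

One caveat: with the patched compiler the "branchy" variant may itself get
if-converted to blendv, so the generated assembly of both loops needs to be
checked before trusting the numbers.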