https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
      IF ( xij.GT.+HALf ) xij = xij - PBCx
      IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be true*, we can implement it more efficiently, I think, by checking the magnitude of xij to see if a SUB is needed, and if so figuring out the sign to apply to PBCx.

if(abs(xij) > HALF) {
    xij -= PBCx XOR sign_bit( xij )
}

    # xij  in xmm0
    # PBCx in xmm7
    # HALF in xmm6
    # set1( -0.0f ) in xmm5 (i.e. 1U<<31, a sign-bit mask)
    vandnps   %xmm0, %xmm5, %xmm1    # abs(xij) = xij & ~signmask
    vcmpltps  %xmm1, %xmm6, %xmm1    # HALF < abs(xij)

    vandps    %xmm5, %xmm0, %xmm2    # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)

    vandps    %xmm2, %xmm1, %xmm1    # +-PBCx, or 0.0 if abs(xij) <= HALF
    vsubps    %xmm1, %xmm0, %xmm0    # xij -= PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDNPS + CMPPS + ANDPS + SUBPS = 10 cycles on Skylake.

We might want to use VPAND for some of this on Haswell, to avoid a port 5 bottleneck at least on the critical path. (Skylake runs FP booleans on any of the vector ALU ports: p0, p1 or p5. BDW and earlier restrict them to port 5, where they can't compete with FMA, and where bypass latency is always optimal. On SKL they can introduce extra bypass latency if they pick p0 or p1.)

----

    vandps    %xmm5, %xmm0, %xmm2    # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)

could be replaced with a (v)blendvps using the original xij to select between PBCx and -PBCx. With the SSE encoding, that saves a uop and a cycle of latency (but only off the critical path). And I think it would cost us a vmovaps to set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't really looked. It should be much better for *latency*. But it's only equivalent if subtracting PBCx can't possibly make xij negative enough that the second IF condition is also true.
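To make the equivalence condition concrete, here is a rough scalar C model of the sequence above, one statement per instruction. This is only a sketch: the function names (wrap_two_ifs, wrap_merged, check_equivalence) and the helpers are made up for illustration, and the test values HALF = 0.5, PBCx = 1.0 are my assumption about a typical periodic-boundary setup, not something taken from the original source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers (hypothetical names); memcpy avoids aliasing problems. */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Reference: the two sequential IFs from the Fortran source. */
static float wrap_two_ifs(float xij, float half, float pbcx)
{
    if (xij >  half) xij -= pbcx;
    if (xij < -half) xij += pbcx;
    return xij;
}

/* Scalar model of the merged sequence, one line per vector instruction. */
static float wrap_merged(float xij, float half, float pbcx)
{
    const uint32_t sign_mask = 0x80000000u;            /* bits of -0.0f     */
    uint32_t x       = f2u(xij);
    uint32_t abs_x   = x & ~sign_mask;                 /* vandnps: abs(xij) */
    uint32_t need    = (half < u2f(abs_x)) ? 0xFFFFFFFFu : 0u; /* vcmpltps  */
    uint32_t sign    = x & sign_mask;                  /* vandps: signbit   */
    uint32_t spbc    = f2u(pbcx) ^ sign;               /* vxorps: +-PBCx    */
    uint32_t delta   = spbc & need;                    /* vandps: or 0.0    */
    return xij - u2f(delta);                           /* vsubps            */
}

/* Compare both versions on inputs with |xij| < PBCx, where at most one IF
   can fire. Assumes half == pbcx / 2 (an assumption, see lead-in). */
static void check_equivalence(float half, float pbcx)
{
    for (int i = -1000; i <= 1000; i++) {
        float x = (float)i / 1024.0f * pbcx;
        assert(wrap_two_ifs(x, half, pbcx) == wrap_merged(x, half, pbcx));
    }
}
```

With HALF = PBCx/2 and |xij| < PBCx, subtracting PBCx from an xij above HALF lands in (-HALF, 0), so the second IF never fires and the two versions agree. If xij can exceed HALF with PBCx > 2*HALF, the two-IF version may add PBCx back and the merged version will not.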
---

I was looking at a similar case of applying a fixup if the absolute value of an input is outside a range, in https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840. I don't think I came up with anything there that's not already obvious or covered by the example above, though.

Except that if we had needed to square xij at some point anyway, we could have checked xij * xij < HALF*HALF as the bound condition to save the ANDNPS. But then the mulps latency is part of the input to cmpps.
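A scalar C sketch of that squared-compare variant, under the assumption that xij*xij is already being computed for some other purpose (the parameter xij_sq below stands in for that). The names wrap_with_square, bits_of, float_of and check_square_variant are hypothetical, and HALF*HALF is assumed to be a precomputed constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-cast helpers (hypothetical names). */
static uint32_t bits_of(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    float_of(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* xij_sq  = xij * xij, assumed to be available already.
   half_sq = HALF * HALF, assumed to be a precomputed constant. */
static float wrap_with_square(float xij, float xij_sq, float half_sq, float pbcx)
{
    /* Compare the squares directly: no ANDNPS needed for abs(xij),
       but the multiply now feeds the compare. */
    uint32_t need  = (half_sq < xij_sq) ? 0xFFFFFFFFu : 0u;
    uint32_t sign  = bits_of(xij) & 0x80000000u;       /* signbit(xij)  */
    uint32_t delta = (bits_of(pbcx) ^ sign) & need;    /* +-PBCx or 0.0 */
    return xij - float_of(delta);
}

/* Spot checks with exactly representable values (HALF = 0.5, PBCx = 1.0). */
static void check_square_variant(void)
{
    const float half = 0.5f, pbcx = 1.0f;
    assert(wrap_with_square( 0.75f, 0.75f * 0.75f, half * half, pbcx) == -0.25f);
    assert(wrap_with_square(-0.75f, 0.75f * 0.75f, half * half, pbcx) ==  0.25f);
    assert(wrap_with_square( 0.25f, 0.25f * 0.25f, half * half, pbcx) ==  0.25f);
}
```

Comparing squares is not guaranteed to give the same answer as comparing magnitudes in every case, for example when the square overflows or underflows, so this only makes sense when the input range is known.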