https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
               IF ( xij.GT.+HALf ) xij = xij - PBCx
               IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be
true*, we can implement it more efficiently, I think, by checking the magnitude
of xij to see if a SUB is needed, and if so figuring out the sign to apply to
PBCx.

if(abs(xij) > HALF) {
    xij -= PBCx XOR sign_bit( xij )
}


    # xij  in  xmm0
    # PBCx in  xmm7
    # HALF in  xmm6
    # set1( -0.0f ) in xmm5 (i.e. 1U<<31 a sign-bit mask)
    vandnps    %xmm0, %xmm5, %xmm1    # abs(xij) = ~signmask & xij
    vcmpltps   %xmm1, %xmm6, %xmm1    # HALF < abs(xij)

    vandps    %xmm5, %xmm0, %xmm2     # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2     # PBCx (xij>=0) or -PBCx  (xij<0)

    vandps    %xmm2, %xmm1, %xmm1     # +-PBCx, or 0.0 if xij is within +-HALF
    vsubps    %xmm1, %xmm0, %xmm0     # xij -= PBCx, -PBCx, or 0.0
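
For anyone who wants to check the logic without reading asm, here is a rough
portable C sketch of one lane of that sequence.  The helper names (float_bits,
bits_float, wrap_lane, wrap_array) are made up for illustration, and it assumes
float is IEEE binary32.  It's not what a compiler would emit, just the same
steps in the same order.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer (well-defined via memcpy). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret an integer's bits as a float. */
static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* One lane of the vector sequence; each line names the instruction it mirrors.
 * Assumption: float is IEEE binary32, so bit 31 is the sign bit. */
static float wrap_lane(float xij, float half, float pbcx)
{
    const uint32_t sign_mask = UINT32_C(0x80000000);         /* set1(-0.0f)          */
    float abs_x = bits_float(float_bits(xij) & ~sign_mask);  /* andnps: abs(xij)     */
    uint32_t need = (half < abs_x) ? UINT32_MAX : 0u;        /* cmpltps: lane mask   */
    uint32_t sign = float_bits(xij) & sign_mask;             /* andps: signbit(xij)  */
    uint32_t pbc = float_bits(pbcx) ^ sign;                  /* xorps: +PBCx / -PBCx */
    float delta = bits_float(pbc & need);                    /* andps: +-PBCx or 0.0 */
    return xij - delta;                                      /* subps                */
}

/* Apply the fixup to n values; a vectorizer would do several lanes per step. */
static void wrap_array(float *x, size_t n, float half, float pbcx)
{
    for (size_t i = 0; i < n; i++)
        x[i] = wrap_lane(x[i], half, pbcx);
}
```

With HALF = 0.5 and PBCx = 1.0, an input of 0.75 comes out as -0.25, -0.75
comes out as 0.25, and anything within +-0.5 is returned unchanged.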

There's a good amount of ILP here, but the critical path is ANDNPS + CMPPS +
ANDPS + SUBPS = 1 + 4 + 1 + 4 = 10 cycles on Skylake.

We might want to use VPAND for some of this on Haswell, to avoid a port 5
bottleneck at least on the critical path.  (Skylake runs FP booleans on any
port.  BDW and earlier restrict them to port 5 where they can't compete with
FMA, and where bypass latency is always optimal.  On SKL they can introduce
extra bypass latency if they pick p0 or p1.)

----

    vandps    %xmm5, %xmm0, %xmm2     # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2     # PBCx (xij>=0) or -PBCx  (xij<0)

could be replaced with a (v)blendvps using the original xij to select between
PBCx and -PBCx.  With the SSE encoding, that saves a uop and a cycle of latency
(but only off the critical path).  And I think it would cost us a vmovaps to
set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't
really looked.  It should be much better for *latency*.  But it's only
equivalent if subtracting PBCx can't possibly make xij negative enough that the
next IF condition is also true, i.e. if PBCx <= 2*HALF (assuming both are
positive).
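
To make the equivalence question concrete, here is a small scalar C sketch
comparing the two forms.  The function names wrap_two_ifs and wrap_single are
hypothetical, and the single-fixup version uses a plain ternary for the sign
instead of the XOR trick, which should behave the same for ordinary inputs.

```c
/* Reference: the two IF statements as written in the source, in sequence. */
static float wrap_two_ifs(float xij, float half, float pbcx)
{
    if (xij > half)  xij -= pbcx;
    if (xij < -half) xij += pbcx;
    return xij;
}

/* Single fixup: test the magnitude once, then subtract PBCx with xij's sign.
 * Only matches wrap_two_ifs when the first IF can't trigger the second. */
static float wrap_single(float xij, float half, float pbcx)
{
    if (xij > half || xij < -half)
        xij -= (xij < 0.0f) ? -pbcx : pbcx;
    return xij;
}
```

With HALF = 0.5 and PBCx = 1.0 the two agree.  With HALF = 0.5 and PBCx = 2.0,
an input of 0.75 makes both IFs fire in the reference version (0.75 -> -1.25
-> 0.75), while the single fixup stops at -1.25, so the transformation would
need to know something about the relationship between HALF and PBCx.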

---

I was looking at a similar case of applying a fixup if the abs value of an
input is outside a range in
https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840
I don't think I came up with anything there that's not already obvious or
covered by the example above, though.

Except if we had needed to square xij at some point anyway, we could have
checked xij * xij < HALF*HALF as the in-bounds condition to save the ANDNPS.
But then the mulps latency is on the path leading into cmpps.
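
A minimal C sketch of the two ways to test the range, with hypothetical names
needs_fixup_abs and needs_fixup_sq.  It assumes HALF is non-negative and float
is IEEE binary32.  The squared form can round differently right at the boundary
and can overflow or underflow for extreme inputs, so treat it as an
illustration rather than a drop-in replacement.

```c
#include <stdint.h>
#include <string.h>

/* Magnitude test as in the vector sequence: clear the sign bit, then compare. */
static int needs_fixup_abs(float xij, float half)
{
    uint32_t bits;
    memcpy(&bits, &xij, sizeof bits);
    bits &= UINT32_C(0x7FFFFFFF);       /* same effect as ANDNPS with -0.0f */
    memcpy(&xij, &bits, sizeof xij);
    return xij > half;
}

/* Alternative: reuse a square that is (hypothetically) already computed for
 * some other purpose, and compare against HALF*HALF instead. */
static int needs_fixup_sq(float xij_squared, float half)
{
    return xij_squared > half * half;
}
```

HALF*HALF is loop-invariant, so it could be hoisted; the multiply that
produces xij_squared is what ends up feeding the compare.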
