https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82680
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
gcc's sequence is *probably* good, as long as it uses xor / comisd / setcc and not comisd / setcc / movzx (which gcc often likes to do for integer setcc).

(u)comisd and cmpeqsd both run on the FP add unit. Agner Fog doesn't list the latency. (It's hard to measure, because you'd need to construct a round-trip back to FP.)

XOR-zeroing is as cheap as a NOP on Intel SnB-family, but uses an execution port on AMD, so gcc's sequence is the same front-end uops but fewer unfused-domain uops for the execution units on SnB. Also, the xor-zeroing is off the critical path on all CPUs. (But ucomisd latency is probably as high as cmpeqsd + movd.)

Hmm, AMD bdver* and Ryzen take 2 uops for comisd, so for tune=generic it's probably worth thinking about using ICC's sequence.

ICC's sequence is especially good if you're doing something with the integer result that can optimize away the NEG (e.g. use it with AND instead of a CMOV to conditionally zero something, or AND it with another condition). Or if you're storing the boolean result to memory: psrld $31, %xmm0 or PAND, then movd directly to memory without going through integer regs.

comisd doesn't destroy either of its args, but cmpeqsd does (without AVX). If you want both x and y afterwards (e.g. if they weren't equal, or you care about -0.0 and +0.0 being different even though they compare equal), then comisd is a win.

So I think we need to look at the choices given some more surrounding code. I'll hopefully look at this some more soon.
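
To make the "AND instead of CMOV" point concrete, here is a rough C sketch of the kind of surrounding code where the cmpeqsd-style all-ones mask is useful directly. The helper names (eq_mask, keep_if_equal, equal_as_bool, same_bits) are made up for illustration and are not from the testcase in this bug. The SSE2 path uses the _mm_cmpeq_sd intrinsic and there is a plain-C fallback for other targets. What asm a given compiler actually emits for this is not something the sketch guarantees.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_cmpeq_sd, _mm_cvtsi128_si32 */
#endif

/* Hypothetical helper: returns -1 (all bits set) if x == y, else 0.
 * With SSE2 this is the cmpeqsd + movd idea: the compare leaves an
 * all-ones or all-zero mask in the low element, and we take its low
 * 32 bits as an integer.  NaN operands give 0, same as x == y in C. */
static int eq_mask(double x, double y)
{
#if defined(__SSE2__)
    __m128d m = _mm_cmpeq_sd(_mm_set_sd(x), _mm_set_sd(y));
    return _mm_cvtsi128_si32(_mm_castpd_si128(m));
#else
    return -(int)(x == y);   /* portable fallback (assumption: no SSE2) */
#endif
}

/* Conditionally zero v: AND with the mask, no 0/1 boolean or CMOV needed. */
static int keep_if_equal(int v, double x, double y)
{
    return v & eq_mask(x, y);
}

/* If a 0/1 result really is needed, one extra AND (or NEG) produces it. */
static int equal_as_bool(double x, double y)
{
    return eq_mask(x, y) & 1;
}

/* -0.0 and +0.0 compare equal but have different bit patterns, so code
 * that cares about the difference still needs the original x and y
 * after the compare. */
static int same_bits(double x, double y)
{
    uint64_t a, b;
    memcpy(&a, &x, sizeof a);
    memcpy(&b, &y, sizeof b);
    return a == b;
}
```

In keep_if_equal the mask is consumed as-is, which is the case where the NEG in ICC's sequence can go away. same_bits is the case where both inputs must still be live after the compare, which is where comisd (or AVX's non-destructive encoding) has the edge.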