https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
--- Comment #21 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Daniel Elliott from comment #20)
> still clang is 1.64x faster. had a look at the assembly. My limited
> understanding makes me think that the ucomiss is not fully vectorized and
> the clang one is (clangs ucomiss %xmm0,%xmm1 vs gcc's ucomiss
> 0x218b4(%rip),%xmm0). Feel free to correct me if I am wrong.

Nothing gets vectorized (likely because of the "dontoptimize" code). The
ucomiss difference is that llvm keeps the constant .5f in a register, while gcc
reloads it every time. I don't know if the speed difference comes from that, or
from some subtle tuning arrangement of the operations (I didn't try to
understand why llvm has 4 mov where gcc has only 2).
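
For anyone who wants to look at the two operand forms themselves, here is a minimal sketch of a scalar loop that compares each element against .5f. It is not the benchmark from this report: the function name count_above_half and the loop shape are assumptions made up for illustration. Whether a given compiler keeps the constant in a register (ucomiss %xmm0,%xmm1 style) or uses a RIP-relative memory operand, and whether it vectorizes the loop, will depend on compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, not taken from the bug report: counts the
 * elements of data[0..n) that are strictly greater than 0.5f.
 * The comparison against the float constant is the part whose
 * operand form (register vs. memory) is worth inspecting. */
size_t count_above_half(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] > 0.5f)   /* scalar float compare against the constant */
            ++count;
    }
    return count;
}
```

Compiling this with -O2 -S under both gcc and clang and searching the output for the compare instruction should show which form each compiler picks for this particular loop; it may well differ from what was seen in the original test case.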