https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
--- Comment #20 from Daniel Elliott <cpphackster at gmail dot com> ---
Cool, just tried that. It gets gcc down to:

GCC:
-------------------------------------------------------
ifStandard         596892 ns
ifNoConditional    148075 ns   <--- with "result[n] = tab[item > .5f];" trick

Clang: (no change)
ifStandard          88777 ns
ifNoConditional     89818 ns
------------------------------------------------------

Still, clang is 1.64x faster.

I had a look at the assembly. My limited understanding makes me think that
gcc's ucomiss is not fully vectorized and clang's is (clang's
ucomiss %xmm0,%xmm1 vs gcc's ucomiss 0x218b4(%rip),%xmm0). Feel free to
correct me if I am wrong.

clang:
           movss   0x61a80(%r15,%rcx,1),%xmm1
  22.95%   xor     %eax,%eax
           ucomiss %xmm0,%xmm1
  13.81%   seta    %al
  22.55%   mov     0x4335d0(,%rax,4),%eax
   4.31%   mov     %eax,0x61a80(%rbx,%rcx,1)
  22.03%   movss   0x61a84(%rbx,%rcx,1),%xmm1
   0.40%   movss   %xmm1,0xc(%rsp)
  13.93%   add     $0x4,%rcx
           jne     404b50 <ifNoConditional(benchmark::State&)+0x180>

gcc:
  14.45%   movss   0x0(%r13,%rax,1),%xmm0
   0.18%   xor     %edx,%edx
  21.27%   ucomiss 0x218b4(%rip),%xmm0        # 426bf4 <_IO_stdin_used+0x34>
  16.84%   seta    %dl
  21.79%   movss   0x8(%rsp,%rdx,4),%xmm0
   1.41%   movss   %xmm0,(%r12,%rax,1)
  23.94%   add     $0x4,%rax
           cmp     $0x61a80,%rax
           jne     405330 <ifNoConditional(benchmark::State&)+0x160>
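For readers who have not seen the benchmark source, here is a rough sketch of what the two variants being compared might look like. Only the function names and the line result[n] = tab[item > .5f]; come from this comment. The container type, the table values kLow and kHigh, and the function signatures are assumptions made for illustration, and the real benchmark bodies take a benchmark::State& instead.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical output values: the comment does not show the real table contents.
constexpr float kLow = 0.0f;
constexpr float kHigh = 1.0f;

// Branchy form: one if/else per element.
std::vector<float> ifStandard(const std::vector<float>& input) {
    std::vector<float> result(input.size());
    for (std::size_t n = 0; n < input.size(); ++n) {
        if (input[n] > .5f)
            result[n] = kHigh;
        else
            result[n] = kLow;
    }
    return result;
}

// Table-lookup form: the comparison yields a bool, which converts to
// index 0 or 1 into a two-entry table, so the source has no if/else.
std::vector<float> ifNoConditional(const std::vector<float>& input) {
    const float tab[2] = {kLow, kHigh};
    std::vector<float> result(input.size());
    for (std::size_t n = 0; n < input.size(); ++n) {
        const float item = input[n];
        result[n] = tab[item > .5f];
    }
    return result;
}
```

Both listings above seem consistent with this shape: a ucomiss compare, a seta that turns the flag into 0 or 1, and then an indexed load from a two-entry table.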