https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
--- Comment #20 from Daniel Elliott <cpphackster at gmail dot com> ---
Cool, just tried that.
It gets GCC down to:
GCC:
-------------------------------------------------------
ifStandard 596892 ns
ifNoConditional 148075 ns <--- with "result[n] = tab[item > .5f];" trick
Clang: (no change)
ifStandard 88777 ns
ifNoConditional 89818 ns
------------------------------------------------------
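Clang is still about 1.65x faster on ifNoConditional (148075 ns vs. 89818 ns).
I had a look at the assembly. Both loops use the scalar ucomiss, so neither
looks vectorized. The difference I can see is in the operands: Clang compares
two registers (ucomiss %xmm0,%xmm1), while GCC compares against a memory
operand (ucomiss 0x218b4(%rip),%xmm0). My limited understanding makes me think
that is where the gap comes from. Feel free to correct me if I am wrong.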
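For reference, here is a minimal sketch of the two loop shapes being compared.
Only the line "result[n] = tab[item > .5f];" is taken from the benchmark. The
function names, the lo/hi output values and the use of std::vector are
assumptions made for illustration and may not match the real test case.

```cpp
#include <cstddef>
#include <vector>

// Branchy form: a hypothetical stand-in for the "ifStandard" loop.
std::vector<float> if_standard(const std::vector<float>& in, float lo, float hi) {
    std::vector<float> result(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        const float item = in[n];
        if (item > .5f) {
            result[n] = hi;
        } else {
            result[n] = lo;
        }
    }
    return result;
}

// Table form: the comparison yields a bool (0 or 1) that indexes a
// two-entry table, so the source contains no branch on the data.
std::vector<float> if_no_conditional(const std::vector<float>& in, float lo, float hi) {
    const float tab[2] = {lo, hi};  // tab[0] when item <= .5f, tab[1] when item > .5f
    std::vector<float> result(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        const float item = in[n];
        result[n] = tab[item > .5f];
    }
    return result;
}
```

Both functions return the same values for the same input. The table form maps
onto the compare, seta, indexed-load sequence visible in the two listings.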
clang:
movss 0x61a80(%r15,%rcx,1),%xmm1
22.95% xor %eax,%eax
ucomiss %xmm0,%xmm1
13.81% seta %al
22.55% mov 0x4335d0(,%rax,4),%eax
4.31% mov %eax,0x61a80(%rbx,%rcx,1)
22.03% movss 0x61a84(%rbx,%rcx,1),%xmm1
0.40% movss %xmm1,0xc(%rsp)
13.93% add $0x4,%rcx
jne 404b50 <ifNoConditional(benchmark::State&)+0x180>
gcc:
14.45% movss 0x0(%r13,%rax,1),%xmm0
0.18% xor %edx,%edx
21.27% ucomiss 0x218b4(%rip),%xmm0 # 426bf4 <_IO_stdin_used+0x34>
16.84% seta %dl
21.79% movss 0x8(%rsp,%rdx,4),%xmm0
1.41% movss %xmm0,(%r12,%rax,1)
23.94% add $0x4,%rax
cmp $0x61a80,%rax
jne 405330 <ifNoConditional(benchmark::State&)+0x160>