https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
Bug ID: 85466
Summary: Performance is slow when doing 'branchless' conditional style math operations
Product: gcc
Version: 7.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: cpphackster at gmail dot com
Target Milestone: ---
I have been investigating turning if statements into math operations, inspired
by a blog article...
http://theorangeduck.com/page/avoiding-shader-conditionals
...and other resources listed here...
https://gist.github.com/unitycoder/4d988bb21b3ce820eaa23028ed6d04bd
There are also many 'branchless' snippets on Stack Overflow (like the
signum function needed for the branchless operations)
https://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c
I set up a Quick Bench benchmark to test whether this branchless code is also
faster on the CPU.
http://quick-bench.com/o5lYur5c9rVuOyAn6-fzDf6xTuk
It seems that for a case such as...
    if (myVector[n] > 0.5){
        result[n] = 0.8f;
    }
    else {
        result[n] = 0.1f;
    }
...which gets turned into the branchless....
    result[n] = lerp(0.1f, 0.8f, when_gt(myVector[n], 0.5f));
...clang runs ~2x faster than the standard if statement (it seems to turn the
loop into vectorized code, with many movups instructions).
GCC, however, is very slow, slower even than the standard if-statement base case.
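For reference, here is a minimal sketch of the two variants being compared. The helper definitions are assumptions modelled on the linked shader article and Stack Overflow answer; the exact helpers used in the benchmark may differ, and the `sketch` namespace and the function names `with_branch` and `without_branch` are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace, not from the benchmark

// Sign function in the style of the linked Stack Overflow answer:
// returns -1, 0 or +1.
template <typename T>
int signum(T val) {
    return (T(0) < val) - (val < T(0));
}

// 1.0f when x > y, otherwise 0.0f (assumed shape of when_gt).
inline float when_gt(float x, float y) {
    return std::max(static_cast<float>(signum(x - y)), 0.0f);
}

// Linear blend: yields a when t == 0 and b when t == 1 (assumed shape of lerp).
inline float lerp(float a, float b, float t) {
    return a * (1.0f - t) + b * t;
}

// Base case: an ordinary if/else per element.
// Assumes out.size() >= in.size().
inline void with_branch(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        if (in[n] > 0.5f) {
            out[n] = 0.8f;
        } else {
            out[n] = 0.1f;
        }
    }
}

// 'Branchless' variant: the condition becomes a 0/1 blend factor.
// Assumes out.size() >= in.size().
inline void without_branch(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = lerp(0.1f, 0.8f, when_gt(in[n], 0.5f));
    }
}

}  // namespace sketch
```

Both loops write the same values for every input; the only difference is whether the selection is expressed as control flow or as arithmetic.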
One suspect part is a single addss instruction, where ~68% of the time is being spent:
  3.00%   mulss    0x8(%rsp),%xmm0
 67.88%   addss    %xmm3,%xmm0
  4.60%   movss    %xmm0,(%rbx,%rdx,1)
  2.14%   add      $0x4,%rdx
          cmp      $0x61a80,%rdx
          je       4053a0 <ifNoConditional(benchmark::State&)+0x1d0>
          movss    0x0(%rbp,%rdx,1),%xmm0
  0.68%   xor      %eax,%eax
  0.45%   subss    %xmm2,%xmm0
  1.89%   ucomiss  %xmm1,%xmm0
  1.47%   seta     %al
  1.85%   xor      %ecx,%ecx
          ucomiss  %xmm0,%xmm1
          pxor     %xmm0,%xmm0
          seta     %cl
  1.17%   sub      %ecx,%eax
  0.90%   cvtsi2ss %eax,%xmm0
  4.29%   ucomiss  %xmm0,%xmm1
  2.89%   movaps   %xmm4,%xmm0
          jbe      405350 <ifNoConditional(benchmark::State&)+0x180>
  3.02%   mulss    0xc(%rsp),%xmm0
  3.72%   jmp      405356 <ifNoConditional(benchmark::State&)+0x186>
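Reading the listing back into C++, the comparison part of the loop appears to do roughly the following per element. This is only an interpretation of the disassembly above, not the benchmark source: the function name `when_gt_as_compiled` and its locals are made up, the listing is partial, and the mapping of the final ucomiss/jbe pair to the clamp is a guess.

```cpp
// Hypothetical scalar transcription of the compare/convert sequence above.
inline float when_gt_as_compiled(float x, float threshold) {
    float diff = x - threshold;           // subss
    int gt = 0.0f < diff;                 // ucomiss + seta %al
    int lt = diff < 0.0f;                 // ucomiss + seta %cl
    int sign = gt - lt;                   // sub %ecx,%eax
    float f = static_cast<float>(sign);   // cvtsi2ss (int -> float round trip)
    if (f < 0.0f) {                       // ucomiss + jbe: a real branch remains
        return 0.0f;
    }
    return f;
}
```

If this reading is right, the 'branchless' source still compiles to scalar code with an integer-to-float conversion and a conditional jump for each element, and none of it is vectorized.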
I'm happy to help out with testing any builds or fixes for this. My assembly
knowledge is limited, but I'm willing to help where possible (running benchmarks, etc.).
Cheers
Dan