https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466
Bug ID: 85466
Summary: Performance is slow when doing 'branchless' conditional style math operations
Product: gcc
Version: 7.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: cpphackster at gmail dot com
Target Milestone: --

I have been investigating turning if statements into math operations, inspired by a blog article:

http://theorangeduck.com/page/avoiding-shader-conditionals

and by other resources listed here:

https://gist.github.com/unitycoder/4d988bb21b3ce820eaa23028ed6d04bd

There are also many 'branchless' techniques on Stack Overflow, such as the signum function needed for the branchless operations:

https://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c

I set up a Quick Bench benchmark to test whether this branchless code is also faster on the CPU:

http://quick-bench.com/o5lYur5c9rVuOyAn6-fzDf6xTuk

Take a case such as

    if (myVector[n] > 0.5) {
        result[n] = 0.8f;
    } else {
        result[n] = 0.1f;
    }

which gets turned into the branchless

    result[n] = lerp(0.1f, 0.8f, when_gt(myVector[n], 0.5f));

With clang, the branchless version runs ~2x faster than the standard if statement (clang seems to turn it into a lot of vectorized code, including many movups instructions). With gcc, the branchless version is very slow, slower even than the plain if-statement baseline.

One suspect part: ~68% of the time is spent on a single instruction of gcc's loop, the addss in the annotated listing of ifNoConditional.
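For readers who do not open the Quick Bench link, the two loops being compared might look roughly like the sketch below. The helper names sgn, when_gt and lerp follow the report; their bodies are assumptions modelled on the linked blog (whose GLSL when_gt is max(sign(x - y), 0.0)) and on the (0 < x) - (x < 0) signum idiom from the Stack Overflow thread. The names withConditional, withoutConditional and the namespace branchless are hypothetical and are not the benchmark's actual code. In the annotated listing, the seta/seta/sub/cvtsi2ss run appears to be that signum idiom, and the ucomiss/jbe pair after it appears to be a compare-and-branch (possibly the max(..., 0)), so the "branchless" source may not end up branchless with gcc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers; the real benchmark code may differ.
// A namespace keeps this lerp from clashing with C++20 std::lerp.
namespace branchless {

// Signum via (0 < x) - (x < 0): returns -1, 0 or +1 without an if.
template <typename T>
int sgn(T x) {
    return (T(0) < x) - (x < T(0));
}

// 1.0f when x > y, otherwise 0.0f, as max(sign(x - y), 0) from the blog.
inline float when_gt(float x, float y) {
    return std::max(static_cast<float>(sgn(x - y)), 0.0f);
}

// Assumed lerp form (1 - t) * a + t * b: returns exactly a at t == 0
// and exactly b at t == 1.
inline float lerp(float a, float b, float t) {
    return (1.0f - t) * a + t * b;
}

}  // namespace branchless

// Baseline: ordinary if/else per element.
void withConditional(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        if (in[n] > 0.5f) {
            out[n] = 0.8f;
        } else {
            out[n] = 0.1f;
        }
    }
}

// "Branchless" variant: the comparison becomes a 0/1 weight for lerp.
void withoutConditional(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = branchless::lerp(0.1f, 0.8f, branchless::when_gt(in[n], 0.5f));
    }
}
```

Both functions produce identical results (including at exactly 0.5, where neither selects 0.8f). Compiling them with -O2 -S and looking for a conditional jump inside withoutConditional's loop is one way to check whether the max(..., 0) survives as a branch.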
Annotated hot loop of ifNoConditional (gcc build):

      3.00%   mulss    0x8(%rsp),%xmm0
     67.88%   addss    %xmm3,%xmm0
      4.60%   movss    %xmm0,(%rbx,%rdx,1)
      2.14%   add      $0x4,%rdx
              cmp      $0x61a80,%rdx
              je       4053a0 <ifNoConditional(benchmark::State&)+0x1d0>
              movss    0x0(%rbp,%rdx,1),%xmm0
      0.68%   xor      %eax,%eax
      0.45%   subss    %xmm2,%xmm0
      1.89%   ucomiss  %xmm1,%xmm0
      1.47%   seta     %al
      1.85%   xor      %ecx,%ecx
              ucomiss  %xmm0,%xmm1
              pxor     %xmm0,%xmm0
              seta     %cl
      1.17%   sub      %ecx,%eax
      0.90%   cvtsi2ss %eax,%xmm0
      4.29%   ucomiss  %xmm0,%xmm1
      2.89%   movaps   %xmm4,%xmm0
              jbe      405350 <ifNoConditional(benchmark::State&)+0x180>
      3.02%   mulss    0xc(%rsp),%xmm0
      3.72%   jmp      405356 <ifNoConditional(benchmark::State&)+0x186>

I'm happy to help out with testing any builds or fixes for this. My assembly knowledge is limited, but I'm willing to help where possible (running benchmarks, etc.).

Cheers,
Dan