https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

            Bug ID: 85466
           Summary: Performance is slow when doing 'branchless'
                    conditional style math operations
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cpphackster at gmail dot com
  Target Milestone: ---

I have been investigating turning if statements into math operations inspired
by a blog article...

http://theorangeduck.com/page/avoiding-shader-conditionals

...and other resources listed here...
https://gist.github.com/unitycoder/4d988bb21b3ce820eaa23028ed6d04bd

There are also many 'branchless' snippets on Stack Overflow (like the
signum function needed for the branchless operations)

https://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c


I set up a Quick Bench benchmark to test whether this branchless code is also
faster on the CPU.

http://quick-bench.com/o5lYur5c9rVuOyAn6-fzDf6xTuk

It seems that for a case such as...

if (myVector[n] > 0.5){
    result[n] = 0.8f;
}
else {
    result[n] = 0.1f;
}

...which gets turned into the branchless....

result[n] = lerp(0.1f, 0.8f, when_gt(myVector[n], 0.5f));

...with clang the branchless version runs ~2x faster than the standard if
statement (clang seems to turn it into a lot of vectorized code, with many
movups instructions).

With gcc the branchless version is very slow, even compared to the standard
if statement base case.
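
For reference, here is a minimal sketch of what the helpers in the branchless
line might look like. It assumes definitions along the lines of the blog
article and the Stack Overflow signum answer linked above; the exact code is
in the Quick Bench link and may differ. The names sign, select_branchy and
select_branchless are hypothetical and only used for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed signum helper: returns -1, 0, or +1 as an int.
template <typename T>
int sign(T v) { return (T(0) < v) - (v < T(0)); }

// Assumed definition: 1.0f when x > y, otherwise 0.0f.
inline float when_gt(float x, float y) {
    return std::max(static_cast<float>(sign(x - y)), 0.0f);
}

// Linear interpolation: yields a when t == 0 and b when t == 1.
inline float lerp(float a, float b, float t) {
    return a * (1.0f - t) + b * t;
}

// Baseline: the plain if/else loop from the report.
inline void select_branchy(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        if (in[n] > 0.5f) {
            out[n] = 0.8f;
        } else {
            out[n] = 0.1f;
        }
    }
}

// "Branchless" form: the same selection written as arithmetic.
inline void select_branchless(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = lerp(0.1f, 0.8f, when_gt(in[n], 0.5f));
    }
}
```

Both loops should produce the same values for every input, so any timing
difference comes from the code the compiler generates rather than from the
results.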

One suspicious spot: ~68% of the time is attributed to a single instruction,
the addss below.

3.00%  mulss  0x8(%rsp),%xmm0
67.88% addss  %xmm3,%xmm0
4.60%  movss  %xmm0,(%rbx,%rdx,1)
2.14%  add    $0x4,%rdx
       cmp    $0x61a80,%rdx
       je     4053a0 <ifNoConditional(benchmark::State&)+0x1d0>
       movss  0x0(%rbp,%rdx,1),%xmm0
0.68%  xor    %eax,%eax
0.45%  subss  %xmm2,%xmm0
1.89%  ucomiss %xmm1,%xmm0
1.47%  seta   %al
1.85%  xor    %ecx,%ecx
       ucomiss %xmm0,%xmm1
       pxor   %xmm0,%xmm0
       seta   %cl
1.17%  sub    %ecx,%eax
0.90%  cvtsi2ss %eax,%xmm0
4.29%  ucomiss %xmm0,%xmm1
2.89%  movaps %xmm4,%xmm0
       jbe    405350 <ifNoConditional(benchmark::State&)+0x180>
3.02%  mulss  0xc(%rsp),%xmm0
3.72%  jmp    405356 <ifNoConditional(benchmark::State&)+0x186>


I'm happy to help out with testing any builds or fixes for this. My assembly
knowledge is limited, but I'm willing to help out where possible, run
benchmarks, etc.

Cheers
Dan
