[Bug c++/85466] New: Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

            Bug ID: 85466
           Summary: Performance is slow when doing 'branchless' conditional
                    style math operations
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cpphackster at gmail dot com
  Target Milestone: ---

I have been investigating turning if statements into math operations, inspired by a blog article...

http://theorangeduck.com/page/avoiding-shader-conditionals

...and other resources listed here...

https://gist.github.com/unitycoder/4d988bb21b3ce820eaa23028ed6d04bd

There are also many 'branchless' snippets on Stack Overflow (like the signum function needed for the branchless operations):

https://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c

I set up a Quick Bench benchmark to test whether this branchless code is faster on the CPU as well.

http://quick-bench.com/o5lYur5c9rVuOyAn6-fzDf6xTuk

Take a case such as...

    if (myVector[n] > 0.5){
        result[n] = 0.8f;
    } else {
        result[n] = 0.1f;
    }

...which gets turned into the branchless

    result[n] = lerp(0.1f, 0.8f, when_gt(myVec[n], 0.5f));

With clang, the branchless version runs ~2x faster than the standard if statement (it seems to turn it into a lot of vectorized code, with many movups instructions). With gcc, the branchless version is very slow, even compared to the standard base case. One suspect part is that ~68% of the time is reported at a single instruction (the addss below):

     3.00%   mulss    0x8(%rsp),%xmm0
    67.88%   addss    %xmm3,%xmm0
     4.60%   movss    %xmm0,(%rbx,%rdx,1)
     2.14%   add      $0x4,%rdx
             cmp      $0x61a80,%rdx
             je       4053a0
             movss    0x0(%rbp,%rdx,1),%xmm0
     0.68%   xor      %eax,%eax
     0.45%   subss    %xmm2,%xmm0
     1.89%   ucomiss  %xmm1,%xmm0
     1.47%   seta     %al
     1.85%   xor      %ecx,%ecx
             ucomiss  %xmm0,%xmm1
             pxor     %xmm0,%xmm0
             seta     %cl
     1.17%   sub      %ecx,%eax
     0.90%   cvtsi2ss %eax,%xmm0
     4.29%   ucomiss  %xmm0,%xmm1
     2.89%   movaps   %xmm4,%xmm0
             jbe      405350
     3.02%   mulss    0xc(%rsp),%xmm0
     3.72%   jmp      405356

I'm happy to help out with testing any builds or fixes for this. My assembly knowledge is limited, but I'm willing to help out where possible, run benchmarks, etc.

Cheers
Dan
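For readers without access to the Quick Bench link, the two variants being compared can be sketched roughly as follows. This is only a sketch under assumptions: the bodies of sign, when_gt and lerp are guessed from the shader-conditionals article and the signum idiom in the Stack Overflow link (the seta/seta/sub/cvtsi2ss sequence in the listing looks consistent with that idiom), the function names ifStandard and ifNoConditional are illustrative, and the sketch namespace is hypothetical. It is not the benchmark source.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

namespace sketch {

// Assumed helper: signum via the (0 < v) - (v < 0) idiom; yields -1, 0 or +1.
inline float sign(float v) {
    return static_cast<float>((0.0f < v) - (v < 0.0f));
}

// Assumed helper: 1.0f when x > y, otherwise 0.0f, with no explicit branch.
inline float when_gt(float x, float y) {
    return std::max(sign(x - y), 0.0f);
}

// Assumed helper: linear interpolation; gives a for t == 0 and roughly b for t == 1.
inline float lerp(float a, float b, float t) {
    return a + t * (b - a);
}

// Baseline: explicit if/else per element.
inline void ifStandard(const std::vector<float>& myVector, std::vector<float>& result) {
    for (std::size_t n = 0; n < myVector.size(); ++n) {
        if (myVector[n] > 0.5f) {
            result[n] = 0.8f;
        } else {
            result[n] = 0.1f;
        }
    }
}

// 'Branchless' variant: the condition becomes a 0/1 blend factor.
inline void ifNoConditional(const std::vector<float>& myVector, std::vector<float>& result) {
    for (std::size_t n = 0; n < myVector.size(); ++n) {
        result[n] = lerp(0.1f, 0.8f, when_gt(myVector[n], 0.5f));
    }
}

}  // namespace sketch
```

Note that the blended result for the "true" case goes through a subtraction and an addition, so it may differ from the literal 0.8f in the last bit; the two variants are equivalent only up to floating-point rounding.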
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #14 from Daniel Elliott ---
I had a response from Chandler Carruth on Twitter, who informed me that in the benchmark the computation was being hoisted out of the loop, so that's why clang was faster. He also said that the noconditional version was not vectorized.
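As an illustration of the hoisting issue, here is a minimal sketch of a repeated timing loop with a compiler barrier between passes. Everything in it is an assumption for illustration: clobberMemory and runIterations are hypothetical names, the barrier relies on GCC/Clang extended asm (not standard C++), and it is not how the Quick Bench code was actually written or fixed.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical barrier (GCC/Clang extended asm, not standard C++): the empty
// asm statement with a "memory" clobber makes the compiler assume that memory
// reachable through p may have been read or modified.
inline void clobberMemory(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Hypothetical timing loop: repeats the same element-wise work `iterations` times.
inline void runIterations(const std::vector<float>& input,
                          std::vector<float>& result, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t n = 0; n < input.size(); ++n) {
            result[n] = input[n] > 0.5f ? 0.8f : 0.1f;
        }
        // Without a barrier, the optimizer may notice that every pass produces
        // identical results and do the work only once.
        clobberMemory(result.data());
    }
}

}  // namespace sketch
```

Whether a given compiler actually hoists the work depends on the optimization level and on what it can prove about the data, so the barrier is a precaution rather than a guarantee of comparable timings.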
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #15 from Daniel Elliott ---
Good catch Jonathan on the return type of max. (PS also enjoyed your ACCU talk on YouTube.)

I have also been messing around with the benchmark a bit and have come to the conclusion that the sign function and the max aren't really necessary. Simple ternary operators seem to do a better job. However, gcc still seems to be slower than clang in these cases.

I've attached benchmarkv2, which on my Ivy Bridge 2013 MacBook Pro gets:

    Benchmark            Time         CPU   Iterations

    GCC:
    ifStandard         600741 ns    600594 ns    1056
    ifNoConditional    191043 ns    191000 ns    3694

    Clang:
    ifStandard          88777 ns     88726 ns    7439
    ifNoConditional     89818 ns     89777 ns    7910

Interestingly for the gcc case, if I return float from the when_greater_than function (which is just doing x > y ? 1 : 0;) then it matches gcc ifStandard speed exactly, but if I return a float then it goes down to the ~191000 ns speed shown above. But that is still not as fast as both clang cases.

Have to say this is a lot of fun, and thanks everyone for looking at this!
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #16 from Daniel Elliott ---
Created attachment 44001
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44001&action=edit
revised benchmark w/different approach
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #17 from Daniel Elliott ---
My previous comment above was meant to say this (change from float to int):

Interestingly for the gcc case, if I return float from the when_greater_than function (which is just doing x > y ? 1 : 0;) then it matches gcc ifStandard speed exactly, but if I return an >> int << then it goes down to the ~191000 ns speed shown above. But that is still not as fast as both clang cases.
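The return-type difference described here can be sketched as two variants of when_greater_than. This is a guess at the shape of the code, not the attached benchmark: the _float and _int suffixes, the lerp helper, the fill functions and the sketch namespace are all hypothetical, and how the attachment actually consumes the result is an assumption.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Variant returning float: the ternary yields the blend factor directly.
inline float when_greater_than_float(float x, float y) {
    return x > y ? 1.0f : 0.0f;
}

// Variant returning int: the 0/1 result is converted to float by the caller.
inline int when_greater_than_int(float x, float y) {
    return x > y ? 1 : 0;
}

// Assumed blend helper: a for t == 0, roughly b for t == 1.
inline float lerp(float a, float b, float t) {
    return a + t * (b - a);
}

// Hypothetical loop using the float-returning variant.
inline void fillFloat(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = lerp(0.1f, 0.8f, when_greater_than_float(in[n], 0.5f));
    }
}

// Hypothetical loop using the int-returning variant; note the explicit conversion.
inline void fillInt(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = lerp(0.1f, 0.8f, static_cast<float>(when_greater_than_int(in[n], 0.5f)));
    }
}

}  // namespace sketch
```

Both loops compute the same values; only the type carried between the comparison and the blend differs, which is the detail reported to change gcc's timing.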
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #20 from Daniel Elliott ---
Cool. Just tried that. It gets gcc down to:

    GCC:
    ---
    ifStandard         596892 ns
    ifNoConditional    148075 ns  <--- with "result[n] = tab[item > .5f];" trick

    Clang: (no change)
    ifStandard          88777 ns
    ifNoConditional     89818 ns
    --

Still, clang is 1.64x faster. I had a look at the assembly. My limited understanding makes me think that the ucomiss is not fully vectorized and the clang one is (clang's ucomiss %xmm0,%xmm1 vs gcc's ucomiss 0x218b4(%rip),%xmm0). Feel free to correct me if I am wrong.

clang:

             movss    0x61a80(%r15,%rcx,1),%xmm1
    22.95%   xor      %eax,%eax
             ucomiss  %xmm0,%xmm1
    13.81%   seta     %al
    22.55%   mov      0x4335d0(,%rax,4),%eax
     4.31%   mov      %eax,0x61a80(%rbx,%rcx,1)
    22.03%   movss    0x61a84(%rbx,%rcx,1),%xmm1
     0.40%   movss    %xmm1,0xc(%rsp)
    13.93%   add      $0x4,%rcx
             jne      404b50

gcc:

    14.45%   movss    0x0(%r13,%rax,1),%xmm0
     0.18%   xor      %edx,%edx
    21.27%   ucomiss  0x218b4(%rip),%xmm0        # 426bf4 <_IO_stdin_used+0x34>
    16.84%   seta     %dl
    21.79%   movss    0x8(%rsp,%rdx,4),%xmm0
     1.41%   movss    %xmm0,(%r12,%rax,1)
    23.94%   add      $0x4,%rax
             cmp      $0x61a80,%rax
             jne      405330
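The table-lookup trick quoted in the results can be sketched as below. Only the statement result[n] = tab[item > .5f]; comes from the thread; the function name ifTableLookup, the sketch namespace, the vector parameters and the contents of tab (taken from the 0.1f/0.8f values in the original report) are assumptions.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical wrapper around the lookup trick: the comparison result (false/true)
// converts to index 0 or 1 and selects the output value from a two-entry table.
inline void ifTableLookup(const std::vector<float>& myVector, std::vector<float>& result) {
    const float tab[2] = {0.1f, 0.8f};  // tab[0]: condition false, tab[1]: condition true
    for (std::size_t n = 0; n < myVector.size(); ++n) {
        const float item = myVector[n];
        result[n] = tab[item > .5f];
    }
}

}  // namespace sketch
```

Unlike the lerp-based variant, this does no arithmetic on the output values, so the results match the plain if/else version exactly. The indexed loads in both listings (mov 0x4335d0(,%rax,4) for clang, movss 0x8(%rsp,%rdx,4) for gcc) look like this kind of table read.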
[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

--- Comment #22 from Daniel Elliott ---
(In reply to Marc Glisse from comment #21)
> (In reply to Daniel Elliott from comment #20)
> > still clang is 1.64x faster. had a look at the assembly. My limited
> > understanding makes me think that the ucomiss is not fully vectorized and
> > the clang one is (clangs ucomiss %xmm0,%xmm1 vs gcc's ucomiss
> > 0x218b4(%rip),%xmm0). Feel free to correct me if I am wrong.
>
> Nothing gets vectorized (likely because of the "dontoptimize" code). The
> ucomiss difference is that llvm keeps the constant .5f in a register, while
> gcc reloads it every time. I don't know if the speed difference comes from
> that, or from some subtle tuning arrangement of the operations (I didn't try
> to understand why llvm has 4 mov where gcc has only 2).

Right, I thought that because it was using xmm0, that meant a vector register. I'm going to go and read up on assembly!