[Bug c++/85466] New: Performance is slow when doing 'branchless' conditional style math operations

2018-04-19 Thread cpphackster at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85466

            Bug ID: 85466
           Summary: Performance is slow when doing 'branchless'
                    conditional style math operations
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cpphackster at gmail dot com
  Target Milestone: ---

I have been investigating turning if statements into math operations inspired
by a blog article...

http://theorangeduck.com/page/avoiding-shader-conditionals

...and other resources listed here...
https://gist.github.com/unitycoder/4d988bb21b3ce820eaa23028ed6d04bd

There are also many 'branchless' snippets on Stack Overflow (like the
signum function needed for the branchless operations)

https://stackoverflow.com/questions/1903954/is-there-a-standard-sign-function-signum-sgn-in-c-c


I set up a quickbench benchmark to test if this branchless code is faster on
CPU as well.

http://quick-bench.com/o5lYur5c9rVuOyAn6-fzDf6xTuk

It seems that for a case such as...

if (myVector[n] > 0.5){
    result[n] = 0.8f;
}
else {
    result[n] = 0.1f;
}

...which gets turned into the branchless

result[n] = lerp(0.1f, 0.8f, when_gt(myVec[n], 0.5f));

...the clang build of the branchless version runs ~2x faster than the standard
if statement (clang seems to turn it into a lot of vectorized code, with many
movups instructions).

With gcc, the branchless version is very slow, even compared to the standard
if statement base case.

One suspect is that ~68% of the time is being spent in one part of the code:

 3.00%  mulss  0x8(%rsp),%xmm0
67.88%  addss  %xmm3,%xmm0
 4.60%  movss  %xmm0,(%rbx,%rdx,1)
 2.14%  add    $0x4,%rdx
        cmp    $0x61a80,%rdx
        je     4053a0
        movss  0x0(%rbp,%rdx,1),%xmm0
 0.68%  xor    %eax,%eax
 0.45%  subss  %xmm2,%xmm0
 1.89%  ucomiss %xmm1,%xmm0
 1.47%  seta   %al
 1.85%  xor    %ecx,%ecx
        ucomiss %xmm0,%xmm1
        pxor   %xmm0,%xmm0
        seta   %cl
 1.17%  sub    %ecx,%eax
 0.90%  cvtsi2ss %eax,%xmm0
 4.29%  ucomiss %xmm0,%xmm1
 2.89%  movaps %xmm4,%xmm0
        jbe    405350
 3.02%  mulss  0xc(%rsp),%xmm0
 3.72%  jmp    405356


I'm happy to help out with testing any builds or fixes for this. My assembly
knowledge is limited, but I'm willing to help out where possible, run
benchmarks, etc.

Cheers
Dan

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-19 Thread cpphackster at gmail dot com

--- Comment #14 from Daniel Elliott  ---
I had a response from Chandler Carruth on Twitter, who informed me that the
computation in the benchmark was being hoisted out of the loop, so that's why
clang was faster. He also said that the noconditional version was not
vectorized.
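
To illustrate what hoisting means for a benchmark like this, here is a rough sketch. It is not the quick-bench code: kernel, escape, run_naive and run_guarded are hypothetical names, and the empty asm statement is a GCC/Clang extension (not standard C++) that is commonly used as an optimization barrier.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel under test.
inline void kernel(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t n = 0; n < in.size(); ++n) {
        out[n] = in[n] > 0.5f ? 0.8f : 0.1f;
    }
}

// Compiler barrier (GCC/Clang extended asm). It emits no instructions, but
// tells the optimizer that memory reachable through p may be read or written.
inline void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Naive timing loop: the inputs never change between iterations, so an
// optimizer may legally do the work once, or move parts of it out of the loop.
inline void run_naive(const std::vector<float>& in, std::vector<float>& out,
                      int iterations) {
    for (int it = 0; it < iterations; ++it) {
        kernel(in, out);
    }
}

// Guarded timing loop: the barriers make each iteration's input and output
// look observable, so the kernel has to be executed every time.
inline void run_guarded(std::vector<float>& in, std::vector<float>& out,
                        int iterations) {
    for (int it = 0; it < iterations; ++it) {
        escape(in.data());
        kernel(in, out);
        escape(out.data());
    }
}
```

Both loops compute the same result; only the amount of work the compiler is allowed to skip differs, which can make one compiler look faster than another for reasons unrelated to the code being measured.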

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-20 Thread cpphackster at gmail dot com

--- Comment #15 from Daniel Elliott  ---
Good catch, Jonathan, on the return type of max. (PS: I also enjoyed your ACCU
talk on YouTube.)

I have also been messing around with the benchmark a bit and have come to the
conclusion that the sign function and the max aren't really necessary. Simple
ternary operators seem to do a better job.

However, gcc still seems to be slower in these cases compared to clang.

I've attached benchmarkv2, which on my Ivy Bridge 2013 MacBook Pro gets

                   Time         CPU          Iterations
GCC:
ifStandard         600741 ns    600594 ns    1056
ifNoConditional    191043 ns    191000 ns    3694

Clang:
ifStandard          88777 ns     88726 ns    7439
ifNoConditional     89818 ns     89777 ns    7910

Interestingly for the gcc case, if I return float from the when_greater_than
function (which is just doing x > y ? 1 : 0;) then it matches gcc ifStandard
speed exactly, but if I return a float then it goes down to the ~191000 ns
speed shown above. But still not as fast as both clang cases.

Have to say this is a lot of fun and thanks everyone for looking at this!

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-20 Thread cpphackster at gmail dot com

--- Comment #16 from Daniel Elliott  ---
Created attachment 44001
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44001&action=edit
revised benchmark w/different approach

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-20 Thread cpphackster at gmail dot com

--- Comment #17 from Daniel Elliott  ---
My previous comment above meant to say this (change from float to int):

Interestingly for the gcc case, if I return float from the when_greater_than
function (which is just doing x > y ? 1 : 0;) then it matches gcc ifStandard
speed exactly, but if I return an >> int << then it goes down to the ~191000
ns speed shown above. But still not as fast as both clang cases.

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-20 Thread cpphackster at gmail dot com

--- Comment #20 from Daniel Elliott  ---
Cool, just tried that.

It gets gcc down to:

GCC:
---
ifStandard         596892 ns
ifNoConditional    148075 ns  <--- with "result[n] = tab[item > .5f];" trick

Clang: (no change)
ifStandard          88777 ns
ifNoConditional     89818 ns

--

Still, clang is 1.64x faster. I had a look at the assembly. My limited
understanding makes me think that gcc's ucomiss is not fully vectorized and
clang's is (clang's ucomiss %xmm0,%xmm1 vs gcc's ucomiss 0x218b4(%rip),%xmm0).
Feel free to correct me if I am wrong.


clang:

        movss  0x61a80(%r15,%rcx,1),%xmm1
22.95%  xor    %eax,%eax
        ucomiss %xmm0,%xmm1
13.81%  seta   %al
22.55%  mov    0x4335d0(,%rax,4),%eax
 4.31%  mov    %eax,0x61a80(%rbx,%rcx,1)
22.03%  movss  0x61a84(%rbx,%rcx,1),%xmm1
 0.40%  movss  %xmm1,0xc(%rsp)
13.93%  add    $0x4,%rcx
        jne    404b50


gcc:

14.45%  movss  0x0(%r13,%rax,1),%xmm0
 0.18%  xor    %edx,%edx
21.27%  ucomiss 0x218b4(%rip),%xmm0        # 426bf4 <_IO_stdin_used+0x34>
16.84%  seta   %dl
21.79%  movss  0x8(%rsp,%rdx,4),%xmm0
 1.41%  movss  %xmm0,(%r12,%rax,1)
23.94%  add    $0x4,%rax
        cmp    $0x61a80,%rax
        jne    405330
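
For context, the table lookup measured in this comment can be sketched as below. Only the expression result[n] = tab[item > .5f]; is quoted from the thread; the contents of tab and the surrounding select_table function are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the quoted lookup expression.
inline std::vector<float> select_table(const std::vector<float>& in) {
    // Assumed table: index 0 for "not greater", index 1 for "greater".
    static const float tab[2] = {0.1f, 0.8f};

    std::vector<float> result(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        const float item = in[n];
        // The bool converts to 0 or 1 and is used directly as the index.
        result[n] = tab[item > .5f];
    }
    return result;
}
```

This matches the shape of both listings above: a compare, a seta to materialise 0 or 1, and an indexed load from a two-entry table.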

[Bug libstdc++/85466] Performance is slow when doing 'branchless' conditional style math operations

2018-04-21 Thread cpphackster at gmail dot com

--- Comment #22 from Daniel Elliott  ---
(In reply to Marc Glisse from comment #21)
> (In reply to Daniel Elliott from comment #20)
> > still clang is 1.64x faster. had a look at the assembly. My limited
> > understanding makes me think that the ucomiss is not fully vectorized and
> > the clang one is (clangs ucomiss %xmm0,%xmm1 vs gcc's ucomiss
> > 0x218b4(%rip),%xmm0). Feel free to correct me if I am wrong.
> 
> Nothing gets vectorized (likely because of the "dontoptimize" code). The
> ucomiss difference is that llvm keeps the constant .5f in a register, while
> gcc reloads it every time. I don't know if the speed difference comes from
> that, or from some subtle tuning arrangement of the operations (I didn't try
> to understand why llvm has 4 mov where gcc has only 2).

Right. I thought that because it used xmm0, that meant a vector register (and
so vectorized code). I'm going to go and read up on assembly!
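
As a rough illustration of the distinction: the xmm registers are 128 bits wide, but the instructions ending in "ss" (movss, ucomiss, addss) operate on a single float in the lowest lane, while the ones ending in "ps" operate on four floats at once. A hand-vectorized version of the same selection might look like the sketch below. It assumes an x86 target with SSE, select_packed is a hypothetical name, and this is not what either compiler generated for the benchmark.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Writes 0.8f where in[i] > 0.5f and 0.1f elsewhere, four floats per step.
inline void select_packed(const float* in, float* out, std::size_t n) {
    const __m128 threshold = _mm_set1_ps(0.5f);
    const __m128 hi = _mm_set1_ps(0.8f);
    const __m128 lo = _mm_set1_ps(0.1f);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 x = _mm_loadu_ps(in + i);
        // Packed compare: each lane becomes all ones if x > 0.5f, else zero.
        const __m128 mask = _mm_cmpgt_ps(x, threshold);
        // Select per lane: (mask & hi) | (~mask & lo).
        const __m128 r = _mm_or_ps(_mm_and_ps(mask, hi),
                                   _mm_andnot_ps(mask, lo));
        _mm_storeu_ps(out + i, r);
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = in[i] > 0.5f ? 0.8f : 0.1f;
    }
}
```

Packed forms such as cmpps, andps and movups in a disassembly would be the sign of vectorized code; seeing xmm registers alone is not.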