[Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-13 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



             Bug #: 56309
           Summary: -O3 optimizer generates conditional moves instead of
                    compare and branch resulting in almost 2x slower code
    Classification: Unclassified
           Product: gcc
           Version: 4.7.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: arturo...@gmail.com





Created attachment 29442

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29442

Self contained source file with parameter x passed by value (slow)



This bug report reflects the analysis of a question asked on Stack Overflow:



http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939



When an unsigned long long parameter to a function is passed by reference
instead of by value, the result is an almost 2x improvement in speed when
compiled with -O3.  Given that the function is inlined, this is unexpected.
On closer inspection, the generated code turns out to be quite different, as
if passing the parameter by value enables an optimization (use of x86
conditional moves) that backfires, possibly because of an unexpected stall in
the processor.



Two files are attached



by-val-O3.ii

by-ref-O3.ii



They differ only in the way the unsigned long long parameter "x" is passed.



./by-val-O3

Took 11.85 seconds total.



./by-ref-O3

Took 6.67 seconds total.


[Bug middle-end/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-13 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #1 from arturomdn at gmail dot com 2013-02-13 20:22:51 UTC ---

Created attachment 29443

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29443

Self contained source file with parameter x passed by reference (fast)



This source file differs from by-val-O3.ii only in the way parameter "x" is

passed.


[Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-13 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #3 from arturomdn at gmail dot com 2013-02-13 20:29:12 UTC ---

Intel Xeon X5570 @ 2.93GHz



(In reply to comment #2)

> Which target is this on?  On some (most non x86 targets) conditional moves are

> faster than compare and branch.


[Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-14 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #8 from arturomdn at gmail dot com 2013-02-14 15:53:15 UTC ---

It is possible (just a guess) that the extra compare is causing an interlock in

the processor since the first cmov is issued speculatively and the condition

won't be confirmed until the first compare has executed.  Someone from Intel

could tell us exactly why the original sequence is so disastrous and suggest an

alternative that still uses cmov and is better than jmp.



I wonder whether, instead of emitting this sequence

   shr    $0x20,%rdi
   and    $0xffffffff,%ecx
   cmp    %r8,%rdx
   cmovbe %r11,%rdi
   add    $0x1,%rax
   cmp    %r8,%rdx
   cmovbe %rdx,%rcx

the compiler could emit this one, with a single compare feeding both cmovs

   shr    $0x20,%rdi
   and    $0xffffffff,%ecx
   add    $0x1,%rax
   cmp    %r8,%rdx
   cmovbe %r11,%rdi
   cmovbe %rdx,%rcx
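
The single-compare sequence can be read as two selects driven by one
comparison.  The following is only a rough C++ model of that reading, not
compiler output.  It assumes that %r11 holds zero and that %r8 holds
imax - 1, neither of which is visible in the excerpt; the names select_once,
CarryLow and limit are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical result pair: the carry and the low 32 bits of tmp.
struct CarryLow {
    std::uint64_t carry;
    std::uint64_t low;
};

// Model of "compare once, then two conditional moves".
// Assumption: limit corresponds to %r8 and the zero to %r11.
CarryLow select_once(std::uint64_t tmp, std::uint64_t limit) {
    std::uint64_t carry = tmp >> 32;            // shr $0x20
    std::uint64_t low = tmp & 0xffffffffULL;    // and $0xffffffff
    const bool below_or_equal = tmp <= limit;   // cmp, evaluated once
    carry = below_or_equal ? 0 : carry;         // first cmovbe
    low = below_or_equal ? tmp : low;           // second cmovbe
    return {carry, low};
}
```

Whether a compiler turns these ternaries into cmov or into a branch depends on
the target and on its cost model, so this shows the intended semantics only.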


[Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-14 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #9 from arturomdn at gmail dot com 2013-02-14 16:00:49 UTC ---

I found in the Intel optimization guide an example of this idiom of comparing

once and issuing two cmov back-to-back... so the problem isn't the two cmov,

but possibly introducing the 2nd compare that splits them.



not_equal:

   movzx eax, BYTE PTR[esi+edx]

   movzx edx, BYTE PTR[edi+edx]

   cmp eax, edx

   cmova eax, ONE

   cmovb eax, NEG_ONE

   jmp ret_tag



Taken from example 10-16 of

http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf


[Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-14 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #10 from arturomdn at gmail dot com 2013-02-14 16:43:23 UTC ---

It might be worth mentioning here what I said in the Stack Overflow answer: in
this particular case the entire conditional branch can be avoided because it is
redundant.



This code



if (tmp >= imax) {
    carry = tmp >> numbits;  // <-- A
    tmp &= imax - 1;         // <-- B
} else {
    carry = 0;               // <-- C
}



can be reduced to



carry = tmp >> numbits;

tmp &= imax - 1;



Proof:

1) numbits is 32

2) imax is 1ULL << 32 so lower 32 bits are zero

3) imax - 1 is 0xFFFFFFFF (see 2)

4) if tmp >= imax then tmp has bits set in upper 32 bits

5) otherwise if tmp < imax then tmp does not have bits set in upper 32 bits

6) statement "A" is equivalent to "C" when tmp < imax (because of 5)

7) statement "B" is a NOP when tmp < imax (because of 3 and 5)
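
The argument above can also be checked mechanically.  The following is a
minimal sketch, assuming numbits is 32 and imax is 1ULL << 32 as stated in the
proof; the helper names (Split, split_branchy, split_reduced, forms_agree) are
made up for illustration and do not come from the attached sources.

```cpp
using ull = unsigned long long;

constexpr unsigned numbits = 32;
constexpr ull imax = 1ULL << numbits;

// Hypothetical result pair: the carry and the updated tmp.
struct Split {
    ull carry;
    ull tmp;
};

// Original form, with the conditional.
Split split_branchy(ull tmp) {
    ull carry;
    if (tmp >= imax) {
        carry = tmp >> numbits;  // statement A
        tmp &= imax - 1;         // statement B
    } else {
        carry = 0;               // statement C
    }
    return {carry, tmp};
}

// Reduced form, without the conditional.
Split split_reduced(ull tmp) {
    return {tmp >> numbits, tmp & (imax - 1)};
}

// True when both forms produce the same carry and the same tmp.
bool forms_agree(ull tmp) {
    const Split a = split_branchy(tmp);
    const Split b = split_reduced(tmp);
    return a.carry == b.carry && a.tmp == b.tmp;
}
```

Calling forms_agree on values around the boundary (imax - 1, imax, imax + 1)
and on a few values with high bits set exercises both sides of the original
condition.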


[Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-14 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #14 from arturomdn at gmail dot com 2013-02-14 17:30:54 UTC ---

I also did the experiment, with the same results: it got faster, but not as
fast as the version with a conditional branch instead of conditional moves:



./by-ref-O3

Took 6.65 seconds total.



./by-val-O3 

Took 11.64 seconds total.



./by-val-fixed-O3 

Took 9.94 seconds total.



--- by-val-O3.s 2013-02-14 11:27:28.109856000 -0600
+++ by-val-fixed-O3.s   2013-02-14 11:11:07.312317000 -0600
@@ -679,13 +679,12 @@
        shrq    $32, %rdi
        .loc 4 25 0
        andl    $4294967295, %ecx
+       addq    $1, %rax
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
 .LVL43:
        .loc 4 29 0
-       addq    $1, %rax
 .LVL44:
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
 .LVL45:
        .loc 4 21 0


[Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code

2013-02-14 Thread arturomdn at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309



--- Comment #16 from arturomdn at gmail dot com 2013-02-14 17:42:55 UTC ---

With -ftree-vectorize -fno-tree-loop-if-convert flags it generated this for the

loop in question:



.L39:
        movq    %rdi, %rdx
        addq    (%rsi,%rax,8), %rcx
        imulq   (%r9,%rax,8), %rdx
        addq    %rcx, %rdx
        xorl    %ecx, %ecx
        cmpq    %r10, %rdx
        jbe     .L38
        movq    %rdx, %rcx
        andl    $4294967295, %edx
        shrq    $32, %rcx
.L38:
        addq    $1, %rax
        cmpq    %r8, %rax
        movq    %rdx, -8(%rsi,%rax,8)
        jne     .L39



And it executed fast:



./by-val-O3-flags

Took 6.74 seconds total.
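
For readers who prefer C++ to assembly, the branchy loop above corresponds
roughly to the sketch below.  This is my reading of the listing, not the
attached source: the names mul_add_limbs, acc, src and n are hypothetical, and
it assumes that %r10 holds imax - 1 (0xffffffff), that %r8 holds the element
count, and that the carry is zero on entry.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed mapping: acc ~ (%rsi), src ~ (%r9), x ~ %rdi, n ~ %r8.
// The final carry is dropped here; the listing does not show what
// happens to it after the loop.
void mul_add_limbs(std::uint64_t* acc, const std::uint64_t* src,
                   std::uint64_t x, std::size_t n) {
    const std::uint64_t limit = 0xffffffffULL;  // assumed value of %r10
    std::uint64_t carry = 0;                    // assumed initial %rcx
    for (std::size_t i = 0; i < n; ++i) {
        // addq (%rsi,%rax,8), %rcx ; imulq (%r9,%rax,8), %rdx ; addq %rcx, %rdx
        std::uint64_t tmp = x * src[i] + acc[i] + carry;
        carry = 0;                              // xorl %ecx, %ecx
        if (tmp > limit) {                      // cmpq %r10, %rdx ; jbe .L38
            carry = tmp >> 32;                  // shrq $32, %rcx
            tmp &= 0xffffffffULL;               // andl $4294967295, %edx
        }
        acc[i] = tmp;                           // movq %rdx, -8(%rsi,%rax,8)
    }
}
```

The assembly loop tests its exit condition at the bottom, so it presumably
runs only when n is at least 1; the for loop here also accepts n == 0.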


[Bug c++/60702] thread_local initialization

2014-04-02 Thread arturomdn at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60702

arturomdn at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arturomdn at gmail dot com

--- Comment #4 from arturomdn at gmail dot com ---
Created attachment 32528
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32528&action=edit
Smaller testcase that reproduces the problem

Clang had the same problem.  This smaller test case was submitted to the Clang
team, who identified it as a duplicate of a recently fixed bug:

http://llvm.org/bugs/show_bug.cgi?id=19254

Which was fixed as follows

http://llvm.org/viewvc/llvm-project?view=revision&revision=204869

With the following comment:

PR19254: If a thread_local data member of a class is accessed via member access
syntax, don't forget to run its initializer.