[Bug middle-end/56309] New: -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

             Bug #: 56309
            Summary: -O3 optimizer generates conditional moves instead of
                     compare and branch resulting in almost 2x slower code
     Classification: Unclassified
            Product: gcc
            Version: 4.7.2
             Status: UNCONFIRMED
           Severity: normal
           Priority: P3
          Component: middle-end
         AssignedTo: unassig...@gcc.gnu.org
         ReportedBy: arturo...@gmail.com

Created attachment 29442
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29442
Self-contained source file with parameter x passed by value (slow)

This bug report reflects the analysis of a question asked on Stack Overflow:
http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939

When an unsigned long long parameter to a function is passed by reference
instead of by value, the result is a dramatic, almost 2x improvement in speed
when compiled with -O3. Given that the function is inlined, this is
unexpected. Closer inspection shows that the generated code is quite
different, as if passing the parameter by value enables an optimization (use
of x86 conditional moves) that backfires, possibly by suffering an unexpected
stall in the processor.

Two files are attached:

by-val-O3.ii
by-ref-O3.ii

They differ only in the way the unsigned long long parameter "x" is passed.

./by-val-O3
Took 11.85 seconds total.
./by-ref-O3
Took 6.67 seconds total.
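The exact testcase lives in the attachments; as a rough sketch (all names here
are invented, not from the attachments), the hot loop has the shape of a
schoolbook bignum multiply-accumulate where the 64-bit parameter "x" feeds a
carry-propagating reduction, and the if/else below is what -O3 turns into the
cmov sequence discussed in later comments:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reconstruction of the hot loop's shape (names invented).
// Each vector slot holds a 32-bit limb in a 64-bit word; the product plus
// carry is split back into a low limb and a carry for the next iteration.
void mul_add(std::vector<uint64_t>& limbs, const uint64_t x) {
    const uint64_t imax = 1ULL << 32;
    uint64_t carry = 0;
    for (std::size_t i = 0; i < limbs.size(); ++i) {
        uint64_t tmp = limbs[i] * (x & (imax - 1)) + carry;
        if (tmp >= imax) {       // the branch the report says becomes cmovs
            carry = tmp >> 32;
            tmp &= imax - 1;
        } else {
            carry = 0;
        }
        limbs[i] = tmp;
    }
}
```

This is only meant to make the later assembly discussions easier to follow,
not to reproduce the attached source.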
--- Comment #1 from arturomdn at gmail dot com 2013-02-13 20:22:51 UTC ---
Created attachment 29443
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29443
Self-contained source file with parameter x passed by reference (fast)

This source file differs from by-val-O3.ii only in the way parameter "x" is
passed.
[Bug target/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
--- Comment #3 from arturomdn at gmail dot com 2013-02-13 20:29:12 UTC ---
Intel Xeon X5570 @ 2.93GHz

(In reply to comment #2)
> Which target is this on? On some (most non-x86) targets conditional moves
> are faster than compare and branch.
--- Comment #8 from arturomdn at gmail dot com 2013-02-14 15:53:15 UTC ---
It is possible (just a guess) that the extra compare is causing an interlock
in the processor, since the first cmov is issued speculatively and the
condition won't be confirmed until the first compare has executed. Someone
from Intel could tell us exactly why the original sequence is so disastrous
and suggest an alternative that still uses cmov and is better than jmp.

I wonder if, instead of emitting this sequence

        shr    $0x20,%rdi
        and    $0xffffffff,%ecx
        cmp    %r8,%rdx
        cmovbe %r11,%rdi
        add    $0x1,%rax
        cmp    %r8,%rdx
        cmovbe %rdx,%rcx

it would do this instead:

        shr    $0x20,%rdi
        and    $0xffffffff,%ecx
        add    $0x1,%rax
        cmp    %r8,%rdx
        cmovbe %r11,%rdi
        cmovbe %rdx,%rcx
--- Comment #9 from arturomdn at gmail dot com 2013-02-14 16:00:49 UTC ---
I found in the Intel optimization guide an example of this idiom of comparing
once and issuing two cmov back-to-back... so the problem isn't the two cmov,
but possibly introducing the 2nd compare that splits them.

not_equal:
        movzx  eax, BYTE PTR[esi+edx]
        movzx  edx, BYTE PTR[edi+edx]
        cmp    eax, edx
        cmova  eax, ONE
        cmovb  eax, NEG_ONE
        jmp    ret_tag

Taken from Example 10-16 of
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
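The compare-once/two-cmov idiom of Example 10-16 corresponds roughly to the
following C++ sketch (ONE and NEG_ONE in the manual's listing are registers
preloaded with +1 and -1; the function name here is invented):

```cpp
#include <cassert>

// Rough C++ rendering of the Example 10-16 idiom: a single compare sets the
// flags once, and the two if-assignments can then be lowered to back-to-back
// cmova/cmovb with no intervening compare -- the shape this comment argues
// is fine, as opposed to splitting the cmovs with a second cmp.
int compare_bytes(unsigned char a, unsigned char b) {
    int result = 0;
    if (a > b) result = 1;    // cmova: above
    if (a < b) result = -1;   // cmovb: below
    return result;
}
```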
--- Comment #10 from arturomdn at gmail dot com 2013-02-14 16:43:23 UTC ---
Might be worth mentioning here what I said in the Stack Overflow answer: in
this particular case the entire conditional branch can be avoided because it
is redundant. This code

    if (tmp >= imax) {
        carry = tmp >> numbits;  // <-- A
        tmp &= imax - 1;         // <-- B
    } else {
        carry = 0;               // <-- C
    }

can be reduced to

    carry = tmp >> numbits;
    tmp &= imax - 1;

Proof:
1) numbits is 32
2) imax is 1ULL << 32, so its lower 32 bits are zero
3) imax - 1 is 0xFFFFFFFF (see 2)
4) if tmp >= imax then tmp has bits set in the upper 32 bits
5) otherwise, if tmp < imax, then tmp does not have bits set in the upper 32 bits
6) statement "A" is equivalent to "C" when tmp < imax (because of 5)
7) statement "B" is a NOP when tmp < imax (because of 3 and 5)
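The claimed equivalence can also be checked mechanically. A small self-check
under the stated assumptions (numbits == 32, imax == 1ULL << 32; function
names invented):

```cpp
#include <cassert>
#include <cstdint>

// Branchy version from the comment above: A/B run only when tmp >= imax,
// otherwise C zeroes the carry.
uint64_t reduce_branchy(uint64_t tmp, uint64_t* carry) {
    const uint64_t imax = 1ULL << 32;
    if (tmp >= imax) {
        *carry = tmp >> 32;   // A
        tmp &= imax - 1;      // B
    } else {
        *carry = 0;           // C
    }
    return tmp;
}

// Branchless reduction proposed in the comment: always split tmp into its
// upper 32 bits (carry) and lower 32 bits (new tmp).
uint64_t reduce_branchless(uint64_t tmp, uint64_t* carry) {
    *carry = tmp >> 32;
    return tmp & 0xFFFFFFFFULL;
}
```

For tmp < imax the upper 32 bits are zero, so both versions produce carry == 0
and leave tmp untouched; for tmp >= imax they run the same two operations.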
[Bug tree-optimization/56309] -O3 optimizer generates conditional moves instead of compare and branch resulting in almost 2x slower code
--- Comment #14 from arturomdn at gmail dot com 2013-02-14 17:30:54 UTC ---
I also did the experiment, with the same results... it got faster, but not as
fast as the version with a conditional branch instead of conditional moves:

./by-ref-O3
Took 6.65 seconds total.
./by-val-O3
Took 11.64 seconds total.
./by-val-fixed-O3
Took 9.94 seconds total.

--- by-val-O3.s       2013-02-14 11:27:28.109856000 -0600
+++ by-val-fixed-O3.s 2013-02-14 11:11:07.312317000 -0600
@@ -679,13 +679,12 @@
        shrq    $32, %rdi
        .loc 4 25 0
        andl    $4294967295, %ecx
+       addq    $1, %rax
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
 .LVL43:
        .loc 4 29 0
-       addq    $1, %rax
 .LVL44:
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
 .LVL45:
        .loc 4 21 0
--- Comment #16 from arturomdn at gmail dot com 2013-02-14 17:42:55 UTC ---
With the -ftree-vectorize -fno-tree-loop-if-convert flags it generated this
for the loop in question:

.L39:
        movq    %rdi, %rdx
        addq    (%rsi,%rax,8), %rcx
        imulq   (%r9,%rax,8), %rdx
        addq    %rcx, %rdx
        xorl    %ecx, %ecx
        cmpq    %r10, %rdx
        jbe     .L38
        movq    %rdx, %rcx
        andl    $4294967295, %edx
        shrq    $32, %rcx
.L38:
        addq    $1, %rax
        cmpq    %r8, %rax
        movq    %rdx, -8(%rsi,%rax,8)
        jne     .L39

And it executed fast:

./by-val-O3-flags
Took 6.74 seconds total.
[Bug c++/60702] thread_local initialization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60702

arturomdn at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |arturomdn at gmail dot com

--- Comment #4 from arturomdn at gmail dot com ---
Created attachment 32528
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32528&action=edit
Smaller testcase that reproduces the problem

clang had the same problem; this smaller test was submitted to the clang team
and they identified it as a duplicate of a recently fixed bug:
http://llvm.org/bugs/show_bug.cgi?id=19254

which was fixed as follows:
http://llvm.org/viewvc/llvm-project?view=revision&revision=204869

with the following comment:
PR19254: If a thread_local data member of a class is accessed via member
access syntax, don't forget to run its initializer.
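The attached testcase isn't reproduced above; a minimal sketch of the pattern
PR19254 describes (names invented, not taken from attachment 32528): a
thread_local static data member with a dynamic initializer, accessed through
member-access syntax rather than S::tl, which buggy compilers lowered without
running the initializer first.

```cpp
#include <cassert>
#include <string>

// Sketch of the PR19254 pattern: std::string gives the thread_local member a
// dynamic (non-constant) initializer, so every access must go through the
// initialization check.
struct S {
    static thread_local std::string tl;
};
thread_local std::string S::tl = "initialized";

std::string get() {
    S s;
    return s.tl;  // member-access syntax must still run the initializer
}
```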