[Bug target/35926] Pushing / Poping ebx without using it.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926 --- Comment #8 from Tony Poppleton 2013-04-10 01:42:20 UTC --- This appears to be fixed in GCC 4.8.0; compiling with just -S and -O3, the asm output is now much simpler:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        andq    $-2, %rsi
        leaq    (%rdi,%rsi), %rax
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
        .section        .note.GNU-stack,"",@progbits
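For reference, the test case being compiled here is the one duplicated in the PR47477 report further down; with the header include added so it builds standalone, it is:

#include <stdint.h>

typedef struct toto_s *toto_t;

toto_t add (toto_t a, toto_t b)
{
    int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b & ~1L);
    return (toto_t)(intptr_t) tmp;
}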
[Bug rtl-optimization/47477] [4.6/4.7/4.8/4.9 regression] Sub-optimal mov at end of method
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477 --- Comment #13 from Tony Poppleton 2013-04-10 01:44:18 UTC --- The test case appears to be fixed in GCC 4.8.0; compiling with just -S and -O3, the asm output is now much simpler:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        andq    $-2, %rsi
        leaq    (%rdi,%rsi), %rax
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
        .section        .note.GNU-stack,"",@progbits

I don't know whether other test cases would still reproduce the original bug, however.
[Bug rtl-optimization/47521] Unnecessary usage of edx register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521 --- Comment #7 from Tony Poppleton 2013-04-10 01:51:36 UTC --- This appears to be fixed with GCC 4.8.0 and flag -O2. The asm code produced is now exactly as Jeff said in comment #3:

        .file   "test.c"
        .text
        .p2align 4,,15
        .globl  foo
        .type   foo, @function
foo:
.LFB0:
        .cfi_startproc
        testb   $16, %dil
        movl    $1, %eax
        cmovne  %edi, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
        .section        .note.GNU-stack,"",@progbits
[Bug target/34653] operation performed unnecessarily in 64-bit mode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653 --- Comment #9 from Tony Poppleton 2013-04-10 02:01:27 UTC --- GCC 4.8.0 with -O2 produces something similar to the original, so the regression noted in comment #7 and comment #8 is now resolved.

        movzbl  (%rdi), %eax
        shrq    $4, %rax
        movq    table(,%rax,8), %rax
        ret

However, the original bug from comment #1 is still present.
[Bug target/35926] Pushing / Poping ebx without using it.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926 --- Comment #5 from Tony Poppleton 2011-01-25 19:33:20 UTC --- I can confirm this still exists on both GCC 4.5.1 and GCC 4.6.0 (20110115), when compiling with -O3.

I did some basic investigation into the files produced by the -fdump-tree-all flag, which shows:

add (struct toto_s * a, struct toto_s * b)
{
  int64_t tmp;
  int D.2686;
  struct toto_s * D.2685;
  long long int D.2684;
  int D.2683;
  int b.1;
  long long int D.2681;
  int a.0;

:
  a.0_2 = (int) a_1(D);
  D.2681_3 = (long long int) a.0_2;
  b.1_5 = (int) b_4(D);
  D.2683_6 = b.1_5 & -2;
  D.2684_7 = (long long int) D.2683_6;
  tmp_8 = D.2681_3 + D.2684_7;
  D.2686_9 = (int) tmp_8;
  D.2685_10 = (struct toto_s *) D.2686_9;
  return D.2685_10;
}

What I don't understand here is the excessive casting: the addition producing tmp_8 is done as (long long int), yet both terms are ultimately derived from (int) variables. Is the cast to (long long int) necessary to deal with an overflow on the addition? If so, why does the final asm code not appear to cater for overflow? Alternatively, could the whole block be simplified down to (int) during this phase of the compile, thereby fixing the subsequent unnecessary usage of BX during the RTL phase (as per comment #3)?

As an aside (possibly another bug report?), it appears there is a regression in 4.6.0 which requires an additional movl compared to what is in the original bug description (4.5.1 does not suffer from this):

        .file   "PR35926.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        movl    12(%esp), %eax
        movl    8(%esp), %ecx
        popl    %ebx
        .cfi_def_cfa_offset 4
        .cfi_restore 3
        andl    $-2, %eax
        addl    %eax, %ecx      < order of regs inverted
        movl    %ecx, %eax      < resulting in unnecessary movl
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.6.0 20110115 (experimental)"
        .section        .note.GNU-stack,"",@progbits
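A minimal sketch of the simplification being asked about (my own illustration, not a transformation GCC is confirmed to perform): if the addition could legitimately be narrowed to pointer width, the function would reduce to the form below, which matches the two-instruction andl/addl sequence that 4.5.1 emits. Whether dropping the int64_t intermediate is valid is exactly the overflow question above.

#include <stdint.h>

typedef struct toto_s *toto_t;

/* Hypothetical narrowed form: do the arithmetic in intptr_t instead of
   int64_t.  On a 32-bit target this is just an andl plus an addl, with
   no need for a second register pair (and hence no push/pop of ebx). */
toto_t add_narrowed (toto_t a, toto_t b)
{
    intptr_t tmp = (intptr_t)a + ((intptr_t)b & ~(intptr_t)1);
    return (toto_t)tmp;
}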
[Bug rtl-optimization/47477] New: [4.6 regression] Sub-optimal mov at end of method
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477

           Summary: [4.6 regression] Sub-optimal mov at end of method
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: tony.popple...@gmail.com
              Host: Linux x86-64

Whilst investigating PR35926, I noticed a slight inefficiency in code generated by 4.6.0 (20110115) versus that of 4.5.1. Duplicating the C code here from that PR for easy reference:

typedef struct toto_s *toto_t;

toto_t add (toto_t a, toto_t b)
{
    int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b & ~1L);
    return (toto_t)(intptr_t) tmp;
}

The asm generated by 4.6.0 with flags -O3 is:

        .file   "PR35926.c"
        .text
        .p2align 4,,15
        .globl  add
        .type   add, @function
add:
.LFB0:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        movl    12(%esp), %eax
        movl    8(%esp), %ecx
        popl    %ebx
        .cfi_def_cfa_offset 4
        .cfi_restore 3
        andl    $-2, %eax
        addl    %eax, %ecx      < order of regs inverted
        movl    %ecx, %eax      < resulting in unnecessary movl
        ret
        .cfi_endproc
.LFE0:
        .size   add, .-add
        .ident  "GCC: (GNU) 4.6.0 20110115 (experimental)"
        .section        .note.GNU-stack,"",@progbits

In 4.5.1, the last bit is one instruction shorter, with just:

        addl    %ecx, %eax
        ret

A bug search revealed the similar-sounding PR44249; however, that is apparently a regression in 4.5 too, whereas this one only affects 4.6.
[Bug target/35926] Pushing / Poping ebx without using it.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926 --- Comment #6 from Tony Poppleton 2011-01-27 16:55:26 UTC --- For the record, the additional movl noticed above in GCC 4.6.0 has been factored out into PR47477
[Bug rtl-optimization/47477] [4.6 regression] Sub-optimal mov at end of method
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477 --- Comment #5 from Tony Poppleton 2011-01-27 17:58:12 UTC --- The modified testcase in comment #4 also fixes the original bug with the redundant push/pop of BX (as described in PR35926), so fixing this during tree optimizations would be good.
[Bug rtl-optimization/46235] inefficient bittest code generation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235 --- Comment #2 from Tony Poppleton 2011-01-28 16:55:48 UTC --- Based on Richard's comment, I tried a modified version of the code, replacing the (1 << x) with just (16). This shows that GCC (4.6 & 4.5.2) does perform an optimization similar to LLVM's, and uses the testb instruction:

        movl    %edi, %eax
        movl    $1, %edx
        testb   $16, %al
        cmove   %edx, %eax
        ret

Therefore, perhaps it would be beneficial not to convert from "a & (n << x)" to "(a >> x) & n" in the special case where the value n is 1 (or potentially a power of 2)?

Incidentally, the above code could have been optimized further to remove the usage of edx entirely (I will make a separate PR about that).
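To make the transformation concrete, here is a small sketch (mine, not taken from the PR) of the two equivalent bit-test forms being discussed, for the n == 1 case:

/* Both functions test bit x of a (assuming 0 <= x < 31); the fold under
   discussion rewrites the first form into the second when the result is
   compared against zero. */
int test_mask_form (int a, int x)
{
    return (a & (1 << x)) != 0;     /* "a & (n << x)" with n == 1 */
}

int test_shift_form (int a, int x)
{
    return ((a >> x) & 1) != 0;     /* "(a >> x) & n" with n == 1 */
}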
[Bug rtl-optimization/46235] inefficient bittest code generation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235 --- Comment #3 from Tony Poppleton 2011-01-28 17:02:50 UTC --- Actually, what I said above isn't correct: had it compiled down to "bt $4, %al" then it would make sense to add that special case, but as it used "testb" it is inconclusive.
[Bug rtl-optimization/47521] New: Unnecessary usage of edx register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

           Summary: Unnecessary usage of edx register
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: tony.popple...@gmail.com

In testing PR46235 I noticed some minor inefficiency in the usage of an extra register. The C code is:

int foo(int a, int x, int y)
{
    if (a & (16))
        return a;
    return 1;
}

Which produces the asm:

        movl    %edi, %eax
        movl    $1, %edx
        testb   $16, %al
        cmove   %edx, %eax
        ret

The above code could have been further optimized to remove the usage of edx:

        movl    $1, %eax
        test    $16, %edi
        cmove   %edi, %eax
        ret
[Bug rtl-optimization/47521] Unnecessary usage of edx register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

Tony Poppleton changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |4.3.5
      Known to fail|                            |4.4.5, 4.5.2, 4.6.0

--- Comment #1 from Tony Poppleton 2011-01-28 17:23:19 UTC --- I probably meant "testb $16, %dil" above... GCC 4.3.5 avoids the usage of edx, although it too probably has 1 instruction too many:

        testb   $16, %dil
        movl    $1, %eax
        cmove   %eax, %edi
        movl    %edi, %eax
        ret
[Bug rtl-optimization/46235] inefficient bittest code generation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235 --- Comment #4 from Tony Poppleton 2011-01-28 18:08:15 UTC --- As a quick test, I commented out the block with the following comment in fold-const.c:

  /* If this is an EQ or NE comparison with zero and ARG0 is
     (1 << foo) & bar, convert it to (bar >> foo) & 1.  Both require
     two operations, but the latter can be done in one less insn
     on machines that have only two-operand insns or on which a
     constant cannot be the first operand.  */

This produces the following asm code (using a modified GCC 4.6.0 20110122):

        movl    $1, %edx
        movl    %edi, %eax
        movl    %esi, %ecx
        movl    %edx, %edi
        sall    %cl, %edi
        testl   %eax, %edi
        cmove   %edx, %eax
        ret

So whilst I was hoping for an easy quick fix, it appears that the required optimization to convert this into a "btl" test isn't there later on in the compile.

Incidentally, from looking at http://gmplib.org/~tege/x86-timing.pdf, it appears that "bt" is slow on the P4 architecture (8 cycles, if I am reading it correctly, which sounds slow), so the LLVM code in the bug description isn't necessarily an optimization on that arch. Newer chips would probably still benefit though.
[Bug rtl-optimization/47521] [Regression 4.4/4.5/4.6] Unnecessary usage of edx register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

Tony Poppleton changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Unnecessary usage of edx    |[Regression 4.4/4.5/4.6]
                   |register                    |Unnecessary usage of edx
                   |                            |register

--- Comment #2 from Tony Poppleton 2011-01-29 08:41:47 UTC --- Changing bug title to include regression, as 4.3.5 is able to avoid the usage of edx.
[Bug target/35926] Pushing / Poping ebx without using it.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926

Tony Poppleton changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2008-12-28 06:57:47         |2011-02-01 13:13
      Known to fail|                            |4.1.2, 4.3.5, 4.4.5, 4.5.2,
                   |                            |4.6.0
           Severity|normal                      |enhancement

--- Comment #7 from Tony Poppleton 2011-02-01 13:44:28 UTC --- Set the "known to fail" and "last reconfirmed" fields, and changed the severity to "enhancement".
[Bug tree-optimization/47555] [4.4 Regression] Huge memory usage when optimizing
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47555 --- Comment #6 from Tony Poppleton 2011-02-01 14:45:28 UTC --- Out of interest, could this parameter of 20 be automatically tuned based on the available RAM? For systems with a lot of RAM, it might make sense to increase the parameter above 20 (assuming this produces better code in the end). Whilst users could override it using the flag mentioned in comment #3, auto-detecting this parameter would make things easier for the end user.
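Purely as an illustration of the kind of auto-detection meant here (this is not GCC code; the function name, threshold and scaling are made up, and it relies on the glibc sysconf extensions):

#include <unistd.h>

/* Sketch: pick a larger value for the memory-sensitive parameter on
   machines with plenty of physical RAM, falling back to the current
   default of 20 mentioned above. */
static long pick_param (void)
{
    long pages = sysconf (_SC_PHYS_PAGES);
    long page_size = sysconf (_SC_PAGE_SIZE);
    long param = 20;                        /* current default */

    if (pages > 0 && page_size > 0)
    {
        long ram_mb = (pages / 1024) * (page_size / 1024);
        if (ram_mb >= 8 * 1024)             /* 8 GB or more: hypothetical */
            param = 40;                     /* hypothetical larger value  */
    }
    return param;
}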
[Bug target/34653] operation performed unnecessarily in 64-bit mode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653

Tony Poppleton changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.02.01 16:45:31
     Ever Confirmed|0                           |1
      Known to fail|                            |4.3.5, 4.4.5, 4.5.2, 4.6.0

--- Comment #8 from Tony Poppleton 2011-02-01 16:45:31 UTC --- Confirmed that both the example in the description and the example in comment #1 apply to GCC 4.3.5, 4.4.5, 4.5.2 and 4.6.0 (20110129).

Also confirmed the regression noted in comment #7, where an extra register is used (ecx), resulting in an additional mov instruction. This regression is present in versions 4.4.5, 4.5.2 and 4.6.0 (20110129). It could possibly be related to PR47521, which also first appeared in 4.4.x.
[Bug rtl-optimization/47581] New: [4.6 regression] Unnecessary adjustments to stack pointer
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47581

           Summary: [4.6 regression] Unnecessary adjustments to stack
                    pointer
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: tony.popple...@gmail.com

Whilst investigating PR4079 (which affects PPC), I found some strange adjustments to the stack pointer when compiling with 4.6.0 (20110129) on x86. For reference, the C code from that PR is:

unsigned mulh(unsigned a, unsigned b)
{
    return ((unsigned long long)a * (unsigned long long)b) >> 32;
}

On 4.5.2, using "-O2 -m32 -fomit-frame-pointer", this produced the following succinct code:

mulh:
        movl    8(%esp), %eax
        mull    4(%esp)
        movl    %edx, %eax
        ret
        .size   mulh, .-mulh
        .ident  "GCC: (GNU) 4.5.2"

However on 4.6.0 with the same arguments:

mulh:
.LFB0:
        .cfi_startproc
        subl    $4, %esp        <== isn't this unnecessary?
        .cfi_def_cfa_offset 8
        movl    12(%esp), %eax  <== this could just be 8(%esp)
        mull    8(%esp)         <== this could just be 4(%esp)
        addl    $4, %esp        <== isn't this unnecessary?
        .cfi_def_cfa_offset 4
        movl    %edx, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   mulh, .-mulh
        .ident  "GCC: (GNU) 4.6.0 20110129 (experimental)"
[Bug rtl-optimization/47582] New: Combine chains of movl into movq
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

           Summary: Combine chains of movl into movq
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: tony.popple...@gmail.com

The following C code (adapted from
http://stackoverflow.com/questions/4544804/in-what-cases-should-i-use-memcpy-over-standard-operators-in-c)
shows that adjacent sequences of movl could be combined into movq.

extern float a[5];
extern float b[5];

int main()
{
#if defined(M1)
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    a[4] = b[4];
#elif defined(M2)
    memcpy(a, b, 5*sizeof(float));
#endif
}

When compiled with "-O2 -fomit-frame-pointer" on GCC 4.3.5, 4.4.5, 4.5.2 and 4.6.0 (20110129), the following asm is produced for the -DM1 branch:

        movl    b(%rip), %eax
        movl    %eax, a(%rip)
        movl    b+4(%rip), %eax
        movl    %eax, a+4(%rip)
        movl    b+8(%rip), %eax
        movl    %eax, a+8(%rip)
        movl    b+12(%rip), %eax
        movl    %eax, a+12(%rip)
        movl    b+16(%rip), %eax
        movl    %eax, a+16(%rip)
        ret

However for the -DM2 branch, the memcpy implementation shows that this can be done more efficiently:

        movq    b(%rip), %rax
        movq    %rax, a(%rip)
        movq    b+8(%rip), %rax
        movq    %rax, a+8(%rip)
        movl    b+16(%rip), %eax
        movl    %eax, a+16(%rip)
        ret

I presume the memcpy is being done in hand-written asm? If so, then once this enhancement is done, presumably that portion of the memcpy code could be converted to C and be just as efficient.
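As an illustration of what the combined form corresponds to at the source level (my own sketch, assuming memcpy of 8- and 4-byte chunks is an acceptable way to express the wider copies):

#include <string.h>

extern float a[5];
extern float b[5];

/* Copy the 20 bytes as two 8-byte chunks and one 4-byte chunk, mirroring
   the movq/movq/movl sequence produced for the -DM2 branch above. */
void copy_as_wide_chunks (void)
{
    memcpy ((char *) a,      (const char *) b,      8);
    memcpy ((char *) a + 8,  (const char *) b + 8,  8);
    memcpy ((char *) a + 16, (const char *) b + 16, 4);
}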
[Bug rtl-optimization/47521] Unnecessary usage of edx register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521 --- Comment #5 from Tony Poppleton 2011-02-03 14:16:01 UTC --- As a quick test, would this be fixed by re-ordering the register file to move eax above edx?

If so, then another possible fix would be to effectively re-run the RA pass multiple times, each time using a differently ordered register file, and then select the attempt that produces the "best" code, discarding the others. The register files would only differ in their sort order when register costs are equal. I am guessing that only a few such register files would be needed (in particular, ones where eax is shuffled around), rather than every single possible combination of sort orders (which would be prohibitive), so this doesn't necessarily have to lengthen compilation by much.

A metric would also be needed to select the "best" version of the compiled code - possibly the number of instructions and the number of registers used?
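As a rough sketch of the selection metric suggested above (entirely hypothetical; these types and weights are illustrative, not GCC internals):

/* Hypothetical summary of one register-allocation attempt, and a
   comparison that prefers fewer instructions, then fewer registers
   as a tie-break. */
struct ra_attempt
{
    int insn_count;   /* instructions in the resulting code  */
    int regs_used;    /* distinct hard registers used        */
};

static int ra_better (const struct ra_attempt *x, const struct ra_attempt *y)
{
    if (x->insn_count != y->insn_count)
        return x->insn_count < y->insn_count;
    return x->regs_used < y->regs_used;
}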
[Bug rtl-optimization/47582] Combine chains of movl into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

Tony Poppleton changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2017-07-22
      Known to fail|                            |5.1.1, 6.3.1, 7.1.1

--- Comment #4 from Tony Poppleton --- Retesting this with GCC 7.1.1 20170622 (Red Hat 7.1.1-3) shows that the movs are still not being combined, even though dependency #23684 is marked as fixed.

The -DM1 branch produces:

.LFB0:
        .cfi_startproc
        movss   b(%rip), %xmm0
        xorl    %eax, %eax
        movss   %xmm0, a(%rip)
        movss   b+4(%rip), %xmm0
        movss   %xmm0, a+4(%rip)
        movss   b+8(%rip), %xmm0
        movss   %xmm0, a+8(%rip)
        movss   b+12(%rip), %xmm0
        movss   %xmm0, a+12(%rip)
        movss   b+16(%rip), %xmm0
        movss   %xmm0, a+16(%rip)
        ret
        .cfi_endproc

Whilst the -DM2 branch produces an even better result than on the previous GCC versions I have tested:

.LFB0:
        .cfi_startproc
        movl    b+16(%rip), %eax
        movdqu  b(%rip), %xmm0
        movl    %eax, a+16(%rip)
        xorl    %eax, %eax
        movups  %xmm0, a(%rip)
        ret
[Bug rtl-optimization/47582] Combine chains of movl into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

Tony changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
   Last reconfirmed|2017-07-22 00:00:00         |
      Known to work|                            |7.1.1
         Resolution|---                         |FIXED
      Known to fail|7.1.1                       |

--- Comment #6 from Tony --- Excellent, many thanks
[Bug rtl-optimization/47582] Combine chains of movl into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582 --- Comment #2 from Tony Poppleton --- Re-testing this with GCC 5.1, the code appears to be even less efficient, for both cases.

DM1:

.LFB0:
        .cfi_startproc
        movss   b(%rip), %xmm0
        xorl    %eax, %eax
        movss   %xmm0, a(%rip)
        movss   b+4(%rip), %xmm0
        movss   %xmm0, a+4(%rip)
        movss   b+8(%rip), %xmm0
        movss   %xmm0, a+8(%rip)
        movss   b+12(%rip), %xmm0
        movss   %xmm0, a+12(%rip)
        movss   b+16(%rip), %xmm0
        movss   %xmm0, a+16(%rip)
        ret
        .cfi_endproc

DM2:

.LFB0:
        .cfi_startproc
        movq    b(%rip), %rax
        movq    %rax, a(%rip)
        movq    b+8(%rip), %rax
        movq    %rax, a+8(%rip)
        movl    b+16(%rip), %eax
        movl    %eax, a+16(%rip)
        xorl    %eax, %eax
        ret
        .cfi_endproc

Why is the "xorl" appearing in both cases? Should this be logged as a separate bug?

Incidentally, compiling with -O1 produces the same code as -O2 did on older GCCs (as in the description comment above).

My total guess is that the xorl is due to a and b not having any initial values, and an optimization that takes value ranges into account is getting confused.
[Bug rtl-optimization/47582] Combine chains of movl into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582 --- Comment #3 from Tony Poppleton --- Ignore the last comment - I hadn't spotted the "int" return value on main... So the code is actually more correct than in previous versions, and there is no change to the status of this bug.