[Bug target/35926] Pushing / Poping ebx without using it.

2013-04-09 Thread tony.poppleton at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926



--- Comment #8 from Tony Poppleton  2013-04-10 
01:42:20 UTC ---

This appears to be fixed in GCC 4.8.0; compiling with just -S and -O3, the asm
output is now much simpler:



.file   "test.c"
.text
.p2align 4,,15
.globl  add
.type   add, @function
add:
.LFB0:
.cfi_startproc
andq    $-2, %rsi
leaq    (%rdi,%rsi), %rax
ret
.cfi_endproc
.LFE0:
.size   add, .-add
.ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
.section        .note.GNU-stack,"",@progbits


[Bug rtl-optimization/47477] [4.6/4.7/4.8/4.9 regression] Sub-optimal mov at end of method

2013-04-09 Thread tony.poppleton at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477



--- Comment #13 from Tony Poppleton  
2013-04-10 01:44:18 UTC ---

The test case appears to be fixed in GCC 4.8.0; compiling with just -S and -O3,
the asm output is now much simpler:



.file   "test.c"
.text
.p2align 4,,15
.globl  add
.type   add, @function
add:
.LFB0:
.cfi_startproc
andq    $-2, %rsi
leaq    (%rdi,%rsi), %rax
ret
.cfi_endproc
.LFE0:
.size   add, .-add
.ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
.section        .note.GNU-stack,"",@progbits



I don't know whether other test cases would still reproduce the original bug,
however.


[Bug rtl-optimization/47521] Unnecessary usage of edx register

2013-04-09 Thread tony.poppleton at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521



--- Comment #7 from Tony Poppleton  2013-04-10 
01:51:36 UTC ---

This appears to be fixed with GCC 4.8.0 and flag -O2.  The asm code produced is
now exactly as Jeff said in comment #3:



.file   "test.c"
.text
.p2align 4,,15
.globl  foo
.type   foo, @function
foo:
.LFB0:
.cfi_startproc
testb   $16, %dil
movl    $1, %eax
cmovne  %edi, %eax
ret
.cfi_endproc
.LFE0:
.size   foo, .-foo
.ident  "GCC: (GNU) 4.8.0 20130316 (Red Hat 4.8.0-0.17)"
.section        .note.GNU-stack,"",@progbits


[Bug target/34653] operation performed unnecessarily in 64-bit mode

2013-04-09 Thread tony.poppleton at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653



--- Comment #9 from Tony Poppleton  2013-04-10 
02:01:27 UTC ---

GCC 4.8.0 with -O2 produces something similar to the original, so the
regression noted in comment #7 and comment #8 is now resolved.



movzbl  (%rdi), %eax
shrq    $4, %rax
movq    table(,%rax,8), %rax
ret



However the original bug from comment #1 is still present.


[Bug target/35926] Pushing / Poping ebx without using it.

2011-01-25 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926

--- Comment #5 from Tony Poppleton  2011-01-25 
19:33:20 UTC ---
I can confirm this still exists on both GCC 4.5.1 and GCC 4.6.0 (20110115),
when compiling with -O3.

I did some basic investigation into the files produced by the -fdump-tree-all
flag, which shows:

add (struct toto_s * a, struct toto_s * b)
{
  int64_t tmp;
  int D.2686;
  struct toto_s * D.2685;
  long long int D.2684;
  int D.2683;
  int b.1;
  long long int D.2681;
  int a.0;

<bb 2>:
  a.0_2 = (int) a_1(D);
  D.2681_3 = (long long int) a.0_2;
  b.1_5 = (int) b_4(D);
  D.2683_6 = b.1_5 & -2;
  D.2684_7 = (long long int) D.2683_6;
  tmp_8 = D.2681_3 + D.2684_7;
  D.2686_9 = (int) tmp_8;
  D.2685_10 = (struct toto_s *) D.2686_9;
  return D.2685_10;
}

What I don't understand here is the excessive casting: the addition producing
tmp_8 is done as (long long int), yet both terms are ultimately derived from
(int) variables.  Is the cast to (long long int) necessary to deal with an
overflow on the addition?

If so, then why does the final asm code not appear to be catering for overflow?

Alternatively, could the whole block be simplified down to (int) during this
phase of the compile, thereby fixing the subsequent unnecessary usage of BX
during the RTL phase (as per comment #3)?
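
For illustration only (my own sketch, not output from any GCC pass), here is
roughly the source-level simplification the question above is asking about,
assuming a 32-bit target where int, intptr_t and pointers are all 32 bits wide:

#include <stdint.h>

typedef struct toto_s *toto_t;

/* Original test case from this PR: the addition is widened to 64 bits. */
toto_t add_widened (toto_t a, toto_t b)
{
  int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b & ~1L);
  return (toto_t)(intptr_t)tmp;
}

/* The int-only variant the question is about: since the result is truncated
   back to 32 bits anyway, the low 32 bits of the sum are the same whether the
   addition is done in 32 or 64 bits, so the 64-bit temporary contributes
   nothing to the final value (C-level signed-overflow rules aside). */
toto_t add_narrow (toto_t a, toto_t b)
{
  intptr_t tmp = (intptr_t)a + ((intptr_t)b & ~(intptr_t)1);
  return (toto_t)tmp;
}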

As an aside (possibly another bug report?), it appears there is a regression in
4.6.0 that requires an additional movl compared to what is in the original bug
description (4.5.1 does not suffer from this):

.file   "PR35926.c"
.text
.p2align 4,,15
.globl  add
.type   add, @function
add:
.LFB0:
.cfi_startproc
pushl   %ebx
.cfi_def_cfa_offset 8
.cfi_offset 3, -8
movl    12(%esp), %eax
movl    8(%esp), %ecx
popl    %ebx
.cfi_def_cfa_offset 4
.cfi_restore 3
andl    $-2, %eax
addl    %eax, %ecx   <== order of regs inverted
movl    %ecx, %eax   <== resulting in unnecessary movl
ret
.cfi_endproc
.LFE0:
.size   add, .-add
.ident  "GCC: (GNU) 4.6.0 20110115 (experimental)"
.section        .note.GNU-stack,"",@progbits


[Bug rtl-optimization/47477] New: [4.6 regression] Sub-optimal mov at end of method

2011-01-26 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477

   Summary: [4.6 regression] Sub-optimal mov at end of method
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tony.popple...@gmail.com
  Host: Linux x86-64


Whilst investigating PR35926, I noticed a slight inefficiency in code generated
by 4.6.0 (20110115) versus that of 4.5.1.

Duplicating the C code here from that PR for easy reference:

#include <stdint.h>   /* for int64_t / intptr_t */

typedef struct toto_s *toto_t;
toto_t add (toto_t a, toto_t b) {
  int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b&~1L);
  return (toto_t)(intptr_t) tmp;
}

The ASM generated by 4.6.0 with flags -O3 is:

.file   "PR35926.c"
.text
.p2align 4,,15
.globl  add
.type   add, @function
add:
.LFB0:
.cfi_startproc
pushl   %ebx
.cfi_def_cfa_offset 8
.cfi_offset 3, -8
movl    12(%esp), %eax
movl    8(%esp), %ecx
popl    %ebx
.cfi_def_cfa_offset 4
.cfi_restore 3
andl    $-2, %eax
addl    %eax, %ecx   <== order of regs inverted
movl    %ecx, %eax   <== resulting in unnecessary movl
ret
.cfi_endproc
.LFE0:
.size   add, .-add
.ident  "GCC: (GNU) 4.6.0 20110115 (experimental)"
.section        .note.GNU-stack,"",@progbits

In 4.5.1, the last bit is one instruction shorter, with just:
addl    %ecx, %eax
ret

A bug search revealed the similar-sounding PR44249; however, that one is
apparently a regression in 4.5 too, whereas this only affects 4.6.


[Bug target/35926] Pushing / Poping ebx without using it.

2011-01-27 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926

--- Comment #6 from Tony Poppleton  2011-01-27 
16:55:26 UTC ---
For the record, the additional movl noticed above in GCC 4.6.0 has been
factored out into PR47477


[Bug rtl-optimization/47477] [4.6 regression] Sub-optimal mov at end of method

2011-01-27 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47477

--- Comment #5 from Tony Poppleton  2011-01-27 
17:58:12 UTC ---
The modified testcase in comment #4 also fixes the original bug with the
redundant push/pop of BX (as described in PR35926), so fixing this during tree
optimizations would be good.


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #2 from Tony Poppleton  2011-01-28 
16:55:48 UTC ---
Based on Richard's comment, I tried a modified version of the code replacing
the (1 << x) with just (16).

This shows that GCC (4.6 & 4.5.2) does perform an optimization similar to llvm,
and uses the testb instruction:
movl    %edi, %eax
movl    $1, %edx
testb   $16, %al
cmove   %edx, %eax
ret

Therefore, perhaps it would be beneficial not to convert "a & (n << x)" into
"(a >> x) & n" in the special case where the value n is 1 (or potentially any
power of 2)?
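
As a concrete illustration of the two forms being discussed (my own example,
not code from the PR), with n equal to 1:

/* "a & (n << x)" with n == 1: a single-bit test, which can map onto one
   testb instruction when x is a compile-time constant. */
int test_shift_mask (unsigned a, unsigned x)
{
  return (a & (1u << x)) != 0;
}

/* The form GCC currently canonicalizes it to, "(a >> x) & n". */
int test_mask_shift (unsigned a, unsigned x)
{
  return ((a >> x) & 1u) != 0;
}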

Incidentally, the above code could have been optimized further to remove the
usage of edx entirely (I will make a separate PR about that).


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #3 from Tony Poppleton  2011-01-28 
17:02:50 UTC ---
Actually, what I said above isn't correct: had it compiled down to "bt $4, %al"
then it would make sense to handle that special case, but as it used "testb"
the result is inconclusive.


[Bug rtl-optimization/47521] New: Unnecessary usage of edx register

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

   Summary: Unnecessary usage of edx register
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tony.popple...@gmail.com


In testing PR46235 I noticed some minor inefficiency in the usage of an extra
register.

The C code is:

int foo(int a, int x, int y)
{
    if (a & (16))
        return a;
    return 1;
}

Which produces the asm:
movl    %edi, %eax
movl    $1, %edx
testb   $16, %al
cmove   %edx, %eax
ret

The above code could have been further optimized to remove the usage of edx:
movl    $1, %eax
test    $16, %edi
cmove   %edi, %eax
ret


[Bug rtl-optimization/47521] Unnecessary usage of edx register

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

Tony Poppleton  changed:

   What|Removed |Added

  Known to work||4.3.5
  Known to fail||4.4.5, 4.5.2, 4.6.0

--- Comment #1 from Tony Poppleton  2011-01-28 
17:23:19 UTC ---
I probably meant "testb   $16, %dil" above...

GCC 4.3.5 avoids the usage of edx, although it too probably has 1 instruction
too many:
testb   $16, %dil
movl    $1, %eax
cmove   %eax, %edi
movl    %edi, %eax
ret


[Bug rtl-optimization/46235] inefficient bittest code generation

2011-01-28 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46235

--- Comment #4 from Tony Poppleton  2011-01-28 
18:08:15 UTC ---
As a quick test, I commented out the block with the following comment in
fold-const.c:
  /* If this is an EQ or NE comparison with zero and ARG0 is
 (1 << foo) & bar, convert it to (bar >> foo) & 1.  Both require
 two operations, but the latter can be done in one less insn
 on machines that have only two-operand insns or on which a
 constant cannot be the first operand.  */

This produces the following asm code:
movl    $1, %edx
movl    %edi, %eax
movl    %esi, %ecx
movl    %edx, %edi
sall    %cl, %edi
testl   %eax, %edi
cmove   %edx, %eax
ret
(using modified GCC 4.6.0 20110122)

So whilst I was hoping for an easy quick fix, it appears that the required
optimization to convert it into a "btl" test isn't there later in the
compilation.

Incidentally, from looking at http://gmplib.org/~tege/x86-timing.pdf, it
appears that "bt" is slow on the P4 architecture (8 cycles, if I am reading it
correctly, which sounds slow), so the llvm code in the bug description isn't
necessarily an optimization on that arch.  Newer chips would probably still
benefit, though.


[Bug rtl-optimization/47521] [Regression 4.4/4.5/4.6] Unnecessary usage of edx register

2011-01-29 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

Tony Poppleton  changed:

   What|Removed |Added

Summary|Unnecessary usage of edx|[Regression 4.4/4.5/4.6]
   |register|Unnecessary usage of edx
   ||register

--- Comment #2 from Tony Poppleton  2011-01-29 
08:41:47 UTC ---
Changing bug title to include regression, as 4.3.5 is able to avoid the usage
of edx.


[Bug target/35926] Pushing / Poping ebx without using it.

2011-02-01 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35926

Tony Poppleton  changed:

   What|Removed |Added

   Last reconfirmed|2008-12-28 06:57:47 |2011-02-01 13:13
  Known to fail||4.1.2, 4.3.5, 4.4.5, 4.5.2,
   ||4.6.0
   Severity|normal  |enhancement

--- Comment #7 from Tony Poppleton  2011-02-01 
13:44:28 UTC ---
Set the "known to fail" and "last reconfirmed" fields, and changed severity to
"enhancement".


[Bug tree-optimization/47555] [4.4 Regression] Huge memory usage when optimizing

2011-02-01 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47555

--- Comment #6 from Tony Poppleton  2011-02-01 
14:45:28 UTC ---
Out of interest, could this parameter of 20 be automatically tuned based on the
available RAM?

For systems with a lot of RAM, it might make sense to increase the parameter
above 20 (assuming this produces better code in the end).

Whilst users could override this using the flag mentioned in comment #3,
auto-detecting this parameter would make it easier for the end user.
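
Purely as an illustration of the idea (not existing GCC behaviour; the
threshold, the scaling rule and the cap are all made up), something along
these lines could pick a larger default on machines with more RAM:

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    long param = 20;                  /* current default value */
    struct sysinfo si;

    if (sysinfo(&si) == 0) {
        unsigned long long total =
            (unsigned long long)si.totalram * si.mem_unit;
        /* Hypothetical rule: double the parameter each time total RAM
           doubles beyond 8 GiB, capped at eight times the default. */
        while (total > (8ULL << 30) && param < 20 * 8) {
            param *= 2;
            total >>= 1;
        }
    }

    printf("suggested parameter value: %ld\n", param);
    return 0;
}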


[Bug target/34653] operation performed unnecessarily in 64-bit mode

2011-02-01 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653

Tony Poppleton  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011.02.01 16:45:31
 Ever Confirmed|0   |1
  Known to fail||4.3.5, 4.4.5, 4.5.2, 4.6.0

--- Comment #8 from Tony Poppleton  2011-02-01 
16:45:31 UTC ---
Confirmed that both the example in the description and the example in comment
#1 apply to GCC 4.3.5, 4.4.5, 4.5.2 and 4.6.0 (20110129).

Also confirmed the regression noted in comment #7, where an extra register is
used (ecx), resulting in an additional mov instruction.  This regression is
present in versions 4.4.5, 4.5.2 and 4.6.0 (20110129).  This regression could
possibly be related to PR47521, which also first appeared in 4.4.x.


[Bug rtl-optimization/47581] New: [4.6 regression] Unnecessary adjustments to stack pointer

2011-02-01 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47581

   Summary: [4.6 regression] Unnecessary adjustments to stack
pointer
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tony.popple...@gmail.com


Whilst investigating PR4079 (which affects PPC), I found some strange
adjustments to the stack pointer when compiling with 4.6.0 (20110129) on x86.

For reference, the C code from that PR is:

unsigned mulh(unsigned a, unsigned b)
{
    return ((unsigned long long)a * (unsigned long long)b) >> 32;
}

On 4.5.2 using "-O2 -m32 -fomit-frame-pointer", this produced the following
succinct code:
mulh:
movl    8(%esp), %eax
mull    4(%esp)
movl    %edx, %eax
ret
.size   mulh, .-mulh
.ident  "GCC: (GNU) 4.5.2"

However on 4.6.0 with the same arguments:
mulh:
.LFB0:
.cfi_startproc
subl    $4, %esp        <== isn't this unnecessary?
.cfi_def_cfa_offset 8
movl    12(%esp), %eax  <== this could just be 8(%esp)
mull    8(%esp)         <== this could just be 4(%esp)
addl    $4, %esp        <== isn't this unnecessary?
.cfi_def_cfa_offset 4
movl    %edx, %eax
ret
.cfi_endproc
.LFE0:
.size   mulh, .-mulh
.ident  "GCC: (GNU) 4.6.0 20110129 (experimental)"


[Bug rtl-optimization/47582] New: Combine chains of movl into movq

2011-02-01 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

   Summary: Combine chains of movl into movq
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: tony.popple...@gmail.com


The following C code (adapted from
http://stackoverflow.com/questions/4544804/in-what-cases-should-i-use-memcpy-over-standard-operators-in-c)
shows that adjacent sequences of movl could be combined into movq.

#include <string.h>   /* for memcpy in the M2 branch */

extern float a[5];
extern float b[5];

int main()
{
#if defined(M1)
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
    a[3] = b[3];
    a[4] = b[4];
#elif defined(M2)
    memcpy(a, b, 5*sizeof(float));
#endif
}

When compiled with "-O2 -fomit-frame-pointer" on GCC 4.3.5, 4.4.5, 4.5.2 and
4.6.0 (20110129), the following asm is produced for the -DM1 branch:
movl    b(%rip), %eax
movl    %eax, a(%rip)
movl    b+4(%rip), %eax
movl    %eax, a+4(%rip)
movl    b+8(%rip), %eax
movl    %eax, a+8(%rip)
movl    b+12(%rip), %eax
movl    %eax, a+12(%rip)
movl    b+16(%rip), %eax
movl    %eax, a+16(%rip)
ret

However for the -DM2 branch, the memcpy implementation shows that this can be
done more efficiently:
movq    b(%rip), %rax
movq    %rax, a(%rip)
movq    b+8(%rip), %rax
movq    %rax, a+8(%rip)
movl    b+16(%rip), %eax
movl    %eax, a+16(%rip)
ret

I presume that the memcpy is being done in hand-written asm?  If so, then once
this enhancement is done, presumably that portion of the memcpy code could be
converted to C and be just as efficient.


[Bug rtl-optimization/47521] Unnecessary usage of edx register

2011-02-03 Thread tony.poppleton at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47521

--- Comment #5 from Tony Poppleton  2011-02-03 
14:16:01 UTC ---
As a quick test, would this be fixed by re-ordering the register file to move
eax above edx?

If so, then another possible fix to this would be to effectively re-run the RA
pass multiple times, each time using a different register file, and then select
the one that produces the "best" code and discard the other RA attempts.

The register files would only differ in their sort order when register costs
are equal.  I am guessing that only a few such register files would be needed
(in particular ones where the eax is shuffled around), rather than every single
possible combination of sort orders (which would be prohibitive), so this
doesn't necessarily have to impact the length of compilation by much.

A metric would also be needed to select the "best" version of the compiled
code - possibly using the number of instructions and the number of registers
used?
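
To make that last point concrete, a trivial sketch of such a metric (my own
illustration, not anything that exists in GCC) could be as simple as a
weighted sum:

#include <stdio.h>

/* One candidate result from a register-allocation attempt. */
struct ra_result
{
  int n_insns;   /* instructions in the emitted code */
  int n_regs;    /* distinct hard registers used */
};

/* Lower is better; the weights are arbitrary illustration values that
   favour fewer instructions over lower register usage. */
static int
ra_score (struct ra_result r)
{
  return 4 * r.n_insns + r.n_regs;
}

int
main (void)
{
  struct ra_result with_edx = { 5, 3 };     /* e.g. the 4.6.0 code above */
  struct ra_result without_edx = { 4, 2 };  /* e.g. the hand-optimized code */

  printf ("prefer the attempt %s edx\n",
          ra_score (without_edx) < ra_score (with_edx) ? "without" : "with");
  return 0;
}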


[Bug rtl-optimization/47582] Combine chains of movl into movq

2017-07-22 Thread tony.poppleton at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

Tony Poppleton  changed:

   What|Removed |Added

   Last reconfirmed||2017-07-22
  Known to fail||5.1.1, 6.3.1, 7.1.1

--- Comment #4 from Tony Poppleton  ---
Retesting this with GCC 7.1.1 20170622 (Red Hat 7.1.1-3) shows that the movs
are still not being combined, even though dependency #23684 is marked as fixed.

The DM1 branch produces:
.LFB0:
.cfi_startproc
movss   b(%rip), %xmm0
xorl    %eax, %eax
movss   %xmm0, a(%rip)
movss   b+4(%rip), %xmm0
movss   %xmm0, a+4(%rip)
movss   b+8(%rip), %xmm0
movss   %xmm0, a+8(%rip)
movss   b+12(%rip), %xmm0
movss   %xmm0, a+12(%rip)
movss   b+16(%rip), %xmm0
movss   %xmm0, a+16(%rip)
ret
.cfi_endproc

Whilst the DM2 branch produces an even better result than any previous GCC
version I have tested:
.LFB0:
.cfi_startproc
movl    b+16(%rip), %eax
movdqu  b(%rip), %xmm0
movl    %eax, a+16(%rip)
xorl    %eax, %eax
movups  %xmm0, a(%rip)
ret

[Bug rtl-optimization/47582] Combine chains of movl into movq

2017-07-22 Thread tony.poppleton at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

Tony  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
   Last reconfirmed|2017-07-22 00:00:00 |
  Known to work||7.1.1
 Resolution|--- |FIXED
  Known to fail|7.1.1   |

--- Comment #6 from Tony  ---
Excellent, many thanks

[Bug rtl-optimization/47582] Combine chains of movl into movq

2015-06-10 Thread tony.poppleton at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

--- Comment #2 from Tony Poppleton  ---
Re-testing this with GCC 5.1, the code appears to be even less efficient for
both cases:

DM1:
.LFB0:
.cfi_startproc
movss   b(%rip), %xmm0
xorl    %eax, %eax
movss   %xmm0, a(%rip)
movss   b+4(%rip), %xmm0
movss   %xmm0, a+4(%rip)
movss   b+8(%rip), %xmm0
movss   %xmm0, a+8(%rip)
movss   b+12(%rip), %xmm0
movss   %xmm0, a+12(%rip)
movss   b+16(%rip), %xmm0
movss   %xmm0, a+16(%rip)
ret
.cfi_endproc

DM2:
.LFB0:
.cfi_startproc
movq    b(%rip), %rax
movq    %rax, a(%rip)
movq    b+8(%rip), %rax
movq    %rax, a+8(%rip)
movl    b+16(%rip), %eax
movl    %eax, a+16(%rip)
xorl    %eax, %eax
ret
.cfi_endproc

Why is the "xorl" appearing in both cases?  Should this be logged as a separate
bug.

Incidentally, compiling with -O1 produces the same code as -O2 on older GCCs
(as in the description comment above).

My total guess is that it is due to a and b not having any initial values, and
an optimization that takes value ranges into account is getting confused?


[Bug rtl-optimization/47582] Combine chains of movl into movq

2015-06-10 Thread tony.poppleton at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47582

--- Comment #3 from Tony Poppleton  ---
Ignore the last comment - I hadn't spotted the "int" return value on main...

So the code is actually more correct than in previous versions, and there is no
change to the status of this bug.