[Bug rtl-optimization/94026] combine missed opportunity to simplify comparisons with zero

2020-03-20 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94026

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #5 from Wilco  ---
(In reply to Fei Yang from comment #4)
> (In reply to Fei Yang from comment #0)
> > Created attachment 47966 [details]
> > proposed patch to fix this issue
> > 
> > Simple test case:
> > int
> > foo (int c, int d)
> > {
> >   int a = (c >> d) & 7;
> > 
> >   if (a >= 2) {
> > return 1;
> >   }
> > 
> >   return 0;
> > }
> > 
> > Compile option: gcc -S -O2 test.c
> > 
> > 
> > On aarch64, GCC trunk emits 4 instructions:
> > asr w0, w0, 8
> > tst w0, 6
> > cset w0, ne
> > ret
> > 
> > which can be further simplified into:
> > tst x0, 1536
> > cset w0, ne
> > ret
> > 
> > We see the same issue on other targets such as i386 and x86-64.
> > 
> > Attached please find proposed patch for this issue.
> 
> The previously posted test case is not correct.
> Test case should be:
> int fifth (int c)
> {
> int a = (c >> 8) & 7;
> 
> if (a >= 2) {
> return 1;
> } else {
> return 0;
> }
> }

Simpler cases are:

int f1(int x) { return ((x >> 8) & 6) != 0; }
int f2(int x) { return ((x << 2) & 24) != 0; }
int f3(unsigned x) { return ((x << 2) & 15) != 0; }
int f4(unsigned x) { return ((x >> 2) & 14) != 0; }
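
For reference, here are hand-derived equivalents (not GCC output) showing what
these should combine to; only the mask bits that survive the shift matter, so
each test reduces to a single mask test on the original operand:

int f1_expected(int x)      { return (x & (6 << 8)) != 0; }   /* tst with 0x600 */
int f2_expected(int x)      { return (x & (24 >> 2)) != 0; }  /* tst with 6 */
int f3_expected(unsigned x) { return (x & (15 >> 2)) != 0; }  /* tst with 3 */
int f4_expected(unsigned x) { return (x & (14 << 2)) != 0; }  /* tst with 56 */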

[Bug tree-optimization/91322] [10 regression] alias-4 test failure

2020-03-26 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Martin Liška from comment #1)
> I've just run the test-case on aarch64 and it works fine (-O2, -O2 -flto,
> -O3 -flto -fno-early-inlining). And lto.exp testsuite works fine on aarch64.
> @Wilco: Can you please double-check?

Yes it now works on AArch64, but I still see failures on Arm.

[Bug tree-optimization/91322] [10 regression] alias-4 test failure

2020-04-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #10 from Wilco  ---
(In reply to Christophe Lyon from comment #6)
> Created attachment 48184 [details]
> GCC passes dumps

So according to that, in 105t.vrp1 it removes the branch and unconditionally
calls abort:

Folding statement: _4 = _3 == 0B;
Matching expression match.pd:1737, gimple-match.c:708
Matching expression match.pd:1740, gimple-match.c:772
Matching expression match.pd:1747, gimple-match.c:826
Not folded
Folding statement: if (_5 == 0)
gimple_simplified to if (1 != 0)
Folded into: if (1 != 0)

Folding statement: return;
Not folded
Folding statement: __builtin_abort ();
Not folded
Removing dead stmt _5 = __builtin_constant_p (_4);


It doesn't make sense; these are the VRP ranges:

_1: short int * * VARYING
_2: short int * * VARYING
_3: short int * VARYING
_4: bool VARYING
_5: int [0, 0]
_10: struct a * VARYING

So somehow it decides that __builtin_constant_p (VARYING) == [0,0]???

[Bug tree-optimization/94442] [AArch64] Redundant ldp/stp instructions emitted at -O3

2020-04-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94442

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
This should be marked as [10 regression].

[Bug rtl-optimization/93565] Combine duplicates count trailing zero instructions

2020-02-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #6 from Wilco  ---
(In reply to Segher Boessenkool from comment #4)
> Why would that be unlikely?  It lengthens the lifetime of that pseudo,
> potentially significantly.

The move should be easy to remove since it is already known to be redundant. I
don't understand how you can say that the lifetime is increased when
duplicating the instruction is actually what increases the lifetime.
Duplicating also requires extra registers to hold the two identical results.

[Bug middle-end/66462] GCC isinf/isnan/... builtins cause sNaN exceptions

2019-09-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66462

--- Comment #12 from Wilco  ---
(In reply to Segher Boessenkool from comment #11)
> I currently have
> 
> ===
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index ad5135c..bc3d318 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -9050,6 +9050,12 @@ fold_builtin_interclass_mathfn (location_t loc, tree
> fnde
>if (interclass_mathfn_icode (arg, fndecl) != CODE_FOR_nothing)
>  return NULL_TREE;
>  
> +  /* None of these builtins are ever exceptional, not even for signaling
> NaNs,
> + so we cannot do any of these optimizations that involve a floating
> point
> + comparison.  */
> +  if (flag_signaling_nans)
> +return NULL_TREE;
> +
>mode = TYPE_MODE (TREE_TYPE (arg));
>  
>bool is_ibm_extended = MODE_COMPOSITE_P (mode);
> ===
> 
> but we should really handle this with some non-signaling insns, not punt
> it to libm to do.

Well, we should simply commit Tamar's patch again since it works fine on all
IEEE targets and showed performance gains across many targets. Any issues with
weird 128-bit FP formats can be addressed separately.

[Bug fortran/91690] Slow IEEE intrinsics

2019-09-09 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91690

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #6 from Wilco  ---
(In reply to Steve Kargl from comment #5)
> On Mon, Sep 09, 2019 at 06:25:53PM +, wilco at gcc dot gnu.org wrote:
> > > 
> > > The Fortran standard may require this behavior.  18-007r1 page 435
> > 
> > But none of that is needed since a correct implementation of isnan
> > etc does not affect the floating point state at all (see PR66462).
> > 
> 
> OK.  So you can special case IEEE_IS_NAN().  What about
> the other IEEE_ARITHMETIC functions?

The same is true for the other isxxx functions from C99 and the new iszero and
isdenormal. Are there other functions besides these in IEEE_ARITHMETIC?

> PR66462 has been re-opened.  Tamir seemed to have developed
> 4 different patches, yet no one can agree on "a correct
> implementation" for all targets.

As his latest comments explain, Tamar's last patch is correct. That should be
the default, since an integer implementation which doesn't affect the floating
point state is the only correct one. As a bonus, it also happens to be the
fastest.

Either way, I don't believe the Fortran front-end should try to work around
mid-end bugs. As reported, a save/restore is prohibitively expensive on all
targets, so it shouldn't even be contemplated.
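
For illustration, a minimal sketch (not Tamar's actual patch) of the kind of
integer-only classification meant above, assuming IEEE binary64 doubles; it
inspects the bit pattern so no FP comparison is emitted and no exception flags
are touched, even for signaling NaNs:

#include <stdint.h>
#include <string.h>

static inline int
isnan_bits (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);   /* bit copy, no FP instruction */
  /* NaN: exponent all ones and a non-zero mantissa, i.e. |x| above +Inf.  */
  return (bits & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}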

[Bug middle-end/91753] Bad register allocation of multi-register types

2019-09-12 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91753

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> lower-subreg should have be able to help here.  I wonder why it did not ...

I'm not sure how it can help. When you write a part of a multi-register mode
using subreg, you get incorrect liveness info. This is why splitting 64-bit
types on 32-bit targets before register allocation gives such a huge gain.

The approach I used in other compilers was to generate multi-register virtual
registers using a single create operation rather than via multiple subreg
writes.

[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64

2019-09-14 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #4 from Wilco  ---
(In reply to martin krastev from comment #3)
> So it appears to be a clash between -fcommon and -fvisibility=hidden during
> -fpic -- passing either -fno-common or -fno-pic drops the GOT indirection.
> And explicitly hiding the symbol obviously solves it. But the crux of the
> issue, IMO, is a multi-platform one -- that behavior deviates on gcc-8.2
> from platform to platform. On amd64 it suffices to -fvisibility=hidden to
> stop GOT detours, whereas on aarch64 it's -fvisibility=hidden -fno-common.
> As a result aarch64 performance gets penalized in unsuspecting multi-plats.

-fno-common should really become the default; -fcommon causes way too many
issues.
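
For illustration, a minimal example (hypothetical, not from this PR) of the
interaction: with -fcommon a tentative definition becomes a common symbol,
which the aarch64 compiler here apparently does not treat as locally defined,
so -fvisibility=hidden -fpic still goes through the GOT; -fno-common (or an
explicit initializer) makes it an ordinary local definition and drops the
indirection.

int counter;          /* tentative definition: a common symbol with -fcommon */

int get_counter (void)
{
  return counter;     /* GOT load with -fpic -fvisibility=hidden -fcommon on aarch64 */
}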

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #27 from Wilco  ---
(In reply to Segher Boessenkool from comment #26)
> Yeah, and it probably should be a param (that different targets can default
> differently, per CPU probably).  On most Power CPUs all loops take a minimum
> number of cycles per iteration (say, three), but that translates to a lot of
> instructions (say, >10).
> 
> Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
> length loops only?

Agreed, there is no reason not to unroll (or vectorize) with -O2 given that
several other compilers do (and beat GCC as a result). It's best to limit
unrolling to small loops, and to 2x unless the loop body is only 1-2
instructions.

[Bug middle-end/84071] [7/8 regression] nonzero_bits1 of subreg incorrect

2018-01-27 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84071

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> I think this is related to PR 81443.  Can you see if the 7.3 release has it
> fixed?

It's related indeed: in both cases load_extend_op is incorrectly used on
registers rather than on MEMs. The first primarily fails on targets that return
ZERO_EXTEND, the second on targets that use SIGN_EXTEND.

[Bug rtl-optimization/81443] [8 regression] build/genrecog.o: virtual memory exhausted: Cannot allocate memory

2018-01-27 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81443

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #21 from Wilco  ---
See also PR84071 which has bad codegen from r242326.

[Bug middle-end/84071] [7/8 regression] wrong elimination of zero-extension after sign-extended load

2018-01-30 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84071

--- Comment #8 from Wilco  ---
(In reply to Eric Botcazou from comment #6)
> > They are always written but have an undefined value. Adding 2 8-bit values
> > results in a 9-bit value with WORD_REGISTER_OPERATIONS.
> 
> If they have an undefined value, then WORD_REGISTER_OPERATIONS must not be
> defined for ARM.  Here's the definition:
> 
>  -- Macro: WORD_REGISTER_OPERATIONS
>  Define this macro to 1 if operations between registers with
>  integral mode smaller than a word are always performed on the
>  entire register.  Most RISC machines have this property and most
>  CISC machines do not.
> 
> If the 8-bit addition is not performed on the entire 32-bit register, then
> this is not a WORD_REGISTER_OPERATIONS target.

The addition is performed on the full 32-bit register, so this obviously means
that the top 24 bits have an undefined value.
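
As a small illustration of that point (a sketch, not from the PR):

/* 0xff + 0x01: the 32-bit register holds 0x100 after the add, so the bits
   above bit 7 are not a zero- or sign-extension of the 8-bit result (0x00);
   as far as the 8-bit operation is concerned they are undefined.  */
unsigned char
add8 (unsigned char a, unsigned char b)
{
  return a + b;
}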

[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code

2018-02-04 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Steve Ellcey from comment #0)
> Created attachment 43279 [details]
> Test case
> 
> The example code comes from milc in SPEC2006.
> 
> GCC on x86 or aarch64 generates better code with -O3 than it does with
> -Ofast or '-O3 -ffast-math'.  On x86 compiling with '-mfma -O3' I get 5
> vfmadd231sd instructions, 1 vmulsd instruction and 6 vmovsd.  With '-mfma
> -Ofast' I get 3 vfmadd231sd, 2 vaddsd, 3 vmulsd, and 6 vmovsd.  That is two
> extra instructions.
> 
> The problem seems to be that -Ofast turns on -ffast-math and that enables
> the global reassociation pass (tree-ssa-reassoc.c) and the code changes
> done there create some temporary variables which inhibit the recognition
> and use of fma instructions.
> 
> Using -O3 and -Ofast on aarch64 shows the same change.

I noticed this a while back: the reassociation pass has changed and we now get
far fewer FMAs.

See https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00771.html

[Bug tree-optimization/90838] Detect table-based ctz implementation

2019-06-11 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Jakub Jelinek from comment #1)
> __builtin_ctzll is undefined for 0, while the above is well defined and
> returns 0.

When the target ctz is well defined and returns 64 for 0, and we want to return
0 instead, this will work:

__builtin_ctzll (b) & 63
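
For illustration, a sketch of that use; note it relies on the target's ctz
instruction being defined to return 64 for 0 (at the C level
__builtin_ctzll (0) is still undefined):

static inline int
ctz_or_zero (unsigned long long b)
{
  /* 64 & 63 == 0, so a zero input maps to 0 as desired.  */
  return __builtin_ctzll (b) & 63;
}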

[Bug tree-optimization/91144] [10 regression] 176.gcc miscompare after r273294

2019-07-11 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91144

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> SPEC2000's GCC is known to have strict aliasing issues in it :).
> Does adding -fno-strict-aliasing help?

Yes - like I mentioned :-) It has never been needed though. Vlad's SPEC2000
config doesn't have it either for example.

[Bug middle-end/82853] Optimize x % 3 == 0 without modulo

2018-09-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #23 from Wilco  ---
(In reply to Jakub Jelinek from comment #22)
> So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0,
> because for x == 0 the test will yield a different value, as 0xffffffffU % 3U
> is 0 and 0 % 3U is also 0, x % 3U == 1 is equivalent to
> (x - 1) * 0xaaaaaaabU <= 0x55555554U, but x % 3U == 0 is equivalent to
> x * 0xaaaaaaabU <= 0x55555555U.
> Now to see if something useful can be used also for the even divisors.

Yes, for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but you
can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler. Basically
(x % C) == N is simpler for 2 values of N.
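
A small self-contained check of those identities (illustration only;
0xaaaaaaab is the inverse of 3 modulo 2^32):

#include <assert.h>
#include <stdint.h>

static int mod3_is_0 (uint32_t x) { return x * 0xaaaaaaabu <= 0x55555555u; }
static int mod3_is_1 (uint32_t x) { return x * 0xaaaaaaabu >= 0xaaaaaaabu; }

int
main (void)
{
  for (uint32_t x = 0; x < 1000000; x++)
    {
      assert (mod3_is_0 (x) == (x % 3u == 0));
      assert (mod3_is_1 (x) == (x % 3u == 1));
    }
  return 0;
}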

[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code

2018-02-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114

--- Comment #3 from Wilco  ---
(In reply to Richard Biener from comment #1)
> This is probably related to targetm.sched.reassociation_width where reassoc
> will widen a PLUS chain so several instructions will be executable in
> parallel
> without dependences.  Thus, (x + (y + (z + w))) -> (x + y) + (z + w).  When
> all of them are fed by multiplications this goes from four fmas to two.
> 
> It's basically a target request we honor so it works as designed.
> 
> At some point I thought about integrating FMA detection with reassociation.

It should understand FMA indeed: A*B + p[0] + C*D + p[1] + E*F + p[2] can
become (((p[0] + p[1] + p[2]) + A*B) + C*D) + E*F.
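
For concreteness, the expression above as a compilable function (illustration
only); with -Ofast the reassociation pass may regroup the sum into independent
pairs, which loses some of the multiply-add shapes the FMA matcher needs, while
the grouping suggested above keeps every multiply feeding an accumulate:

double
fma_chain (const double *p, double A, double B, double C, double D,
           double E, double F)
{
  return A*B + p[0] + C*D + p[1] + E*F + p[2];
}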

Also we're missing a reassociation depth parameter. You need to be able to
specify how long a chain needs to be before it is worth splitting - the example
shows a chain of 5 FMAs is not worth splitting since FMA latency on modern
cores is low, but if these were integer operations (not MADD) then the chain
should be split.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #21 from Wilco  ---
(In reply to Douglas Mencken from comment #18)
> (In reply to Wilco from comment #17)
> 
> > Yes that should work.
> 
> Oops, but it doesn’t. I just tested it with patched 8.2. Same messages, same
> breakage

That's odd. The stack pointer is definitely 16-byte aligned in all cases right?
Can you check gcc.dg/pr78468.c passes and attach the disassembly please?

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #24 from Wilco  ---
(In reply to Douglas Mencken from comment #22)
> (In reply to Wilco from comment #21)
> 
> > That's odd. The stack pointer is definitely 16-byte aligned in all cases
> > right? 
> 
> As I know, PowerPC has no special “stack pointer”, it is just one of general
> purpose register, conventionally it is r1. Instruction like “stwu r3,-2(r1)”
> which are common for prologues easily bin any alignment.

STACK_BOUNDARY is the minimum stack alignment. The optimizer relies on this
being correct. If the ABI or prologue doesn't guarantee the minimum alignment
then alloca will fail, so it's essential to set STACK_BOUNDARY correctly.
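
For illustration, a check in the spirit of gcc.dg/pr78468.c (a sketch, assuming
the ABI requires 16-byte stack alignment): alloca blocks are laid out from the
stack pointer using STACK_BOUNDARY and STACK_DYNAMIC_OFFSET, so if those
overstate the real alignment the returned block ends up misaligned.

#include <stdint.h>

int
alloca_is_16_byte_aligned (unsigned n)
{
  void *p = __builtin_alloca (n);
  return ((uintptr_t) p & 15) == 0;
}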

> > Can you check gcc.dg/pr78468.c passes and attach the disassembly
> > please?
> 
> Using which compiler? xgcc from stage1? Is it buildable alone by itself?

Yes the stage1 compiler would be fine or alternatively use --disable-bootstrap
to get an installed compiler.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #26 from Wilco  ---
(In reply to Douglas Mencken from comment #25)
> (In reply to Wilco from comment #24)
> 
> > Yes the stage1 compiler would be fine or alternatively use
> > --disable-bootstrap to get an installed compiler.
> 
> I’m yet at libstdc++ of stage2 (which means that it succeeded the place of
> failing) for 8.2 patched with the reversion I’ve just published. About four
> hours and I’d get working 8.2

If that's the case then STACK_BOUNDARY is incorrect. So what is the actual
alignment on Darwin, 8-byte alignment? How are Altivec instructions, which do
require 16-byte alignment, handled?

> Or do you want to look at results with your r251713? Does it matter?

Either would be interesting - it would show what alignment the prologue really
uses and what code it generates for alloca.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #32 from Wilco  ---
(In reply to Segher Boessenkool from comment #29)
> It aligns the stack to 16:
> 
>   # r3 is size, at entry
>   addi r3,r3,18
>   ...
>   rlwinm r3,r3,0,0,27
>   ...
>   neg r3,r3
>   ...
>   lwz r2,0(r1)
>   ...
>   stwux r2,r1,r3
> 
> (the rlwinm is  r3 &= ~15;  )

So this rounds up the size but also adds an extra 16 bytes to the requested
allocation. The alloca blocks don't get correctly aligned since
STACK_DYNAMIC_OFFSET returns 72 (which is not a multiple of 16):

t1_a4
addi r2,r1,72  -> 72 % 16 != 0 (correct value would be 64)
stw r6,60(r1)
stw r2,56(r1)

This also shows in t1/t2_a32:

addi r2,r1,103  -> 31 + 72 = 103 (correct value would be 80+31)
stw r6,64(r1)
rlwinm r2,r2,0,0,26

So this proves STACK_DYNAMIC_OFFSET is incorrect indeed. If there are still
failures with that fixed then the stack pointer must also be unaligned.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #33 from Wilco  ---
(In reply to Iain Sandoe from comment #30)

> From "Mac_OS_X_ABI_Function_Calls.pdf"
> 
> m32 calling convention
> 
> Prologs and Epilogs
> The called function is responsible for allocating its own stack frame,
> making sure to preserve 16-byte alignment in the stack. This operation is
> accomplished by a section of code called the prolog, which the compiler
> places before the body of the subroutine. After the body of the subroutine,
> the compiler places an epilog to restore the processor to the state it was
> prior to the subroutine call.

So functions must preserve 16-byte alignment, but can they rely on the stack
being always 16-byte aligned on entry?

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #38 from Wilco  ---
(In reply to Douglas Mencken from comment #37)
> And some more in my wish list. May GCC don’t generate these
> 
> .align 2
> 
> in text section? Any, each and every powerpc instruction is 32bit-wide, no
> and never more, no and never less, so these aligns are redundant

You can have data in text sections, including bytes and half words. Even if
instructions were aligned automatically, the function label might still be
unaligned if it was preceded by a byte.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #41 from Wilco  ---
(In reply to Douglas Mencken from comment #40)

> To build it, I patched its sources with fix_gcc8_build.patch reversion
> together with changes from comment #16

So what is the disassembly now? The 2nd diff still shows the original unaligned
STACK_DYNAMIC_OFFSET.

[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions

2018-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #43 from Wilco  ---
(In reply to Douglas Mencken from comment #42)
> (In reply to Wilco from comment #41)
> 
> > So what is the disassembly now?
> 
> $ /Developer/GCC/8.2p/PowerPC/32bit/bin/gcc -O2 -fno-inline pr78468.c
> -save-temps
> $ mv pr78468.s ~/
> $ diff -u ~/8.2patched-pr78468.s ~/pr78468.s
> 
> Shows nothing, so they are identical

So that patch didn't do anything to fix STACK_DYNAMIC_OFFSET then. If it has no
effect then is that header really used?

> > The 2nd diff still shows the original unaligned STACK_DYNAMIC_OFFSET.
> 
> Second is vanilla 8.2, unpatched, which is marked with “-” in diff I posted
> 
> Possible problem is that in
> 
> -  if (size_align > known_align)
> - size_align = known_align;
> +  if (extra && size_align > BITS_PER_UNIT)
> +size_align = BITS_PER_UNIT;
> 
> you forgot if (extra /* assumed >0 */

No the problem is not in this code. It's STACK_DYNAMIC_OFFSET which is wrong.

[Bug tree-optimization/69336] Constant value not detected

2016-01-29 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69336

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #13 from Wilco  ---
(In reply to Richard Biener from comment #6)
> Author: rguenth
> Date: Tue Jan 19 13:27:11 2016
> New Revision: 232559
> 
> URL: https://gcc.gnu.org/viewcvs?rev=232559&root=gcc&view=rev

Even with this fix it still fails SPEC2006 gamess on AArch64 exactly as
reported in bug 69368.  It passes if I revert the EXPR_SINGLE case in
hashable_expr_equal_p to what it was before r232055 (ie. equal_mem_array_ref_p
is still not correct).

[Bug tree-optimization/69368] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-01 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...

[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-01 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #5 from Wilco  ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...
(In reply to Richard Biener from comment #4)
> So - can you please bisect to a source file (just drop all others to -O0) at
> least?

Yes I am working on that. It's OK with -O2, but starts failing with -O3/-Ofast.

[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-01 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #6 from Wilco  ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...
(In reply to Richard Biener from comment #4)
> So - can you please bisect to a source file (just drop all others to -O0) at
> least?

It looks like the failure is in mp2.fppized.o. Compiling it with -O3
-fomit-frame-pointer -fno-aggressive-loop-optimizations -fno-inline causes it
to fail the exam29 test with the recent tree-ssa-scopedtables.c changes, while
it passes without them. The diff is quite large so it's hard to tell which
function it is...

[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-01 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #8 from Wilco  ---
In a few functions GCC decides that the assignments in loops are redundant. The
loops still execute but have their loads and stores removed. Eg. the first DO
loop in MP2NRG should be:

.L1027: 
  add w7, w7, 1
  add w8, w8, w10   
  cmp w7, w26 
  beq .L1026   
.L1029:  
  add w0, w11, w7
  add x0, x2, x0, sxtw 3
  ldr x1, [x0, -8] 
  add x0, x2, x7, sxtw 3
  str x1, [x0, -8] 
  cmp w9, 0 
  ble .L1027 

But with the scopedtables change it becomes:

.L1027:  
  add w2, w2, 1
  cmp w2, w3   
  beq .L1026   
.L1029:  
  cmp w4, 0
  ble .L1027

[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-01 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #9 from Wilco  ---
The loops get optimized away in dom2. The info this phase emits is hard to
figure out, so it's not obvious why it thinks the array assignments are
redundant (the array is used all over the place so clearly cannot be removed).

[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion

2016-02-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #2 from Wilco  ---
Changing to c = 3 generates code after a short time. The issue is the recursive
calls to expand_ccmp_expr made while trying the 2 possible expansion orders to
determine their costs. That makes the algorithm exponential.

A fix would be to expand the LHS and RHS of both gs0 and gs1 in
expand_ccmp_expr_1 (and not in gen_ccmp_first) as these expansions will be
identical no matter what order we choose to expand gs0 and gs1 in. It's not
clear how easy that can be achieved using the existing interfaces though.

[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion

2016-02-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #3 from Wilco  ---
A simple workaround is to calculate cost1 early and only try the 2nd option if
the cost is low (ie. it's not a huge expression that may evaluate into lots of
ccmps). A slightly more advanced way would be to walk prep_seq_1 to count the
actual number of ccmp instructions.

[Bug libstdc++/69657] New: [6 Regression] abs() not inlined after including math.h

2016-02-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69657

Bug ID: 69657
   Summary: [6 Regression] abs() not inlined after including
math.h
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

Since a recent C++ header change abs() no longer gets inlined if we include an
unrelated header before it.

#include <math.h>
#include <cstdlib>

int
wrap_abs (int x)
{
  return abs (x) + std::abs(x);
}


Until recently GCC6 with -O2 -S produced for AArch64:

_Z8wrap_absi:
cmp w0, wzr
csneg   w0, w0, w0, ge
lsl w0, w0, 1
ret

However trunk GCC6 now emits:

_Z8wrap_absi:
stp x29, x30, [sp, -16]!
add x29, sp, 0
bl  abs
lsl w0, w0, 1
ldp x29, x30, [sp], 16

This may result in significant slowdowns as it adds a call and PLT indirection
to execute a 3-instruction function. There might be other builtin functions
that are affected in a similar way.

[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion

2016-02-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #5 from Wilco  ---
Proposed patch: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00206.html

[Bug c++/69657] [6 Regression] abs() not inlined after including math.h

2016-02-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69657

--- Comment #5 from Wilco  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Jonathan Wakely from comment #3) 
> > Recategorising as component=c++, and removing the regression marker (because
> > the change in libstdc++ that reveals this issue is required for 
> > conformance).
> 
> Adding back regression as it is an user visible regression even though only
> the library change, the user does not understand that.  In fact I suspect
> this comes from SPEC CPU 2006 or some other heavily used benchmarking code.

Correct, I found it while investigating major regressions in a few benchmarks.

Note I see the long, long long, int128, float, double, long double overloads,
but where is the plain int overload defined? 

I bet that adding an int overload that redirects to __builtin_abs similar to
the others will fix the issue.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-05 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #29 from Wilco  ---
(In reply to rguent...@suse.de from comment #28)
> On Fri, 5 Feb 2016, alalaw01 at gcc dot gnu.org wrote:

> > Should I raise a new bug for this, as both this and 53068 are CLOSED?
> 
> I think this has been discussed in some other dup already and
> the Fortran FE folks disagreed (it was never "legal", not even in F77).
> 
> I also don't see how it can be a FE only fix.  Possibly we can
> implemnet a middle-end switch that tells us that the size of commons
> is not to be trusted.  The FE could then set that flag with -std=legacy.
> 
> You can, after all, "simulate" the very same failure with C.

Isn't there already a special exception for C (array of size 1 at end of
structure)? The same exception could be enabled with -std=legacy. You'd only
need to do extra FE work if you wanted to just do this for COMMON, but that
seems hardly worth the extra effort - how much gain do we really get from the
array size 1 optimization apart from repeatedly breaking SPEC benchmarks in
various ways? Disabling it will likely remove the need for
-fno-aggressive-loop-optimizations as well.

[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508

2016-02-17 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #41 from Wilco  ---
(In reply to Jerry DeLisle from comment #40)
> Do you have a reduced test case of the Fortran code we can look at?

See comment 13/14, the same common array is declared with different sizes in
various places.

> I am concerned that you are changing things to accommodate invalid Fortran. 
> I don't have 416.gamess to look at.  I do see your suggested patch touches
> the Fortran frontend.

Since SPEC does not want to change the invalid Fortran, GCC will have to
accommodate it somehow.

> Maybe it needs to be put behind -std=legacy?
> 
> I assume the "(without flag)" means you intend to do something like that?

Yes, but it was suggested that -std=legacy wasn't the right flag in comment
35...

[Bug target/70048] New: [AArch64] Inefficient local array addressing

2016-03-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

Bug ID: 70048
   Summary: [AArch64] Inefficient local array addressing
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The following example generates very inefficient code on AArch64:

int f1(int i) { int p[1000]; p[i] = 1; return p[i + 10] + p[i + 20]; }

f1:
sub sp, sp, #4000
add w2, w0, 10
add x4, sp, 4000
add w1, w0, 20
mov w3, 1
add x0, x4, x0, sxtw 2
add x2, x4, x2, sxtw 2
sub x0, x0, #4096
add x1, x4, x1, sxtw 2
sub x1, x1, #4096
sub x2, x2, #4096
str w3, [x0, 96]
ldr w0, [x1, 96]
ldr w2, [x2, 96]
add sp, sp, 4000
add w0, w2, w0
ret

Previous compilers, eg GCC4.9 generate:

f1:
sub sp, sp, #4000
add w1, w0, 10
add w2, w0, 20
mov w3, 1
str w3, [sp,w0,sxtw 2]
ldr w1, [sp,w1,sxtw 2]
ldr w0, [sp,w2,sxtw 2]
add sp, sp, 4000
add w0, w1, w0
ret

[Bug target/70048] [AArch64] Inefficient local array addressing

2016-03-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #1 from Wilco  ---
The regression seems to have appeared on trunk around Feb 3-9.

[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing

2016-03-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #5 from Wilco  ---
(In reply to amker from comment #4)
> (In reply to ktkachov from comment #3)
> > Started with r233136.
> 
> That's why I forced base+offset out of memory reference and kept register
> scaling in in the first place.  I think my statement still holds, that
> GIMPLE optimizers should be improved to CSE register scaling expression
> among different memory references, then force base(sp)+offset parts out of
> memory reference in legitimize_address so that it can be merged.  I suspect
> this case is going to be very inefficient in loop context.  Unfortunately
> GIMPLE optimizers are weak now and r233136 is a workaround to it.  I want to
> revisit slsr sometime after stage4 to fix this issue.

The issue is that the new version splits immediates from virtual stack
variables. This is what GCC used to generate:

(insn 8 7 9 2 (set (reg:DI 82)
(plus:DI (reg/f:DI 68 virtual-stack-vars)
(const_int -4000 [0xfffffffffffff060]))) addroff.c:38 -1

However latest GCC does:

(insn 9 8 10 2 (set (reg:DI 84)
(plus:DI (reg/f:DI 68 virtual-stack-vars)
(reg:DI 83))) addroff.c:38 -1
 (nil))
(insn 10 9 11 2 (set (reg:DI 85)
(plus:DI (reg:DI 84)
(const_int -4096 [0xfffffffffffff000]))) addroff.c:38 -1
 (nil))

So we need to recognize virtual-stack-vars+offset as a special case rather than
treat it like any other add with immediate.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Jakub Jelinek from comment #1)
> These glibc inlines are just wrong, they should trust the compiler doing the
> right thing at least for contemporary versions.
> 05a910f7b420c2b831f35ba90e61c80f001c0606 should be IMHO reverted, it should
> be approached at the compiler side if the ARM folks see any performance
> issues with what is generated.

It's not about the generated code, it's about whether one calls memcpy or
mempcpy in the case where GCC doesn't inline it. Few targets support an
assembler mempcpy implementation, so by default the best option is to defer to
memcpy. Targets can add an optimized mempcpy and define
_HAVE_STRING_ARCH_mempcpy to get mempcpy (as x86 does).

Is there a mechanism by which GCC is told accurately whether the library it
targets supports an optimized mempcpy?

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #5 from Wilco  ---
(In reply to Jakub Jelinek from comment #3)
> If some arch in glibc implements memcpy.S and does not implement mempcpy.S,
> then obviously the right fix is to add mempcpy.S for that arch, usually it
> is just a matter of #include memcpy.S with some define USE_AS_MEMPCPY, and
> change a couple of instructions in the assembly.  You don't need to remember
> the original value of dest, usually you have the address to which you store
> bytes in some register, so it is just a matter of copying it to the return
> register.

We had a long discussion on this at the time. Very few targets have implemented
mempcpy.S so it would be a lot of effort to do so. Sharing a single
implementation of memcpy is typically not possible without slowing down memcpy
(which is more important), and thus you end up with 2 separate implementations
which creates unnecessary cache pollution.

So overall I believe the best option for most targets is to defer to memcpy
with an extra add instruction to compute the end. We could do that in GCC
rather than in the GLIBC headers, but then it would be harder to actually call
mempcpy in the case that a target supports an optimized implementation.
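
For reference, a sketch (hypothetical helper name) of the "defer to memcpy"
option described above; the end pointer costs one extra add and the single
optimized memcpy implementation is reused:

#include <string.h>

static inline void *
mempcpy_via_memcpy (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}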

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #6 from Wilco  ---
(In reply to Jakub Jelinek from comment #4)
> Note the choice of this in a header file is obviously wrong, if you at some
> point fix this up, then apps will still call memcpy rather than mempcpy,
> even when the latter is more efficient (because it doesn't have to save the
> length value in some location where it survives across the call).
> Note if you don't use the result of mempcpy, gcc is able to optimize it into
> memcpy, and tons of other optimizations.

It's highly unlikely that calling mempcpy would ever provide a performance
gain.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-04 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #9 from Wilco  ---
(In reply to H.J. Lu from comment #8)
> Inlining mempcpy uses a callee-saved register:
> 
...
> 
> Not inlining mempcpy is preferred.

If codesize is the only thing that matters... The cost is not at the caller
side but in requiring a separate mempcpy function which causes extra I-cache
misses. The only case where mempcpy makes sense is if you can use a shared
implementation with zero overhead to memcpy.

Btw those tests do want to use memcpy anyway, but you could use (mempcpy) to
avoid any unwanted header substitution to ensure the test passes.

[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing

2016-03-07 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #12 from Wilco  ---
(In reply to Jiong Wang from comment #11)
> (In reply to Richard Henderson from comment #10)
> > Created attachment 37890 [details]
> > second patch
> > 
> > Still going through full testing, but I wanted to post this
> > before the end of the day.
> > 
> > This update includes a virt_or_elim_regno_p, as discussed in #c7/#c8.
> > 
> > It also updates aarch64_legitimize_address to treat R0+R1+C as a special
> > case of R0+(R1*S)+C.  All of the arguments wrt scaling apply to unscaled
> > indices as well.
> > 
> > As a minor point, doing some of the expansion in a slightly different
> > order results in less garbage rtl being generated in the process.
> 
> Richard,
> 
>   I just recalled the reassociation of constant offset with vritual frame
> pointer will increase register pressure, thus cause bad code generation
> under some situations. For example, the testcase given at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173#c8
> 
> void bar(int i)  
> {
>   char A[10];
>   char B[10]; 
>   char C[10];   
>   g(A);
>   g(B);
>   g(C);  
>   f(A[i]); 
>   f(B[i]);
>   f(C[i]);   
>   return;   
> } 
> 
>   Before your patch we are generating  (-O2)
>   ===
> bar:
> stp x29, x30, [sp, -80]!
> add x29, sp, 0
> add x1, x29, 80
> str x19, [sp, 16]
> mov w19, w0
> add x0, x29, 32
> add x19, x1, x19, sxtw
> bl  g
> add x0, x29, 48
> bl  g
> add x0, x29, 64
> bl  g
> ldrb w0, [x19, -48]
> bl  f
> ldrb w0, [x19, -32]
> bl  f
> ldrb w0, [x19, -16]
> bl  f
> ldr x19, [sp, 16]
> ldp x29, x30, [sp], 80
> ret
> 
>   After your patch, we are generating:
>   ===
> bar:
> stp x29, x30, [sp, -96]!
> add x29, sp, 0
> stp x21, x22, [sp, 32]
> add x22, x29, 48
> stp x19, x20, [sp, 16]
> mov w19, w0
> mov x0, x22
> add x21, x29, 64
> add x20, x29, 80
> bl  g
> mov x0, x21
> bl  g
> mov x0, x20
> bl  g
> ldrb w0, [x22, w19, sxtw]
> bl  f
> ldrb w0, [x21, w19, sxtw]
> bl  f
> ldrb w0, [x20, w19, sxtw]
> bl  f
> ldp x19, x20, [sp, 16]
> ldp x21, x22, [sp, 32]
> ldp x29, x30, [sp], 96
> ret
> 
>   We are using more callee saved registers, thus extra stp/ldp generated.
> 
>   But we do will benefit from reassociation constant offset with virtual
> frame pointer if it's inside loop, because:
> 
>* vfp + const_offset is loop invariant
>* the virtual reg elimination on vfp will eventually generate one
>  extra instruction if it was not used with const_offset but another reg.
> 
>   Thus after this reassociation, rtl IVOPT can hoist it out of loop, and we
> will save two instructions in the loop. 
> 
>   A fix was proposed for loop-invariant.c to only do such reshuffling for
> loop, see https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01253.html.  That
> patch finally stopped because the issue PR62173 was fixed on tree level, and
> the pointer re-shuffling was considered to have hidding overflow risk though
> will be very rare.

I don't believe this is really worse - if we had say the same example with 3
pointers or 3 global arrays we should get the exact same code (and in fact
generating the same canonicalized form for different bases and scales is
essential). Once you've done that you can try optimizing accesses which differ
by a *small* constant offset.

[Bug middle-end/70140] New: Inefficient expansion of __builtin_mempcpy

2016-03-08 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140

Bug ID: 70140
   Summary: Inefficient expansion of __builtin_mempcpy
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The expansion of __builtin_mempcpy is inefficient on many targets (eg. AArch64,
ARM, PPC). The issue is due to not using the same expansion options that memcpy
uses in builtins.c. As a result GCC6 produces for __builtin_mempcpy(x, y, 32):

PPC:
   0:   38 a0 00 20 li  r5,32
   4:   48 00 00 00 b   4 
4: R_PPC_REL24  mempcpy
   8:   60 00 00 00 nop
   c:   60 42 00 00 ori r2,r2,0

AArch64:
mov x2, 32
b   mempcpy

A second issue is that GCC always calls mempcpy. mempcpy is not supported or
implemented efficiently in many (if not most) library/target combinations.
GLIBC only has 3 targets which implement an optimized mempcpy, so GLIBC
currently inlines mempcpy into memcpy by default unless a target explicitly
disables this. It seems better to do this in GCC so it works for all libraries.

[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing

2016-03-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #15 from Wilco  ---
(In reply to Richard Biener from comment #14)
> The regression in the original description looks severe enough to warrant
> some fixing even if regressing some other cases.

Agreed, I think the improvement from Richard H's "2nd patch" outweighs the
regression shown in comment 11. Assuming it benchmarks OK, for GCC6 that seems
the best we can do.

It is unlikely that legitimize_address could ever be used to improve addressing
of multiple arrays. The arrays must both be small and allocated close together
for this to be beneficial, and that is something you don't know in advance, let
alone could decide in a callback that only sees one address at a time. So that
requires a different approach.

[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing

2016-03-11 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #17 from Wilco  ---
(In reply to Jiong Wang from comment #16)

>   * for the second patch at #c10, if we always do the following no matter
> op0 is virtual & eliminable or not
> 
>  "op1 = force_operand (op1, NULL_RTX);"
> 
> then there is no performance regression as well. Even more, there is
> significant perf improvement beyond current trunk on one benchmark.

I ran this modified patch through a few benchmarks and there are no
regressions. The codesize of SPEC2006 reduces significantly in several cases
(-0.25%), while there are only 2 minor increases of +0.02%, so the patch
reduces instruction counts quite effectively.

So we should go with the modified patch for GCC6.

[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing

2016-03-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #20 from Wilco  ---
(In reply to Richard Henderson from comment #19)

> I wish that message had been a bit more complete with the description
> of the performance issue.  I must guess from this...
> 
> >   ldr dst1, [reg_base1, reg_index, #lsl 1]
> >   ldr dst2, [reg_base2, reg_index, #lsl 1]
> >   ldr dst3, [reg_base3, reg_index, #lsl 1]
> > 
> > into
> > 
> >   reg_index = reg_index << 1;
> >   ldr dst1, [reg_base1, reg_index]
> >   ldr dst2, [reg_base2, reg_index]
> >   ldr dst3, [reg_base3, reg_index]
> 
> that it must have something to do with the smaller cpus, e.g. exynosm1,
> based on the address cost tables.

Some CPUs emit separate uops to do the address shift by 1. So that would mean 6
uops in the first example vs 4 when doing the shift separately. According to
the cost tables this might actually be worse on exynosm1 as it has a cost for
any indexing.

> I'll note for the record that you cannot hope to solve this with
> the legitimize_address hook alone for the simple reason that it's not
> called for legitimate addresses, of which (base + index * 2) is
> a member.  The hook is only being called for illegitimate addresses.

Would it be possible to disallow expensive addresses initially, let CSE do its
thing and then merge addresses with 1 or 2 uses back into loads/stores?

> To include legitimate addresses, you'd have to force out the address
> components somewhere else.  Perhaps in the mov expanders, since that's
> one of the very few places mem's are allowed.  You'd want to do this
> only if !cse_not_expected.

And presumably only for addresses we would prefer to be CSEd, such as expensive
shifts or indexing.

> OTOH, it's also the sort of thing that one would hope that CSE itself
> would be able to handle.  Looking across various addresses, computing
> sums of costs, and breaking out subexpressions as necessary.

Yes that would be the ideal, but one can dream...

Did you post your patch btw? We should go ahead with that (with Jiong's minor
modification) as it looks significantly better overall.

[Bug middle-end/70801] New: IRA caller-saves does not support rematerialization

2016-04-26 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70801

Bug ID: 70801
   Summary: IRA caller-saves does not support rematerialization
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

GCC emits the same code for caller-saves in all cases, even if the caller-save
is an immediate which can be trivially rematerialized. The caller-save code
should support rematerialization directly by omitting saves and emitting the
init value for restores. This allows the cost of rematerializable caller-saves
to be lowered and avoids a stack slot. 

Alternatively given the spill code already supports rematerialization, it may
be possible to stop immediates from being caller-saved by increasing their
caller-save cost significantly.

void g(void);
float bad_remat(float x)
{
  x += 3.0f;
  g();
  x *= 3.0f;
  x *= (3.0f + x);
  g();
  x *= 3.0f;
  x *= (3.0f + x);
  return x;
}

AArch64 gives with -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10
-ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 -ffixed-d15:

fmov s1, 3.0e+0
str x30, [sp, -32]!
fadd s0, s0, s1
stp s0, s1, [sp, 24]
bl  g
ldp s0, s1, [sp, 24]
fmul s0, s0, s1
fadd s2, s0, s1
fmul s0, s0, s2
str s0, [sp, 24]
bl  g
ldp s0, s1, [sp, 24]
ldr x30, [sp], 32
fmul s0, s0, s1
fadd s1, s0, s1
fmul s0, s0, s1
ret


x86_64 with plain -O2:
subq    $24, %rsp
movss   .LC0(%rip), %xmm1
addss   %xmm1, %xmm0
movss   %xmm1, 12(%rsp)
movss   %xmm0, 8(%rsp)
call    g
movss   8(%rsp), %xmm0
movaps  %xmm0, %xmm2
movss   12(%rsp), %xmm1
mulss   %xmm1, %xmm2
movaps  %xmm2, %xmm0
addss   %xmm1, %xmm0
mulss   %xmm2, %xmm0
movss   %xmm0, 8(%rsp)
call    g
movss   12(%rsp), %xmm1
movss   8(%rsp), %xmm0
addq    $24, %rsp
mulss   %xmm1, %xmm0
addss   %xmm0, %xmm1
mulss   %xmm1, %xmm0
ret

[Bug middle-end/70802] New: IRA memory cost calculation incorrect for immediates

2016-04-26 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70802

Bug ID: 70802
   Summary: IRA memory cost calculation incorrect for immediates
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The following code in ira-costs.c tries to improve the memory cost for
rematerializeable loads. There are several issues with this though:

1. The memory cost can become negative, forcing a spill, which is known to
cause incorrect code (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242)
2. The code only handles a subset of immediate loads, not all rematerializable
values
3. The cost adjustment is not sufficient to make better decisions between
allocating an immediate to a callee-save register and spill a variable, or
allocate the variable and rematerialize the immediate

As an example of (3), if there is only one callee-save register free, IRA will
use it to allocate the immediate rather than the variable:

float bad_alloc(float x)
{
  x += 3.0f;
  g();
  x *= 3.0f;
  return x;
}

With -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10 -ffixed-d11
-ffixed-d12 -ffixed-d13 -ffixed-d14:

str x30, [sp, -32]!
str d15, [sp, 8]
fmov s15, 3.0e+0
fadd s0, s0, s15
str s0, [sp, 28]
bl  g
ldr s0, [sp, 28]
fmul s0, s0, s15
ldr d15, [sp, 8]
ldr x30, [sp], 32
ret

  a0(r76,l0) costs: CALLER_SAVE_REGS:15000,15000 GENERAL_REGS:15000,15000
FP_REGS:0,0 ALL_REGS:15000,15000 MEM:12000,12000
  a1(r73,l0) costs: CALLER_SAVE_REGS:1,1 GENERAL_REGS:1,1
FP_REGS:0,0 ALL_REGS:1,1 MEM:8000,8000

The immediate value r76 is counted as 1 def and 2 uses, so a memory cost of
12000, while r73 has 1 def and 1 use, so a memory cost of 8000. However the
worst-case rematerialization cost of r76 would be 2 moves, one of which already
exists of course, so the memory cost should have been 4000...

ira-costs.c, ~line 1458:

  if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
  && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
  && ((MEM_P (XEXP (note, 0))
   && !side_effects_p (SET_SRC (set)))
  || (CONSTANT_P (XEXP (note, 0))
  && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
XEXP (note, 0))
  && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
  && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set))))
{
  enum reg_class cl = GENERAL_REGS;
  rtx reg = SET_DEST (set);
  int num = COST_INDEX (REGNO (reg));

  COSTS (costs, num)->mem_cost
-= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
  record_address_regs (GET_MODE (SET_SRC (set)),
   MEM_ADDR_SPACE (SET_SRC (set)),
   XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
   frequency * 2);
  counted_mem = true;
}

[Bug middle-end/70861] New: Improve code generation of switch tables

2016-04-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861

Bug ID: 70861
   Summary: Improve code generation of switch tables
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

GCC uses a very basic check to determine whether to use a switch table. A
simple example from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 still
generates a huge table with mostly default entries with -O2:

int i;
int func(int a)
{
  switch(a)
  {
case 0:   i = 20; break;
case 1:   i = 50; break;
case 2:   i = 29; break;
case 3:   i = 20; break;
case 4:   i = 50; break;
case 5:   i = 29; break;
case 6:   i = 20; break;
case 7:   i = 50; break;
case 8:   i = 29; break;
case 9:   i = 79; break;
case 110: i = 27; break;
default:  i = 77; break;
  }
  return i;
}

This shows several issues:

1. The density calculation is not adjustable depending on the expected size of
switch table entries (which depends on the target).
2. A table may contain not only 90% default entries, but they can be
consecutive as well. To avoid this the maximum number of default cases should
be limited to say 3x the average gap between real cases.
3. There is no reduction in minimum required density for larger switches - the
wastage becomes quite significant for larger switches and targets that use 4
bytes per table entry.
4. Dense regions and outlier values are not split off and handled separately (a
sketch of such splitting follows below).
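
A hand-written sketch (not GCC output) of that kind of splitting for the
example above: the outlier case 110 is peeled off as a compare and only the
dense 0..9 range goes through a small table.

int func_split (int a)
{
  static const unsigned char dense[10] = { 20, 50, 29, 20, 50, 29, 20, 50, 29, 79 };

  if (a == 110)
    i = 27;
  else if ((unsigned) a < 10)
    i = dense[a];
  else
    i = 77;
  return i;
}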

[Bug middle-end/70861] Improve code generation of switch tables

2016-04-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861

--- Comment #3 from Wilco  ---
(In reply to Andrew Pinski from comment #2)
> Note I think if we had gotos instead of assignment here we should do the
> similar thing for the switch table itself.

Absolutely, that was my point.

> Note also the assignment to i is getting in the way for the switch to
> constant table form.

In SSA form you can sink the assignments into a shared block, so it should be
able to deal with an identical assignment to a global, pointer, array or even
returning directly. Also it's common that the default or a few other cases are
doing something different, so those need to be pulled out separately.

However that's icing on the cake - the key issue is that GCC doesn't even do
the basic stuff yet. Most compilers are able to split off cases with high
profile counts, dense switch tables, ranges that all go to the same block etc.

[Bug rtl-optimization/70946] New: Bad interaction between IVOpt and loop unrolling

2016-05-04 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70946

Bug ID: 70946
   Summary: Bad interaction between IVOpt and loop unrolling
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

IVOpt chooses between using indexing for induction variables or incrementing
pointers. Due to the way loop unrolling works, a decision that is optimal if
unrolling is disabled may become very non-optimal with unrolling.

Below are simple examples that show how the choice to use indexing can become a
very bad idea when unrolled, while using offset addressing leads to very decent
code. To improve this we either need to teach IVOpt about unrolling (e.g.
prioritise base+offset addressing) or add a tree unroller that unrolls small
inner loops before IVOpt.


void loop1 (int *p, int *q, int i)
{
   for (i = 0; i < 1000; i++) p[i] = q[i] + 1;
}

void loop2 (int *p, int i)
{
   for (i = 0; i < 1000; i++) p[i] = p[i] + 1;
}

On AArch64 with -O2 -funroll-loops this gives:

loop1:
mov x2, 0
.p2align 2
.L41:
ldr w4, [x1, x2]
add x3, x2, 4
add x10, x2, 8
add x9, x2, 12
add w5, w4, 1
str w5, [x0, x2]
add x8, x2, 16
add x7, x2, 20
add x6, x2, 24
add x11, x2, 28
ldr w12, [x1, x3]
add x2, x2, 32
cmp x2, 4000
add w13, w12, 1
str w13, [x0, x3]
ldr w14, [x1, x10]
add w15, w14, 1
str w15, [x0, x10]
ldr w16, [x1, x9]
add w17, w16, 1
str w17, [x0, x9]
ldr w18, [x1, x8]
add w4, w18, 1
str w4, [x0, x8]
ldr w3, [x1, x7]
add w10, w3, 1
str w10, [x0, x7]
ldr w9, [x1, x6]
add w5, w9, 1
str w5, [x0, x6]
ldr w8, [x1, x11]
add w7, w8, 1
str w7, [x0, x11]
bne .L41
ret
loop2:
add x6, x0, 4000
.p2align 2
.L51:
mov x1, x0
ldr w2, [x0]
add x0, x0, 32
add w3, w2, 1
cmp x0, x6
str w3, [x1], 4
ldr w4, [x0, -28]
add w5, w4, 1
str w5, [x0, -28]
ldr w7, [x1, 4]
add w8, w7, 1
str w8, [x1, 4]
ldp w9, w10, [x0, -20]
ldp w11, w12, [x0, -12]
add w14, w9, 1
ldr w13, [x0, -4]
add w15, w10, 1
add w16, w11, 1
add w17, w12, 1
add w18, w13, 1
stp w14, w15, [x0, -20]
stp w16, w17, [x0, -12]
str w18, [x0, -4]
bne .L51
ret

[Bug rtl-optimization/70946] Bad interaction between IVOpt and loop unrolling

2016-05-04 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70946

--- Comment #1 from Wilco  ---
PR36712 seems related to this

[Bug rtl-optimization/70961] New: Regrename ignores preferred_rename_class

2016-05-05 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961

Bug ID: 70961
   Summary: Regrename ignores preferred_rename_class
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

When deciding which register to use regrename.c calls the target function
preferred_rename_class. However in pass 2 in find_rename_reg it then just
ignores this preference. This results in significantly increased codesize on
targets which prefer a subset of allocatable registers in order to use smaller
instructions.

Also the computed super_class appears to be the union of all uses and defs
instead of the intersection. This should be the intersection as that is the set
of registers that all uses and defs support. 

If the preferred class doesn't result in a valid rename then it could search a
wider class, but then it would need to check that the size of the newly
selected patterns does not increase.

[Bug rtl-optimization/70961] Regrename ignores preferred_rename_class

2016-05-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961

--- Comment #3 from Wilco  ---
(In reply to Eric Botcazou from comment #2)

> Pass #2 ignores it since the preference simply couldn't be honored.

In which case it should not rename that chain rather than just ignore the
preference (and a preference of NO_REGS should probably also block renaming).

> > The super_class has nothing to do with the class that is searched for
> > renaming registers though, it's just the info passed to the back-end to
> > compute this class.  For example, on the ARM, the preferred_class will be
> > LO_REGS for any chain of GENERAL_REGS.

Actually for a chain of GENERAL_REGS the preferred class should be GENERAL_REGS
- using LO_REGS in that case would be counterproductive. You'd prefer LO_REGS
only when *some* of the patterns in a chain are LO_REGS. But since the callback
only ever sees the union of all classes in a chain, I don't think could make
the correct decision.

> As a matter of fact, that might not be true, something like
>
> +  if (TARGET_THUMB2 && reg_class_subset_p (rclass, GENERAL_REGS))
>  return LO_REGS;

I tried something similar, as well as always returning LO_REGS or NO_REGS, but
this makes no difference at all due to pass 2 ignoring the preference...

[Bug rtl-optimization/70961] Regrename ignores preferred_rename_class

2016-05-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961

--- Comment #5 from Wilco  ---
As for a simple example, Proc_4 in Dhrystone is a good one. With -O2 and
-fno-rename-registers I get the following on Thumb-2:

00c8 <Proc_4>:
  c8:   b430        push    {r4, r5}
  ca:   f240 0300   movw    r3, #0
  ce:   f240 0400   movw    r4, #0
  d2:   f2c0 0300   movt    r3, #0
  d6:   f2c0 0400   movt    r4, #0
  da:   f240 0100   movw    r1, #0
  de:   681a        ldr     r2, [r3, #0]
  e0:   f2c0 0100   movt    r1, #0
  e4:   7824        ldrb    r4, [r4, #0]
  e6:   2542        movs    r5, #66 ; 0x42
  e8:   700d        strb    r5, [r1, #0]
  ea:   2c41        cmp     r4, #65 ; 0x41
  ec:   bf08        it      eq
  ee:   f042 0201   orreq.w r2, r2, #1
  f2:   601a        str     r2, [r3, #0]
  f4:   bc30        pop     {r4, r5}
  f6:   4770        bx      lr

With -frename-registers:

00c8 <Proc_4>:
  c8:   b430        push    {r4, r5}
  ca:   f240 0300   movw    r3, #0
  ce:   f240 0400   movw    r4, #0
  d2:   f2c0 0300   movt    r3, #0
  d6:   f2c0 0400   movt    r4, #0
  da:   f240 0100   movw    r1, #0
  de:   681a        ldr     r2, [r3, #0]
  e0:   f2c0 0100   movt    r1, #0
  e4:   f894 c000   ldrb.w  ip, [r4]
  e8:   2542        movs    r5, #66 ; 0x42
  ea:   700d        strb    r5, [r1, #0]
  ec:   f1bc 0f41   cmp.w   ip, #65 ; 0x41
  f0:   bf08        it      eq
  f2:   f042 0201   orreq.w r2, r2, #1
  f6:   601a        str     r2, [r3, #0]
  f8:   bc30        pop     {r4, r5}
  fa:   4770        bx      lr


So here it changed 2 16-bit instructions into 32-bit ones by changing R4 to IP.
However there is no benefit in doing so as it doesn't remove a move or create
additional scheduling opportunities (the first live range of R4 has only a
single use, so when it goes dead it is actually best to reuse R4 rather than
use a different register).

[Bug rtl-optimization/71022] New: GCC prefers register moves over move immediate

2016-05-09 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71022

Bug ID: 71022
   Summary: GCC prefers register moves over move immediate
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

When assigning the same immediate value to different registers, GCC will always
CSE the immediate and emit a register move for subsequent uses. This creates
unnecessary register dependencies and increases latency. When the cost of an
immediate move is the same as a register move (which should be true for most
targets), it should prefer the former. A register move is better only when the
immediate requires multiple instructions or is larger with -Os.

It's not obvious where this is best done. The various cprop phases before IRA
do the right thing, but cse2 (which runs later) then undoes it. And
cprop_hardreg doesn't appear to be able to deal with immediates.

int f1(int x)
{
  int y = 1, z = 1;
  while (x--)
{
  y += z;
  z += x;
}
  return y + z;
}

void g(float, float);
void f2(void) { g(1.0, 1.0); g(3.3, 3.3); }

On AArch64 I get:

f1:
sub w1, w0, #1
cbz w0, .L12
mov w0, 1
mov w2, w0 *** mov w2, 1
.p2align 2
.L11:
add w2, w2, w0
add w0, w0, w1
sub w1, w1, #1
cmn w1, #1
bne .L11
add w0, w2, w0
ret
.L12:
mov w0, 2
ret

f2:
fmov s1, 1.0e+0
str x30, [sp, -16]!
fmov s0, s1    *** fmov s0, 1.0
bl  g
adrp x0, .LC1
ldr x30, [sp], 16
ldr s1, [x0, #:lo12:.LC1]
fmov s0, s1    *** ldr s0, [x0, #:lo12:.LC1]
b   g

[Bug tree-optimization/71026] New: Missing division optimizations

2016-05-09 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026

Bug ID: 71026
   Summary: Missing division optimizations
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

With -Ofast GCC doesn't reassociate constant multiplies or negates away from
divisors to allow for more reciprocal division optimizations. It is also
possible to avoid divisions (or multiplies) involving immediates in comparisons
that check for a positive/negative result.

float f1(float x, float y) { return x / (y * y); }// -> x * (1/y) * (1/y)
float f2(float x, float y) { return x / (y * 3.0f); } // -> (x/3) / y
float f3(float x, float y) { return x / -y; } // -> (-x) / y
int f4(float x) { return (1.0f / x) < 0.0f; } // -> x < 0.0f
int f5(float x) { return (x / 2.0f) <= 0.0f; }// -> x <= 0.0f

A quick experiment shows the first transformation could remove almost 100
divisions from SPEC2006, and the second one around 50. The first transformation is only useful if
there is at least one other division by y, so likely best done in the division
reciprocal optimization phase.
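
For clarity, hand-written versions of the transformed forms listed above
(these are only valid under the -Ofast/-ffast-math assumptions, i.e. with
reassociation and reciprocal math allowed):

float f1_opt (float x, float y) { float r = 1.0f / y; return (x * r) * r; }
float f2_opt (float x, float y) { return (x / 3.0f) / y; }
float f3_opt (float x, float y) { return (-x) / y; }
int   f4_opt (float x) { return x < 0.0f; }
int   f5_opt (float x) { return x <= 0.0f; }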

[Bug rtl-optimization/71022] GCC prefers register moves over move immediate

2016-05-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71022

--- Comment #2 from Wilco  ---
(In reply to Richard Biener from comment #1)
> IRA might choose to do this as part of life-range splitting/shortening.  Note
> that reg-reg moves may be cheaper code-size wise (like on CISC archs with
> non-fixed insn lengths).  RTL cost should tell you, of course.

The cases I spotted are typically conflicting liveranges that just happen to
share the same initialization value (0 being the most common). This
transformation can be done in a generic phase using RTL costs; the moves even
have the correct regnotes.

[Bug middle-end/71443] New: [7 regression] test case gcc.dg/plugin/must-tail-call-2.c reports error

2016-06-07 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71443

Bug ID: 71443
   Summary: [7 regression] test case
gcc.dg/plugin/must-tail-call-2.c reports error
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

There are 2 new failures in the tail-call-2.c test on recent trunk builds:
FAIL: gcc.dg/plugin/must-tail-call-2.c -fplugin=./must_tail_call_plugin.so 
(test for errors, line 32)
FAIL: gcc.dg/plugin/must-tail-call-2.c -fplugin=./must_tail_call_plugin.so
(test for excess errors)

The error reported is:
/home/wdijkstr/gcc/gcc/testsuite/gcc.dg/plugin/must-tail-call-2.c: In function
'test_2_caller':^M
/home/wdijkstr/gcc/gcc/testsuite/gcc.dg/plugin/must-tail-call-2.c:32:10: error:
cannot tail-call: argument must be passed by copying^M

The argument is a large structure which is passed by reference on AArch64. So
it looks like the test needs to allow for this possibility. This appears
related to PR71293 for Power.

[Bug target/71951] libgcc_s built with -fomit-frame-pointer on aarch64 is broken

2017-07-27 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71951

--- Comment #11 from Wilco  ---
(In reply to Icenowy Zheng from comment #10)
> In my environment (glibc 2.25, and both the building scripts of glibc and
> gcc have -fomit-frame-pointer automatically enabled), this bug is not fully
> resolved yet.
> 
> With GCC upgraded to 6.4.0, GDB debugger started to work correctly. (With
> GCC 6.3.0 GDB cannot even work and segfault at unwind code in libgcc).
> 
> However, if I still build GCC with -fomit-frame-pointer in CFLAGS, the
> backtrace() function of glibc cannot work, and segfault at line 240 of
> libgcc/unwind-dw2.c .
> 
> By reading the source code, I think the unwind code is still trying to get
> CFA from the register x29, and when debugging I found that the x29 register
> in the unwind context is 0 (because of -fomit-frame-pointer), so line 240 is
> dereferencing a NULL pointer, so it segfaulted.
> 
> Maybe the behavior that accessing x29 register to get CFA is not correct?

Well, if everything is built with -fomit-frame-pointer then it is definitely wrong
to read x29. Can you give more info similar to comment #3?

[Bug middle-end/78468] [8 regression] libgomp.c/reduction-10.c and many more FAIL

2017-09-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #38 from Wilco  ---
(In reply to Rainer Orth from comment #37)
> This patch
> 
> 2017-09-05  Wilco Dijkstra  
> 
>   * explow.c (get_dynamic_stack_size): Improve dynamic alignment.
> 
> brought the exact same set of failures back on sparc-sun-solaris2.11.
> 
>   Rainer

The existing alloca code relies on STACK_BOUNDARY being set correctly. Has the
value been fixed already for the OS variants mentioned? If stack alignment
can't be guaranteed by OS/runtime/prolog, STACK_BOUNDARY should be 8.

[Bug middle-end/78468] [8 regression] libgomp.c/reduction-10.c and many more FAIL

2017-09-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468

--- Comment #40 from Wilco  ---
(In reply to Eric Botcazou from comment #39)
> > The existing alloca code relies on STACK_BOUNDARY being set correctly. Has
> > the value been fixed already for the OS variants mentioned? If stack
> > alignment can't be guaranteed by OS/runtime/prolog, STACK_BOUNDARY should be
> > 8.
> 
> No, no, no, please read the ??? note I put in emit-rtl.c.  STACK_BOUNDARY is
> correct, the problem is the declared alignment of the virtual registers.

If you cannot guarantee the alignment of the pointers to STACK_BOUNDARY then
STACK_BOUNDARY is incorrect. GCC uses the STACK_BOUNDARY guarantee in
optimizations so it is essential to get this right if you want correct code
generation.

[Bug middle-end/77484] [6/7 Regression] Static branch predictor causes ~6-8% regression of SPEC2000 GAP

2017-01-16 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484

--- Comment #29 from Wilco  ---
(In reply to Jan Hubicka from comment #28)
> > On SPEC2000 the latest changes look good, compared to the old predictor gap
> > improved by 10% and INT/FP by 0.8%/0.6%. I'll run SPEC2006 tonight.
> 
> It is rather surprising you are seeing such large changes for one branch
> predictor
> change.  Is most of it really comming just from the bb-reorder changes? On
> x86 the
> effect is mostly within noise and on Itanium Gap improve by 2-3%.
> It may be interesting to experiment with reorderin and prediction more on
> this target.

When I looked at gap at the time, the main change was the reordering of a few
if statements in several hot functions. Incorrect block frequencies also change
register allocation in a bad way, but I didn't notice anything obvious in gap.
And many optimizations are being disabled on blocks with an incorrect frequency
- this happens all over the place and is the issue causing the huge Coremark
regression.

I could do some experiments but I believe the key underlying problem is that
GCC treats the block frequencies as accurate when they are really very vague
estimates (often incorrect) and so should only be used to break ties.

In fact I would claim that even modelling if-statements as a balanced 50/50 is
incorrect. It suggests that a block that is guarded by multiple if-statements
handling exceptional cases is much less important than the very same block that
isn't, even if they are both always executed. Without profile data providing
actual frequencies we should not optimize the outer block for speed and the
inner block for size.

[Bug middle-end/77484] [6/7 Regression] Static branch predictor causes ~6-8% regression of SPEC2000 GAP

2017-01-16 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484

--- Comment #31 from Wilco  ---
(In reply to Jan Hubicka from comment #30)
> > 
> > When I looked at gap at the time, the main change was the reordering of a 
> > few
> > if statements in several hot functions. Incorrect block frequencies also 
> > change
> > register allocation in a bad way, but I didn't notice anything obvious in 
> > gap.
> > And many optimizations are being disabled on blocks with an incorrect 
> > frequency
> > - this happens all over the place and is the issue causing the huge Coremark
> > regression.
> 
> This is the issue with jump threading code no longer sanely updating profile,
> right?  I will try to find time to look into it this week.

I don't know the exact details but James proved that the blocks are incorrectly
assumed cold so part of the optimization doesn't trigger as expected. I'm not
sure whether that is because the frequencies got too low, set incorrectly or
not set at all. 

> > I could do some experiments but I believe the key underlying problem is that
> > GCC treats the block frequencies as accurate when they are really very vague
> > estimates (often incorrect) and so should only be used to break ties.
> > 
> > In fact I would claim that even modelling if-statements as a balanced 50/50 
> > is
> > incorrect. It suggests that a block that is guarded by multiple 
> > if-statements
> > handling exceptional cases is much less important than the very same block 
> > that
> > isn't, even if they are both always executed. Without profile data providing
> > actual frequencies we should not optimize the outer block for speed and the
> > inner block for size.
> 
> There are --param options to control this. They was originally tuned based on
> Spec2000 and x86_64 scores (in GCC 3.x timeframe). if you can get resonable
> data that they are not working very well anymore (or for ARM), we could try
> to
> tune them better.
> 
> I have WIP patches to get the propagation bit more fine grained and
> propagate i.e.
> info if BB is reachable only bo known to be cold path (such that one that has
> EH edge on it). This may make the logic bit more reliable.

I'll have a look, but I think the key is to think in terms of block importance
(from cold to hot). Apart from highly skewed cases (eg. exception edges or
loops), most blocks should be equally important to optimize.

[Bug target/71951] libgcc_s built with -fomit-frame-pointer on aarch64 is broken

2017-04-13 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71951

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #8 from Wilco  ---
Likely related to PR77455 which was backported to GCC6. Unwinding most
definitely doesn't require a frame pointer (on AArch64 the frame pointer is
completely unused unless you use eg. alloca).

[Bug target/81357] Extra mov for zero extend of add

2017-09-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81357

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #7 from Wilco  ---
(In reply to Qing Zhao from comment #5)

> Not sure why when -fschedule-insns is ON, the destination of the add insns
> becomes W1?

add w1, w0, 1
mov w0, w1
uxtw x1, w1

That's because the move assigns the return value which overwrites w0. However
the uxtw still needs to read the result of the addition. So the register
allocator is forced to use w1 as w0 cannot be used.

I don't think there is an easy fix for this example. The compiler believes
there are 2 distinct values so it uses 2 registers irrespectively of the order
of the mov and uxtw.

[Bug target/82439] Missing (x | y) == x simplifications

2017-10-05 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82439

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> "(x | y) == x" is simpler than "(y & ~x) == 0" on the tree level. 2 gimple
> vs 3.
> So converting it on the gimple level is not going to happen.

Indeed, so not match.pd, but maybe target.pd could work. There is no BIC
tree/RTL operator, nor is there a target-independent way to ask whether a
target supports BIC.
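
For reference, the two equivalent forms written as plain C (on a target with
an and-not instruction such as BIC, the second form needs only one ALU
operation plus a compare against zero):

int cmp_or  (unsigned x, unsigned y) { return (x | y) == x; }
int cmp_bic (unsigned x, unsigned y) { return (y & ~x) == 0; }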

[Bug middle-end/82479] missing popcount builtin detection

2017-10-09 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #6 from Wilco  ---
(In reply to Andrew Pinski from comment #5)
> Was added to LLVM back in 2012:
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20121119/156272.html
> 
> Again I don't know how useful it is compared to the compile time that it
> would take.

I'd be more worried that the LLVM sequence isn't better in common usage. Given
it has a fixed high latency, you need quite a few set bits before it's faster.
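
For reference, the loop idiom being discussed is the classic popcount loop -
a sketch only, illustrating why a fixed-latency replacement only wins once
enough bits are set (the loop runs one iteration per set bit):

int
popcount_loop (unsigned long x)
{
  int count = 0;
  while (x)
    {
      x &= x - 1;   /* Clear the lowest set bit.  */
      count++;
    }
  return count;
}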

[Bug middle-end/78809] Inline strcmp with small constant strings

2017-10-13 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #4 from Wilco  ---
(In reply to Qing Zhao from comment #3)
> with the latest upstream GCC, we got the following assembly for the testing
> case with -O2 on aarch64:
> 
> t1:
>   adrp x1, .LC0
>   add x1, x1, :lo12:.LC0
>   b   strcmp
>   
> t2:
>   adrp x1, .LC0
>   add x1, x1, :lo12:.LC0
>   b   strcmp

Yes the inlining was recently removed from GLIBC since the goal is to inline in
GCC.

[Bug middle-end/78809] Inline strcmp with small constant strings

2017-10-23 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #8 from Wilco  ---

> /home/qinzhao/Install/latest/bin/gcc -O2 t_p_1.c t_p.c
> non-inlined version
> 20.84user 0.00system 0:20.83elapsed 100%CPU (0avgtext+0avgdata
> 360maxresident)k
> 0inputs+0outputs (0major+135minor)pagefaults 0swaps
> 
> From the data, we can see the inlined version of strcmp (by glibc) is much
> slower than the direct call to strcmp.  (this is for size 2)
> I am using GCC farm machine gcc116:

This result doesn't make sense - it looks like GCC is hoisting the strcmp call
out of the loop as a loop invariant in the 2nd case, so you're just measuring a
loop containing a subtract and an orr instruction...

[Bug middle-end/78809] Inline strcmp with small constant strings

2017-10-23 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #9 from Wilco  ---
(In reply to Qing Zhao from comment #7)

str(n)cmp with a constant string can be changed into memcmp if the string has a
known alignment or is an array of known size. We should check the common cases
are implemented.

Given it takes just 3 instructions per character on almost all targets, it is
reasonable to inline strcmp with a small string if it cannot be changed into
memcmp. This will be faster than calling strcmp due to the overhead of the
call, the plt redirection and alignment checks. Note also that many strcmp
implementations will do a byte-by-byte check if the strings are not aligned.
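
As a rough sketch of the expansion meant here (a hypothetical hand-written
expansion of strcmp (s, "hi") - roughly three operations per character: load,
compare, branch - not GCC's actual expansion):

int
strcmp_hi_inline (const char *s)
{
  int d;
  if ((d = (unsigned char) s[0] - 'h') != 0)
    return d;
  if ((d = (unsigned char) s[1] - 'i') != 0)
    return d;
  return (unsigned char) s[2];   /* Compare against the terminating '\0'.  */
}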

[Bug middle-end/78809] Inline strcmp with small constant strings

2017-10-24 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #16 from Wilco  ---
(In reply to Qing Zhao from comment #15)
> (In reply to Wilco from comment 14)
> > The only reason we have to do a character by character comparison is 
> > because we
> > cannot read beyond the end of a string. However when we know the size or
> > alignment we can safely process a string one word at a time.
> 
> is it possible that “NULL_terminator” is in the middle of the string even
> though we 
> know the size or alignment? for example:
> 
> const char s[8] = “abcd\0abc”;  // null byte in the middle of the string
> int f2(void) { return __builtin_strcmp(s, "abc") != 0; }
> int f3(void) { return __builtin_strcmp(s, “abc”); }
> 
> can either of the above f2 or f3 been optimized to memcmp? seems not.

You never get to the null byte as the memcmp only compares strlen("abc") + 1
characters. However do you mean an input string which is shorter than the
constant string? That's fine as this will compare not-equal in the memcmp.

[Bug middle-end/78809] Inline strcmp with small constant strings

2017-10-24 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809

--- Comment #18 from Wilco  ---
(In reply to Qing Zhao from comment #17)
> (In reply to Wilco from comment #16)
> 
> >> const char s[8] = “abcd\0abc”;  // null byte in the middle of the string
> >> int f2(void) { return __builtin_strcmp(s, "abc") != 0; }
> >> int f3(void) { return __builtin_strcmp(s, “abc”); }
> >> 
> >> can either of the above f2 or f3 been optimized to memcmp? seems not.
> > 
> > You never get that to the null byte as the memcmp only compares 
> > strlen("abc"+1)
> > characters.
> 
> Yes, this is correct for memcmp, but not for strcmp.  strcmp will get to the
> null byte. 
> as a result,  
> 
> const char s[8] = “abcd\0abc”;
> strcmp (s, “abc”) != 0.   // s = “abcd", which is != “abc"
> strncmp (s, “abc”, 3) == 0
> memcmp(s, “abc”, 3) == 0
> 
> So, strcmp cannot optimized to memcmp 

No that should be:

strcmp (s, “abc”) != 0
strncmp (s, “abc”, 4) != 0
memcmp(s, “abc”, 4) != 0

You need to compare the null terminator as well.

> > However do you mean an input string which is shorter than the
> > constant string? That's fine as this will compare not-equal in the memcmp.
> 
> for the input string is shorter than the constant string, for example: 
> 
> const char s[8] = “ab\0\0abcd”;
> strcmp (s, “abc”) != 0
> strncmp (s, “abc”, 3) != 0
> memcmp (s, “abc”,3) != 0
> 
> In a summary, since it’s NOT easy for the compiler to know where is the
> “Null_terminator” 
> in the string, strcmp is NOT reasonable to be optimized to memcmp whenever
> its result is 
> used to compare with zero or not.

The compiler knows where the null terminator is in the constant string so it
can easily figure out when it is legal as well as faster than doing a byte by
byte expansion of strcmp.

strcmp (s, STR) -> memcmp (s, STR, strlen (STR) + 1) iff max(sizeof_array(s),
alignment(s)) > strlen (STR).
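
As a concrete (illustrative) example using the array from earlier in this
thread - the array size 8 exceeds strlen ("abc"), so reading 4 bytes is safe:

#include <string.h>

const char s[8] = "abcd\0abc";

int via_strcmp (void) { return strcmp (s, "abc") != 0; }
/* Equivalent: compare strlen ("abc") + 1 = 4 bytes, which includes the
   constant string's null terminator.  */
int via_memcmp (void) { return memcmp (s, "abc", 4) != 0; }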

> But for strncmp, if the result is to compare with zero, it might be
> reasonable to optimized it
> to the corresponding memcmp, i.e
> 
> strncmp (s, “abc”, 3) != 0
> 
> could be optimized to
> 
> memcmp (s, “abc”, 3) != 0

If the strncmp size is smaller than the strlen of the constant string (and
alignment is right), yes. But strncmp (s, "abc", C) is equivalent to strcmp (s,
"abc") if C >= 4.

[Bug target/77455] New: [AArch64] eh_return implementation fails

2016-09-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77455

Bug ID: 77455
   Summary: [AArch64] eh_return implementation fails
   Product: gcc
   Version: 4.8.4
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The __builtin_eh_return implementation on AArch64 generates incorrect code for
many cases due to using an incorrect offset/pointer when writing the new return
address to the stack. Also optimizations may remove the write due to a missing
scheduling barrier. As a result in most cases eh_return does not work properly.

[Bug target/77455] [AArch64] eh_return implementation fails

2016-09-02 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77455

Wilco  changed:

   What|Removed |Added

 Target||AArch64
  Known to fail||4.8.4

--- Comment #1 from Wilco  ---
See https://gcc.gnu.org/ml/gcc-patches/2016-09/msg00077.html for more details,
a simpler reimplementation and a testcase.

[Bug tree-optimization/66946] Spurious uninitialized warning

2016-09-05 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66946

Wilco  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |INVALID

--- Comment #6 from Wilco  ---
(In reply to Manuel López-Ibáñez from comment #5)
> Still valid?

No, I can't trigger it anymore, so marking as resolved.

[Bug middle-end/77484] New: Static branch predictor causes ~6-8% regression of SPEC2000 GAP

2016-09-05 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484

Bug ID: 77484
   Summary: Static branch predictor causes ~6-8% regression of
SPEC2000 GAP
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

Changes in the static branch predictor (around August last year) caused
regressions on SPEC2000. The PRED_CALL predictor causes GAP to regress by 6-8%
on AArch64, and this hasn't been fixed on trunk. With this predictor turned
off, INT is 0.6% faster and FP 0.4%.

The reason is that the predictor causes calls that are guarded by if-statements
to be placed at the end of the function. For Gap this is bad as it often
executes several such statements in a row, resulting in 2 extra taken branches
and additional I-cache misses per if-statement. So it seems that on average
this prediction makes things worse.

Overall the static prediction and -freorder-blocks provide a benefit. However
does the gain of each static prediction being correct outweigh the cost of the
prediction being incorrect? Has this been measured for each of the static
predictors across multiple targets?
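
A minimal sketch of the code shape described above (hypothetical functions
g1-g3; this only illustrates the pattern, it is not code from gap): each
guarded call is predicted not-taken, so its call block is placed at the end
of the function, costing two extra taken branches per if-statement.

extern void g1 (void), g2 (void), g3 (void);

void
guarded_calls (int a, int b, int c)
{
  if (a) g1 ();
  if (b) g2 ();
  if (c) g3 ();
}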

[Bug tree-optimization/65068] Improve rewriting for address type induction variables in IVOPT

2016-09-08 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65068

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #3 from Wilco  ---
Note gcc.target/aarch64/ldp_vec_64_1.c shows the same issue. Same story there,
most cores produce inefficient code, only -mcpu=vulcan gets close:

foo:
add x1, x1, 8
add x2, x0, 8192
.p2align 3
.L2:
ldr d0, [x1, -8]
ldr d1, [x1], 16
add v0.2s, v0.2s, v1.2s
str d0, [x0], 16
cmp x2, x0
bne .L2
ret

That still shows an odd pre-increment in the loop header, which blocks the use
of ldp.

Also in all cases aarch64_legitimize_address is called with offset 32760 on
V2SI - this offset does not occur anywhere in the source, so it should not
matter how we split it. However what we do for that case affects IVOpt, which
is clearly a bug.

[Bug middle-end/77568] New: [7 regression] CSE/PRE/Hoisting blocks common instruction contractions

2016-09-12 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568

Bug ID: 77568
   Summary: [7 regression] CSE/PRE/Hoisting blocks common
instruction contractions
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The recently introduced code hoisting aggressively moves common subexpressions
that might otherwise be mergeable with other operations. This caused a large
regression in one benchmark. A simple reduced test shows the issue:

float f(float x, float y, float z, int a)
{
   if (a > 100)
 x += y * z;
   else
 x -= y * z;
   return x;
}

This now produces on AArch64:

f:
fmul    s2, s1, s2
cmp     w0, 100
fadd    s1, s0, s2
fsub    s0, s0, s2
fcsel   s0, s0, s1, le
ret
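
For reference, a source-level sketch of the contracted form the backend can
reach when the multiply stays in each arm (written with __builtin_fmaf; this
mirrors the expected fmadd/fmsub, it is not a GCC dump):

float f_contracted (float x, float y, float z, int a)
{
  if (a > 100)
    return __builtin_fmaf (y, z, x);    /* fmadd: x + y * z */
  else
    return __builtin_fmaf (-y, z, x);   /* fmsub: x - y * z */
}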

Note the issue is not limited to hoisting, CSE/PRE cause similar issues:

void g(int, int);
int f2(int x)
{
  g(x, x+1);
  g(x, x+1);
  return x+1;
}

f2:
stp x29, x30, [sp, -32]!
add x29, sp, 0
stp x19, x20, [sp, 16]
add w19, w0, 1
mov w20, w0
mov w1, w19
bl  g
mov w1, w19
mov w0, w20
bl  g
mov w0, w19
ldp x19, x20, [sp, 16]
ldp x29, x30, [sp], 32
ret

Given x+1 is used as a function argument, there is no benefit in making it
available as a CSE after each call - repeating the addition is cheaper than
using an extra callee-save and copying it several times.

This shows a similar issue for bit tests. Most targets support ANDS or bit test
as a single instruction (or even bit test+branch), so CSEing the (x & C)
actually makes things worse:

void f3(char *p, int x)
{
  if (x & 1) p[0] = 0;
  if (x & 2) p[1] = 0;
  if (x & 4) p[2] = 0;
  if (x & 8) p[2] = 0;
  g(0,0);
  if (x & 1) p[3] = 0;
  if (x & 2) p[4] = 0;
  if (x & 4) p[5] = 0;
  if (x & 8) p[6] = 0;
}

This uses 4 callee-saves to hold the (x & C) CSEs. Doing several such bit tests
in a more complex function means you quickly run out of registers...

Given it would be much harder to undo these CSEs at RTL level (combine does
most contractions but can only do it intra-block), would it be reasonable to
block CSEs for these special cases?

[Bug middle-end/77568] [7 regression] CSE/PRE/Hoisting blocks common instruction contractions

2016-09-12 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568

--- Comment #3 from Wilco  ---
(In reply to Andrew Pinski from comment #1)
> I think this is just a pass ordering issue.  We create fmas after PRE. 
> Maybe we should do it both before and after ...
> Or enhance the pass which produces FMA to walk through to another bb ...

FMAs are not created at the tree level; Expand can do simple cases, and Combine finds
other cases. Given more and more targets support FMA, there is certainly an
argument for adding an FMA tree operator.

[Bug middle-end/77568] [7 regression] CSE/PRE/Hoisting blocks common instruction contractions

2016-09-12 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568

--- Comment #5 from Wilco  ---
(In reply to Andrew Pinski from comment #2)
> Note there are two different issues here.

Well they are 3 examples of the same underlying issue - don't do a CSE when
it's not profitable. How they are resolved might be different of course.

> One for the FMA issue which can/should be solved at the tree level (and that
> is a regression).
> 
> The other is the CSE, we should be able to uncse (rematization) during
> register allocation to reduce register pressure; I think this has been filed
> already.

A more general rematerialization pass would certainly be very useful for lots
of cases. However it wouldn't improve any of my examples as they are not about
spilling (when you run out of registers it is more efficient to do more work to
recompute values rather than spill and reload them).

[Bug middle-end/77580] New: Improve devirtualization

2016-09-13 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77580

Bug ID: 77580
   Summary: Improve devirtualization
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: wdijkstr at arm dot com
  Target Milestone: ---

A commonly used benchmark contains a hot loop which calls one of 2 virtual
functions via a static variable which is set just before. A reduced example is:

int f1(int x) { return x + 1; }
int f2(int x) { return x + 2; }

static int (*virt)(int);

int f(int *p, int n, int x)
{
  int i;
  if (x > 10)
virt = f1;
  else
virt = f2;
  for (i = 0; i < n; i++)
p[i] = virt (i);
}

This is obviously very stupid code, but GCC could do better. virt is not
address-taken and neither f1 nor f2 makes a call, so virt is effectively a
local variable. So the loop could be transformed to p[i] = (x > 10) ? f1 (i) :
f2 (i); to enable inlining, after which the if can be lifted:

int f_opt(int *p, int n, int x)
{
  int i;
  if (x > 10)
virt = f1;
  else
virt = f2;
  if (x > 10)
for (i = 0; i < n; i++)
  p[i] = f1 (i);
  else
for (i = 0; i < n; i++)
  p[i] = f2 (i);
}

[Bug tree-optimization/32650] Convert p+strlen(p) to strchr(p, '\0') if profitable

2016-09-28 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32650

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
This is a codesize optimization as strchr is significantly slower than strlen -
see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61056 which changes strchr
into strlen.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Bernd Edlinger from comment #1)
> some background about this bug can be found here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html

The di3_neon pattern does not constrain the input not to overlap with
the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
does not handle partial overlaps. However it is feasible to fix that by
swapping the order for the various cases.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #4 from Wilco  ---
(In reply to Bernd Edlinger from comment #3)
> (In reply to Wilco from comment #2)
> > (In reply to Bernd Edlinger from comment #1)
> > > some background about this bug can be found here:
> > > 
> > > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> > 
> > The di3_neon pattern does not constrain the input not to overlap with
> > the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> > does not handle partial overlaps. However it is feasible to fix that by
> > swapping the order for the various cases.
> 
> from di3_neon:
> 
> if (INTVAL (operands[2]) < 1)
>   {
> emit_insn (gen_movdi (operands[0], operands[1]));
> DONE;
>   }
> 
> Will the movdi pattern work with partial overlaps?
> It does basically this:
> 
>   emit_move_insn (gen_lowpart (SImode, operands[0]),
>   gen_lowpart (SImode, operands[1]));
>   emit_move_insn (gen_highpart (SImode, operands[0]),
>   gen_highpart (SImode, operands[1]));

I think it's OK - that code only triggers if a movdi has a physical register
that is not a valid DI register which is not the case we're dealing with. movdi
has a split that does check for partial overlap around line 5900 in arm.md:

  /* Handle a partial overlap.  */
  if (rtx_equal_p (operands[0], operands[3]))
 ...

However dealing with partial overlaps is complex so maybe the best option would
be to add alternatives to di3_neon to either allow full overlap "r 0 X X
X" or no overlap "&r r X  X X". The shift code works with full overlap.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-20 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #8 from Wilco  ---
(In reply to Bernd Edlinger from comment #7)
> (In reply to Richard Earnshaw from comment #6)
> > (In reply to Bernd Edlinger from comment #5)
> > > (In reply to Wilco from comment #4)
> > > > However dealing with partial overlaps is complex so maybe the best 
> > > > option
> > > > would be to add alternatives to di3_neon to either allow full 
> > > > overlap
> > > > "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full
> > > > overlap.
> > > 
> > > That sounds like a good idea.
> > > 
> > > Then this condition in di3_neon could go away too:
> > > 
> > > && (!reg_overlap_mentioned_p (operands[0], operands[1])
> > > || REGNO (operands[0]) == REGNO (operands[1])))
> > 
> > Note that we don't want to restrict complete overlaps, only partial
> > overlaps.  Restricting complete overlaps leads to significant increase in
> > register pressure and a lot of redundant copying.
> 
> Yes.
> 
> That is Wilco's idea: instead of =r 0r X X X
> use =r 0 X X X and =&r r X X X, that should ensure that
> no partial overlap happens, just full overlap or nothing.
> 
> That's what arm_emit_coreregs_64bit_shift
> and arm_ashldi3_1bit can handle.
> 
> Who will do it?

I've got a patch that fixes it, it's being tested.

While looking at how DI mode operations get expanded, I noticed there is a CQ
issue with your shift change. Shifts that are expanded early now use extra
registers due to the DI mode write of zero. Given all other DI mode operations
are expanded after reload, it may be better to do the same for shifts too.

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-20 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #12 from Wilco  ---
It looks like we need a different approach: I've seen the extra SETs use up
more registers in some cases, and in other cases they are optimized away early
on...

Doing shift expansion at the same time as all other DI mode operations should
result in the same stack size as -mfpu=neon. However that's still well behind
Thumb-1, and I would expect ARM/Thumb-2 to beat Thumb-1 easily with 6 extra
registers.

The spill code for Thumb-2 seems incorrect:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
(plus:SI (reg/f:SI 13 sp)
(const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
 (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
(mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64]))
sha512.c:147 170 {*arm_movdi}
 (nil))

LDRD has a range of 1020 on Thumb-2 so I would expect this to be a single
instruction.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-21 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #11 from Wilco  ---
(In reply to ktkachov from comment #10)
> Confirmed then. Wilco, if you're working on this can you please assign it to
> yourself?

Unfortunately the form doesn't allow me to do anything with the headers...

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-25 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #14 from Wilco  ---
(In reply to Bernd Edlinger from comment #13)
> I am still trying to understand why thumb1 seems to outperform thumb2.
> 
> Obviously thumb1 does not have the shiftdi3 pattern,
> but even if I remove these from thumb2, the result is still
> not par with thumb2.  Apparently other patterns still produce di
> values that are not enabled with thumb1, they are 
> xordi3 and anddi3, these are often used.  Then there is
> adddi3 that is enabled in thumb1 and thumb2, I also disabled
> this one, and now the sha512 gets down to inclredible 1152
> bytes frame (-Os -march=armv7 -mthumb -float-abi=soft):
> 
> I know this is a hack, but 1K stack is what we should expect...
> 
> --- arm.md  2016-10-25 19:54:16.425736721 +0200
> +++ arm.md.orig 2016-10-17 19:46:59.0 +0200
> @@ -448,7 +448,7 @@
>   (plus:DI (match_operand:DI 1 "s_register_operand" "")
>(match_operand:DI 2 "arm_adddi_operand"  "")))
>  (clobber (reg:CC CC_REGNUM))])]
> -  "TARGET_EITHER && !TARGET_THUMB2"
> +  "TARGET_EITHER"

So you're actually turning these instructions off for Thumb-2? What does it
do instead then? Does the number of instructions go down?

I noticed that with or without -mfpu=neon, using -marm is significantly smaller
than -mthumb. Most of the extra instructions appear to be moves, which means
something is wrong (I would expect Thumb-2 to do better as it supports LDRD
with larger offsets than ARM).

[Bug target/77308] surprisingly large stack usage for sha512 on arm

2016-10-31 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308

--- Comment #32 from Wilco  ---
(In reply to Bernd Edlinger from comment #31)
> Sure, combine cant help, especially because it runs before split1.
> 
> But I wondered why this peephole2 is not enabled:
> 
> (define_peephole2 ; ldrd
>   [(set (match_operand:SI 0 "arm_general_register_operand" "")
> (match_operand:SI 2 "memory_operand" ""))
>(set (match_operand:SI 1 "arm_general_register_operand" "")
> (match_operand:SI 3 "memory_operand" ""))]
>   "TARGET_LDRD
>  && current_tune->prefer_ldrd_strd
>  && !optimize_function_for_size_p (cfun)"
>   [(const_int 0)]
> 
> 
> I have -march=armv7-a / -mcpu=cortex-a9 and thus for me
> current_tune-> prefer_ldrd_strd is FALSE.
> 
> Furthermore, if I want to do -Os the third condition is FALSE too.
> But one ldrd must be shorter than two ldr ?
> 
> That seems wrong...

Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
should only be tried on Thumb-1. Emitting LDRD from a peephole when the offset
is in range will never increase code size so should always be enabled.

[Bug rtl-optimization/65862] [MIPS] IRA/LRA issue: integers spilled to floating-point registers

2015-04-27 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65862

--- Comment #4 from Wilco  ---
(In reply to Vladimir Makarov from comment #3)
> But I can not just revert the patch making ALL_REGS available to make
> coloring heuristic more fortunate for your particular case, as it
> reopens the old PR for which the patch was created and I have no other
> solutions for the old PR.

I tried reverting the ALL_REGS patch and I don't see any regressions - in fact
allocations are slightly better (fewer registers with ALL_REGS preference which
is what we need - a strong decision to allocate to either FP or int regs). So
what was the motivation for it?

Note that it would be trivial to prefer ALL_REGS for some operands if
necessary. The only case I can imagine is load and store which on some targets
are quite orthogonal. I tried doing m=r#w and m=w#r on AArch64 and that works
fine (this tells the preferencing code that ALL_REGS is best but it still keeps
a clear INT/FP separation in the patterns which you may need for disassembly
etc).

IMHO that is a better solution than to automatically change the patterns r=r+r;
w=w+w into rw=rw+rw and assume that ALL_REGS preference on all operands has
zero cost even though the cost calculation explicitly states otherwise.


[Bug rtl-optimization/65862] [MIPS] IRA/LRA issue: integers spilled to floating-point registers

2015-05-14 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65862

--- Comment #13 from Wilco  ---
(In reply to Vladimir Makarov from comment #9)
> Created attachment 35503 [details]
> ira-hook.patch
> 
> Here is the patch.  Could you try it and give me your opinion about it. 
> Thanks.

I tried it out and when forcing ALL_REGS to either GENERAL_REGS or FP_REGS
based on mode, it generates significantly smaller spill code, especially on some
of the high register pressure SPECFP2006 benchmarks like gamess. It is better
than avoiding ALL_REGS if the cost is higher (like the PPC patch mentioned
earlier) - this indicates that the case for preferring ALL_REGS for generic
loads/stores is pretty thin and likely not beneficial overall.

I'm glad with this we're moving towards a more conventional allocation scheme
where the decision of which register class to use is made early rather than
independently for each operand during allocation.

I didn't do a full performance regression test but early results show identical
performance with smaller codesize.

Note that this doesn't solve the lra-constraints issue where it ignores the
allocno class during spilling and just chooses the first variant that matches.


[Bug target/63304] Aarch64 pc-relative load offset out of range

2015-11-06 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304

--- Comment #33 from Wilco  ---
(In reply to Evandro from comment #32)
> (In reply to Ramana Radhakrishnan from comment #31)
> > (In reply to Evandro from comment #30)
> > > The performance impact of always referring to constants as if they were 
> > > far
> > > away is significant on targets which do not fuse ADRP and LDR together. 
> > 
> > What happens if you split them up and schedule them appropriately ? I didn't
> > see any significant impact in my benchmarking on implementations that did
> > not implement such fusion. Where people want performance in these cases they
> > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> > already.
> 
> Because of side effects of the Haiffa scheduler, the loads now pile up, and
> the ADRPs may affect the load issue rate rather badly if not fused.  At leas
> on our processor.  

ADRP latency to load-address should be zero on any OoO core - ADRP is basically
a move-immediate, so can execute early and hide any latency.

> Which brings another point, shouldn't there be just one ADRP per BB or,
> ideally, per function?  Or am I missing something?

That's not possible in this case as the section is mergeable. An alternative
implementation using anchors may be feasible, but GCC is extremely bad at using
anchors efficiently - functions using several global variables also end up with
a large number of ADRPs when you'd expect a single ADRP.
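
A minimal illustration of the anchor point (hypothetical globals; with good
section-anchor usage one ADRP of the anchor plus three low-part offsets would
be enough, yet each global typically gets its own ADRP):

int a, b, c;

int
sum_globals (void)
{
  return a + b + c;
}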

[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #6 from Wilco  ---
I ran the assembler examples on A57 hardware with identical input. The FMADD
code is ~20% faster irrespective of the size of the input. This is not a
surprise given that the FMADD latency is lower than the FADD and FMUL latency.

Neither the alignment of the loop nor scheduling matters at all as the FMADD
latency dominates by far - with serious optimization this code could run 4-5
times as fast and would only be limited by memory bandwidth on datasets larger
than L2.

So this particular example shows issues in LLVM, not in GCC.


[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations

2014-10-21 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #10 from Wilco  ---
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
question is why GCC doesn't unroll vectorized loops like LLVM does?

GCC:

.L24:
ldr     q3, [x13, x5]
add     x6, x6, 1
ldr     q2, [x16, x5]
cmp     x6, x12
add     x5, x5, 16
fmla    v1.2d, v3.2d, v2.2d
bcc     .L24

LLVM:

.LBB2_12:
ldur    q2, [x8, #-16]
ldr     q3, [x8], #32
ldur    q4, [x21, #-16]
ldr     q5, [x21], #32
fmla    v1.2d, v2.2d, v4.2d
fmla    v0.2d, v3.2d, v5.2d
sub     x30, x30, #4            // =4
cbnz    x30, .LBB2_12

