[Bug rtl-optimization/94026] combine missed opportunity to simplify comparisons with zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94026

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #5 from Wilco ---
(In reply to Fei Yang from comment #4)
> (In reply to Fei Yang from comment #0)
> > Created attachment 47966 [details]
> > proposed patch to fix this issue
> >
> > Simple test case:
> > int
> > foo (int c, int d)
> > {
> >   int a = (c >> d) & 7;
> >
> >   if (a >= 2) {
> >     return 1;
> >   }
> >
> >   return 0;
> > }
> >
> > Compile option: gcc -S -O2 test.c
> >
> > On aarch64, GCC trunk emits 4 instructions:
> >         asr     w0, w0, 8
> >         tst     w0, 6
> >         cset    w0, ne
> >         ret
> >
> > which can be further simplified into:
> >         tst     x0, 1536
> >         cset    w0, ne
> >         ret
> >
> > We see the same issue on other targets such as i386 and x86-64.
> >
> > Attached please find proposed patch for this issue.
>
> The previously posted test case is not correct.
> Test case should be:
> int fifth (int c)
> {
>   int a = (c >> 8) & 7;
>
>   if (a >= 2) {
>     return 1;
>   } else {
>     return 0;
>   }
> }

Simpler cases are:

int f1(int x) { return ((x >> 8) & 6) != 0; }
int f2(int x) { return ((x << 2) & 24) != 0; }
int f3(unsigned x) { return ((x << 2) & 15) != 0; }
int f4(unsigned x) { return ((x >> 2) & 14) != 0; }
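For illustration (a sketch, not part of the PR): in f1 the simplification amounts to
folding the shift into the mask constant, so no separate shift instruction is needed:

int f1(int x)            { return ((x >> 8) & 6) != 0; }
/* equivalent, since bits 1-2 of (x >> 8) are bits 9-10 of x: */
int f1_simplified(int x) { return (x & (6 << 8)) != 0; }   /* tst with 0x600 (1536) */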
[Bug tree-optimization/91322] [10 regression] alias-4 test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Martin Liška from comment #1)
> I've just run the test-case on aarch64 and it works fine (-O2, -O2 -flto,
> -O3 -flto -fno-early-inlining). And lto.exp testsuite works fine on aarch64.
> @Wilco: Can you please double-check?

Yes it now works on AArch64, but I still see failures on Arm.
[Bug tree-optimization/91322] [10 regression] alias-4 test failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91322

--- Comment #10 from Wilco ---
(In reply to Christophe Lyon from comment #6)
> Created attachment 48184 [details]
> GCC passes dumps

So according to that, in 105t.vrp1 it removes the branch and unconditionally
calls abort:

Folding statement: _4 = _3 == 0B;
Matching expression match.pd:1737, gimple-match.c:708
Matching expression match.pd:1740, gimple-match.c:772
Matching expression match.pd:1747, gimple-match.c:826
Not folded
Folding statement: if (_5 == 0)
gimple_simplified to if (1 != 0)
Folded into: if (1 != 0)
Folding statement: return;
Not folded
Folding statement: __builtin_abort ();
Not folded
Removing dead stmt _5 = __builtin_constant_p (_4);

It doesn't make sense, these are the VRP ranges:

_1: short int * * VARYING
_2: short int * * VARYING
_3: short int * VARYING
_4: bool VARYING
_5: int [0, 0]
_10: struct a * VARYING

So somehow it decides that __builtin_constant_p (VARYING) == [0,0]???
[Bug tree-optimization/94442] [AArch64] Redundant ldp/stp instructions emitted at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94442

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
This should be marked as [10 regression].
[Bug rtl-optimization/93565] Combine duplicates count trailing zero instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #6 from Wilco ---
(In reply to Segher Boessenkool from comment #4)
> Why would that be unlikely?  It lengthens the lifetime of that pseudo,
> potentially significantly.

The move should be easy to remove since it is already known to be redundant.
I don't understand how you can say that the lifetime is increased when
duplicating the instruction is actually what increases the lifetime. Also it
requires extra registers to hold the two identical results.
[Bug middle-end/66462] GCC isinf/isnan/... builtins cause sNaN exceptions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66462

--- Comment #12 from Wilco ---
(In reply to Segher Boessenkool from comment #11)
> I currently have
>
> ===
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index ad5135c..bc3d318 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -9050,6 +9050,12 @@ fold_builtin_interclass_mathfn (location_t loc, tree fnde
>    if (interclass_mathfn_icode (arg, fndecl) != CODE_FOR_nothing)
>      return NULL_TREE;
>
> +  /* None of these builtins are ever exceptional, not even for signaling NaNs,
> +     so we cannot do any of these optimizations that involve a floating point
> +     comparison.  */
> +  if (flag_signaling_nans)
> +    return NULL_TREE;
> +
>    mode = TYPE_MODE (TREE_TYPE (arg));
>
>    bool is_ibm_extended = MODE_COMPOSITE_P (mode);
> ===
>
> but we should really handle this with some non-signaling insns, not punt
> it to libm to do.

Well we should simply commit Tamar's patch again since it works fine on any
IEEE target and showed performance gains across many targets. Any issues with
weird 128-bit FP formats can be addressed separately.
[Bug fortran/91690] Slow IEEE intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91690

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #6 from Wilco ---
(In reply to Steve Kargl from comment #5)
> On Mon, Sep 09, 2019 at 06:25:53PM +0000, wilco at gcc dot gnu.org wrote:
> > >
> > > The Fortran standard may require this behavior.  18-007r1 page 435
> >
> > But none of that is needed since a correct implementation of isnan
> > etc does not affect the floating point state at all (see PR66462).
>
> OK.  So you can special case IEEE_IS_NAN().  What about
> the other IEEE_ARITHMETIC functions?

The same is true for the other isxxx functions from C99 and the new iszero and
isdenormal. Are there other functions besides these in IEEE_ARITHMETIC?

> PR66462 has been re-opened.  Tamir seemed to have developed
> 4 different patches, yet no one can agree on "a correct
> implementation" for all targets.

As his latest comments explain, Tamar's last patch is correct. That should be
the default since an integer implementation which doesn't affect the floating
point state is the only correct one. As a bonus, it happens to be the fastest
as well.

Either way, I don't believe the Fortran front-end should try to work around
mid-end bugs. As reported, a save/restore is prohibitively expensive on all
targets, so it shouldn't even be contemplated.
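For illustration only (a sketch of the idea, not Tamar's actual patch): an
integer-only NaN test for IEEE binary64 never touches the floating-point state,
so it cannot raise exceptions even for signaling NaNs:

#include <stdint.h>
#include <string.h>

/* Integer-only isnan for IEEE binary64: inspect the bit pattern instead of
   doing an FP compare, so no exception flags are raised, not even for sNaNs.  */
static int isnan_bits (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);        /* reinterpret, no FP operation */
  return (bits & 0x7fffffffffffffffULL) > 0x7ff0000000000000ULL;
}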
[Bug middle-end/91753] Bad register allocation of multi-register types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91753

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Andrew Pinski from comment #1)
> lower-subreg should have be able to help here.  I wonder why it did not ...

I'm not sure how it can help. When you write a part of a multi-register mode
using subreg, you get incorrect liveness info. This is why splitting 64-bit
types on 32-bit targets before register allocation gives such a huge gain.

The approach I used in other compilers was to generate multi-register virtual
registers using a single create operation rather than via multiple subreg
writes.
[Bug target/91766] -fvisibility=hidden during -fpic still uses GOT indirection on arm64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91766

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #4 from Wilco ---
(In reply to martin krastev from comment #3)
> So it appears to be a clash between -fcommon and -fvisibility=hidden during
> -fpic -- passing either -fno-common or -fno-pic drops the GOT indirection.
> And explicitly hiding the symbol obviously solves it. But the crux of the
> issue, IMO, is a multi-platform one -- that behavior deviates on gcc-8.2
> from platform to platform. On amd64 it suffices to -fvisibility=hidden to
> stop GOT detours, whereas on aarch64 it's -fvisibility=hidden -fno-common.
> As a result aarch64 performance gets penalized in unsuspecting multi-plats.

-fno-common should really become the default, -fcommon causes way too many
issues.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #27 from Wilco ---
(In reply to Segher Boessenkool from comment #26)
> Yeah, and it probably should be a param (that different targets can default
> differently, per CPU probably).  On most Power CPUs all loops take a minimum
> number of cycles per iteration (say, three), but that translates to a lot of
> instructions (say, >10).
>
> Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
> length loops only?

Agreed, there is no reason not to unroll (or vectorize) with -O2 given several
other compilers do (and beat GCC as a result). It's best to limit unrolling to
small loops and only 2x unless it's 1-2 instructions.
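A sketch of the kind of limited 2x unrolling meant here (hand-written
illustration, not GCC output):

/* Small loop as written.  */
void scale (int *p, int n)
{
  for (int i = 0; i < n; i++)
    p[i] *= 3;
}

/* The same loop unrolled 2x with a scalar remainder - the conservative form
   a -O2 unroller could aim for.  */
void scale_unrolled_2x (int *p, int n)
{
  int i;
  for (i = 0; i + 1 < n; i += 2)
    {
      p[i] *= 3;
      p[i + 1] *= 3;
    }
  if (i < n)
    p[i] *= 3;
}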
[Bug middle-end/84071] [7/8 regression] nonzero_bits1 of subreg incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84071

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Andrew Pinski from comment #1)
> I think this is related to PR 81443.  Can you see if the 7.3 release has it
> fixed?

It's related indeed: in both cases load_extend_op is incorrectly used on
registers rather than on MEMs. The first primarily fails on targets that
return ZERO_EXTEND, the 2nd on targets that use SIGN_EXTEND.
[Bug rtl-optimization/81443] [8 regression] build/genrecog.o: virtual memory exhausted: Cannot allocate memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81443

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #21 from Wilco ---
See also PR84071 which has bad codegen from r242326.
[Bug middle-end/84071] [7/8 regression] wrong elimination of zero-extension after sign-extended load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84071

--- Comment #8 from Wilco ---
(In reply to Eric Botcazou from comment #6)
> > They are always written but have an undefined value. Adding 2 8-bit values
> > results in a 9-bit value with WORD_REGISTER_OPERATIONS.
>
> If they have an undefined value, then WORD_REGISTER_OPERATIONS must not be
> defined for ARM.  Here's the definition:
>
>  -- Macro: WORD_REGISTER_OPERATIONS
>      Define this macro to 1 if operations between registers with
>      integral mode smaller than a word are always performed on the
>      entire register.  Most RISC machines have this property and most
>      CISC machines do not.
>
> If the 8-bit addition is not performed on the entire 32-bit register, then
> this is not a WORD_REGISTER_OPERATIONS target.

The addition is performed on the full 32-bit register, so this obviously means
that the top 24 bits have an undefined value.
[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Steve Ellcey from comment #0)
> Created attachment 43279 [details]
> Test case
>
> The example code comes from milc in SPEC2006.
>
> GCC on x86 or aarch64 generates better code with -O3 than it does with
> -Ofast or '-O3 -ffast-math'.  On x86 compiling with '-mfma -O3' I get 5
> vfmadd231sd instructions, 1 vmulsd instruction and 6 vmovsd.  With '-mfma
> -Ofast' I get 3 vfmadd231sd, 2 vaddsd, 3 vmulsd, and 6 vmovsd.  That is two
> extra instructions.
>
> The problem seems to be that -Ofast turns on -ffast-math and that enables
> the global reassociation pass (tree-ssa-reassoc.c) and the code changes
> done there create some temporary variables which inhibit the recognition
> and use of fma instructions.
>
> Using -O3 and -Ofast on aarch64 shows the same change.

I noticed this a while back: the reassociation pass has changed and now we get
far fewer fmas. See https://gcc.gnu.org/ml/gcc-patches/2017-06/msg00771.html
[Bug tree-optimization/90838] Detect table-based ctz implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90838

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Jakub Jelinek from comment #1)
> __builtin_ctzll is undefined for 0, while the above is well defined and
> returns 0.

When the target ctz is well defined and returns 64 for 0, and we want to
return 0 instead, this will work:

  __builtin_ctzll (b) & 63
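For reference, a sketch of what the table-based sequence computes and of the
suggested mapping (illustrative only, not the PR's testcase):

/* The table-based idiom computes ctz for nonzero inputs and 0 for 0.  */
int ctz_or_zero (unsigned long long b)
{
  return b ? __builtin_ctzll (b) : 0;
}

/* On a target whose ctz pattern is defined to return 64 for a zero input,
   the branch can be dropped and the 64 folded to 0 with a mask:
       ctz (b) & 63
   This relies on the target-level definition; as portable C,
   __builtin_ctzll (0) is undefined.  */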
[Bug tree-optimization/91144] [10 regression] 176.gcc miscompare after r273294
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91144

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Andrew Pinski from comment #1)
> SPEC2000's GCC is known to have strict aliasing issues in it :).
> Does adding -fno-strict-aliasing help?

Yes - like I mentioned :-) It has never been needed though. Vlad's SPEC2000
config doesn't have it either, for example.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #23 from Wilco ---
(In reply to Jakub Jelinek from comment #22)
> So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0,
> because for x == 0 the test will yield a different value, as 0xffffffffU % 3U
> is 0 and 0 % 3U is also 0, x % 3U == 1 is equivalent to
> (x - 1) * 0xaaaaaaabU <= 0x55555554U, but x % 3U == 0 is equivalent to
> x * 0xaaaaaaabU <= 0x55555555U.
> Now to see if something useful can be used also for the even divisors.

Yes, for this case it is safe to do (x - 1) * 0xaaaaaaab < 0x55555555, but you
can also do x * 0xaaaaaaab >= 0xaaaaaaab which is even simpler. Basically
(x % C) == N is simpler for 2 values of N.
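A sketch of the two single-constant checks (constants as reconstructed above);
the exhaustive loop below verifies the equivalence over all 32-bit values:

#include <stdint.h>

/* x % 3 == 0 via the multiplicative inverse: 3 * 0xaaaaaaab == 1 (mod 2^32).  */
static int is_mult_of_3 (uint32_t x)
{
  return x * 0xaaaaaaabu <= 0x55555555u;
}

/* x % 3 == 1 using the simpler single-constant form from the comment.  */
static int is_rem_1 (uint32_t x)
{
  return x * 0xaaaaaaabu >= 0xaaaaaaabu;
}

int main (void)
{
  for (uint64_t x = 0; x <= 0xffffffffu; x++)
    if (is_mult_of_3 ((uint32_t) x) != (x % 3 == 0)
        || is_rem_1 ((uint32_t) x) != (x % 3 == 1))
      return 1;
  return 0;
}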
[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114

--- Comment #3 from Wilco ---
(In reply to Richard Biener from comment #1)
> This is probably related to targetm.sched.reassociation_width where reassoc
> will widen a PLUS chain so several instructions will be executable in
> parallel without dependences.  Thus, (x + (y + (z + w))) -> (x + y) + (z + w).
> When all of them are fed by multiplications this goes from four fmas to two.
>
> It's basically a target request we honor so it works as designed.
>
> At some point I thought about integrating FMA detection with reassociation.

It should understand FMA indeed: A*B + p[0] + C*D + p[1] + E*F + p[2] can
become (((p[0] + p[1] + p[2]) + A*B) + C*D) + E*F.

Also we're missing a reassociation depth parameter. You need to be able to
specify how long a chain needs to be before it is worth splitting - the
example shows a chain of 5 FMAs is not worth splitting since FMA latency on
modern cores is low, but if these were integer operations (not MADD) then the
chain should be split.
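A sketch in C of the FMA-aware association described above (illustration only;
fma() is used to make the intended contractions explicit):

#include <math.h>

/* As written: reassociation for ILP can split the chain and lose FMAs.  */
double sum6 (const double *p, double A, double B, double C,
             double D, double E, double F)
{
  return A*B + p[0] + C*D + p[1] + E*F + p[2];
}

/* FMA-aware association: sum the plain addends first, then feed the result
   through a chain of fused multiply-adds.  */
double sum6_fma (const double *p, double A, double B, double C,
                 double D, double E, double F)
{
  double acc = p[0] + p[1] + p[2];
  acc = fma (A, B, acc);
  acc = fma (C, D, acc);
  return fma (E, F, acc);
}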
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #21 from Wilco ---
(In reply to Douglas Mencken from comment #18)
> (In reply to Wilco from comment #17)
> > Yes that should work.
>
> Oops, but it doesn’t. I just tested it with patched 8.2. Same messages, same
> breakage

That's odd. The stack pointer is definitely 16-byte aligned in all cases
right? Can you check gcc.dg/pr78468.c passes and attach the disassembly
please?
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #24 from Wilco ---
(In reply to Douglas Mencken from comment #22)
> (In reply to Wilco from comment #21)
> > That's odd. The stack pointer is definitely 16-byte aligned in all cases
> > right?
>
> As I know, PowerPC has no special “stack pointer”, it is just one of general
> purpose register, conventionally it is r1. Instruction like “stwu r3,-2(r1)”
> which are common for prologues easily bin any alignment.

STACK_BOUNDARY is the minimum stack alignment. The optimizer relies on this
being correct. If the ABI or prologue doesn't guarantee the minimum alignment
then alloca will fail, so it's essential to set STACK_BOUNDARY correctly.

> > Can you check gcc.dg/pr78468.c passes and attach the disassembly
> > please?
>
> Using which compiler? xgcc from stage1? Is it buildable alone by itself?

Yes, the stage1 compiler would be fine, or alternatively use
--disable-bootstrap to get an installed compiler.
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #26 from Wilco ---
(In reply to Douglas Mencken from comment #25)
> (In reply to Wilco from comment #24)
> > Yes the stage1 compiler would be fine or alternatively use
> > --disable-bootstrap to get an installed compiler.
>
> I’m yet at libstdc++ of stage2 (which means that it succeeded the place of
> failing) for 8.2 patched with the reversion I’ve just published. About four
> hours and I’d get working 8.2

If that's the case then STACK_BOUNDARY is incorrect. So what is the actual
alignment on Darwin, 8-byte alignment? How are Altivec instructions handled,
which do require 16-byte alignment?

> Or do you want to look at results with your r251713? Does it matter?

Either would be interesting - it would show what alignment the prologue really
uses and what code it generates for alloca.
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #32 from Wilco ---
(In reply to Segher Boessenkool from comment #29)
> It aligns the stack to 16:
>
> # r3 is size, at entry
>         addi r3,r3,18
> ...
>         rlwinm r3,r3,0,0,27
> ...
>         neg r3,r3
> ...
>         lwz r2,0(r1)
> ...
>         stwux r2,r1,r3
>
> (the rlwinm is  r3 &= ~15;  )

So this rounds up the size but also adds an extra 16 bytes to the requested
allocation.

The alloca blocks don't get correctly aligned since STACK_DYNAMIC_OFFSET
returns 72 (which is not a multiple of 16):

t1_a4
        addi r2,r1,72      -> 72 % 16 != 0 (correct value would be 64)
        stw r6,60(r1)
        stw r2,56(r1)

This also shows in t1/t2_a32:

        addi r2,r1,103     -> 31 + 72 = 103 (correct value would be 80+31)
        stw r6,64(r1)
        rlwinm r2,r2,0,0,26

So this proves STACK_DYNAMIC_OFFSET is incorrect indeed. If there are still
failures with that fixed then the stack pointer must also be unaligned.
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #33 from Wilco ---
(In reply to Iain Sandoe from comment #30)
> From "Mac_OS_X_ABI_Function_Calls.pdf"
>
> m32 calling convention
>
> Prologs and Epilogs
> The called function is responsible for allocating its own stack frame,
> making sure to preserve 16-byte alignment in the stack. This operation is
> accomplished by a section of code called the prolog, which the compiler
> places before the body of the subroutine. After the body of the subroutine,
> the compiler places an epilog to restore the processor to the state it was
> prior to the subroutine call.

So functions must preserve 16-byte alignment, but can they rely on the stack
being always 16-byte aligned on entry?
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #38 from Wilco ---
(In reply to Douglas Mencken from comment #37)
> And some more in my wish list. May GCC don’t generate these
>
> .align 2
>
> in text section? Any, each and every powerpc instruction is 32bit-wide, no
> and never more, no and never less, so these aligns are redundant

You can have data in text sections, including bytes and half words. Even if
instructions were aligned automatically, the function label might be unaligned
if it was preceded by a byte.
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #41 from Wilco ---
(In reply to Douglas Mencken from comment #40)
> To build it, I patched its sources with fix_gcc8_build.patch reversion
> together with changes from comment #16

So what is the disassembly now? The 2nd diff still shows the original
unaligned STACK_DYNAMIC_OFFSET.
[Bug target/85669] fail on s-case-cfn-macros: build/gencfn-macros: DEF_INTERNAL_FLT/INT_FN (%smth%) has no associated built-in functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85669

--- Comment #43 from Wilco ---
(In reply to Douglas Mencken from comment #42)
> (In reply to Wilco from comment #41)
> > So what is the disassembly now?
>
> $ /Developer/GCC/8.2p/PowerPC/32bit/bin/gcc -O2 -fno-inline pr78468.c -save-temps
> $ mv pr78468.s ~/
> $ diff -u ~/8.2patched-pr78468.s ~/pr78468.s
>
> Shows nothing, so they are identical

So that patch didn't do anything to fix STACK_DYNAMIC_OFFSET then. If it has
no effect then is that header really used?

> > The 2nd diff still shows the original unaligned STACK_DYNAMIC_OFFSET.
>
> Second is vanilla 8.2, unpatched, which is marked with “-” in diff I posted
>
> Possible problem is that in
>
> -  if (size_align > known_align)
> -    size_align = known_align;
> +  if (extra && size_align > BITS_PER_UNIT)
> +    size_align = BITS_PER_UNIT;
>
> you forgot if (extra /* assumed >0 */

No, the problem is not in this code. It's STACK_DYNAMIC_OFFSET which is wrong.
[Bug tree-optimization/69336] Constant value not detected
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69336

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #13 from Wilco ---
(In reply to Richard Biener from comment #6)
> Author: rguenth
> Date: Tue Jan 19 13:27:11 2016
> New Revision: 232559
>
> URL: https://gcc.gnu.org/viewcvs?rev=232559&root=gcc&view=rev

Even with this fix it still fails SPEC2006 gamess on AArch64 exactly as
reported in bug 69368. It passes if I revert the EXPR_SINGLE case in
hashable_expr_equal_p to what it was before r232055 (ie. equal_mem_array_ref_p
is still not correct).
[Bug tree-optimization/69368] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...
[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #5 from Wilco ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...

(In reply to Richard Biener from comment #4)
> So - can you please bisect to a source file (just drop all others to -O0) at
> least?

Yes, I am working on that. It's OK with -O2, but starts failing with
-O3/-Ofast.
[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #6 from Wilco ---
This still fails on AArch64 in exactly the same way with latest trunk - can
someone reopen this? I don't seem to have the right permissions...

(In reply to Richard Biener from comment #4)
> So - can you please bisect to a source file (just drop all others to -O0) at
> least?

It looks like the failure is in mp2.fppized.o. Compiling it with -O3
-fomit-frame-pointer -fno-aggressive-loop-optimizations -fno-inline causes it
to fail the exam29 test with the recent tree-ssa-scopedtables.c changes, while
it passes without them. The diff is quite large so it's hard to tell which
function it is...
[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #8 from Wilco ---
In a few functions GCC decides that the assignments in loops are redundant.
The loops still execute but have their loads and stores removed. Eg. the first
DO loop in MP2NRG should be:

.L1027:
        add     w7, w7, 1
        add     w8, w8, w10
        cmp     w7, w26
        beq     .L1026
.L1029:
        add     w0, w11, w7
        add     x0, x2, x0, sxtw 3
        ldr     x1, [x0, -8]
        add     x0, x2, x7, sxtw 3
        str     x1, [x0, -8]
        cmp     w9, 0
        ble     .L1027

But with the scopedtables change it becomes:

.L1027:
        add     w2, w2, 1
        cmp     w2, w3
        beq     .L1026
.L1029:
        cmp     w4, 0
        ble     .L1027
[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #9 from Wilco ---
The loops get optimized away in dom2. The info this phase emits is hard to
figure out, so it's not obvious why it thinks the array assignments are
redundant (the array is used all over the place so clearly cannot be removed).
[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #2 from Wilco ---
Changing to c = 3 generates code after a short time. The issue is recursive
calls to expand_ccmp_expr during the 2 possible options tried to determine
costs. That makes the algorithm exponential.

A fix would be to expand the LHS and RHS of both gs0 and gs1 in
expand_ccmp_expr_1 (and not in gen_ccmp_first) as these expansions will be
identical no matter what order we choose to expand gs0 and gs1 in. It's not
clear how easy that can be achieved using the existing interfaces though.
[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #3 from Wilco ---
A simple workaround is to calculate cost1 early and only try the 2nd option if
the cost is low (ie. it's not a huge expression that may evaluate into lots of
ccmps). A slightly more advanced way would be to walk prep_seq_1 to count the
actual number of ccmp instructions.
[Bug libstdc++/69657] New: [6 Regression] abs() not inlined after including math.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69657

            Bug ID: 69657
           Summary: [6 Regression] abs() not inlined after including math.h
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

Since a recent C++ header change abs() no longer gets inlined if we include an
unrelated header before it.

#include
#include

int wrap_abs (int x)
{
  return abs (x) + std::abs(x);
}

Until recently GCC6 with -O2 -S produced for AArch64:

_Z8wrap_absi:
        cmp     w0, wzr
        csneg   w0, w0, w0, ge
        lsl     w0, w0, 1
        ret

However trunk GCC6 now emits:

_Z8wrap_absi:
        stp     x29, x30, [sp, -16]!
        add     x29, sp, 0
        bl      abs
        lsl     w0, w0, 1
        ldp     x29, x30, [sp], 16

This may result in significant slowdowns as it adds a call and PLT indirection
to execute a 3-instruction function. There might be other builtin functions
that are affected in a similar way.
[Bug target/69619] [6 Regression] compilation doesn't terminate during CCMP expansion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69619

--- Comment #5 from Wilco ---
Proposed patch: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00206.html
[Bug c++/69657] [6 Regression] abs() not inlined after including math.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69657

--- Comment #5 from Wilco ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Jonathan Wakely from comment #3)
> > Recategorising as component=c++, and removing the regression marker (because
> > the change in libstdc++ that reveals this issue is required for
> > conformance).
>
> Adding back regression as it is an user visible regression even though only
> the library change, the user does not understand that.  In fact I suspect
> this comes from SPEC CPU 2006 or some other heavily used benchmarking code.

Correct, I found it while investigating major regressions in a few benchmarks.

Note I see the long, long long, int128, float, double, long double overloads,
but where is the plain int overload defined? I bet that adding an int overload
that redirects to __builtin_abs similar to the others will fix the issue.
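A sketch of the suggested overload (illustrative only, shown as it would
appear inside the library headers, not the actual libstdc++ code):

namespace std
{
  // Hypothetical plain-int overload alongside the existing long/long long/
  // __int128/float/double/long double ones, so std::abs(int) expands to the
  // builtin instead of calling the C library's abs().
  inline int
  abs(int __x)
  { return __builtin_abs(__x); }
}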
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #29 from Wilco ---
(In reply to rguent...@suse.de from comment #28)
> On Fri, 5 Feb 2016, alalaw01 at gcc dot gnu.org wrote:
> > Should I raise a new bug for this, as both this and 53068 are CLOSED?
>
> I think this has been discussed in some other dup already and
> the Fortran FE folks disagreed (it was never "legal", not even in F77).
>
> I also don't see how it can be a FE only fix.  Possibly we can
> implement a middle-end switch that tells us that the size of commons
> is not to be trusted.  The FE could then set that flag with -std=legacy.
>
> You can, after all, "simulate" the very same failure with C.

Isn't there already a special exception for C (array of size 1 at end of
structure)? The same exception could be enabled with -std=legacy. You'd only
need to do extra FE work if you wanted to just do this for COMMON, but that
seems hardly worth the extra effort - how much gain do we really get from the
array size 1 optimization apart from repeatedly breaking SPEC benchmarks in
various ways? Disabling it will likely remove the need for
-fno-aggressive-loop-optimizations as well.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368

--- Comment #41 from Wilco ---
(In reply to Jerry DeLisle from comment #40)
> Do you have a reduced test case of the Fortran code we can look at?

See comment 13/14: the same common array is declared with different sizes in
various places.

> I am concerned that you are changing things to accommodate invalid Fortran.
> I don't have 416.gamess to look at. I do see your suggested patch touches
> the Fortran frontend.

Since SPEC does not want to change the invalid Fortran, GCC will have to
accommodate it somehow.

> Maybe it needs to be put behind -std=legacy?
>
> I assume the "(without flag)" means you intend to do something like that?

Yes, but it was suggested that -std=legacy wasn't the right flag in comment
35...
[Bug target/70048] New: [AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

            Bug ID: 70048
           Summary: [AArch64] Inefficient local array addressing
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The following example generates very inefficient code on AArch64:

int f1(int i)
{
  int p[1000];
  p[i] = 1;
  return p[i + 10] + p[i + 20];
}

f1:
        sub     sp, sp, #4000
        add     w2, w0, 10
        add     x4, sp, 4000
        add     w1, w0, 20
        mov     w3, 1
        add     x0, x4, x0, sxtw 2
        add     x2, x4, x2, sxtw 2
        sub     x0, x0, #4096
        add     x1, x4, x1, sxtw 2
        sub     x1, x1, #4096
        sub     x2, x2, #4096
        str     w3, [x0, 96]
        ldr     w0, [x1, 96]
        ldr     w2, [x2, 96]
        add     sp, sp, 4000
        add     w0, w2, w0
        ret

Previous compilers, eg. GCC4.9, generate:

f1:
        sub     sp, sp, #4000
        add     w1, w0, 10
        add     w2, w0, 20
        mov     w3, 1
        str     w3, [sp, w0, sxtw 2]
        ldr     w1, [sp, w1, sxtw 2]
        ldr     w0, [sp, w2, sxtw 2]
        add     sp, sp, 4000
        add     w0, w1, w0
        ret
[Bug target/70048] [AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #1 from Wilco ---
The regression seems to have appeared on trunk around Feb 3-9.
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #5 from Wilco ---
(In reply to amker from comment #4)
> (In reply to ktkachov from comment #3)
> > Started with r233136.
>
> That's why I forced base+offset out of memory reference and kept register
> scaling in in the first place.  I think my statement still holds, that
> GIMPLE optimizers should be improved to CSE register scaling expression
> among different memory references, then force base(sp)+offset parts out of
> memory reference in legitimize_address so that it can be merged.  I suspect
> this case is going to be very inefficient in loop context.  Unfortunately
> GIMPLE optimizers are weak now and r233136 is a workaround to it.  I want to
> revisit slsr sometime after stage4 to fix this issue.

The issue is that the new version splits immediates from virtual stack
variables. This is what GCC used to generate:

(insn 8 7 9 2 (set (reg:DI 82)
        (plus:DI (reg/f:DI 68 virtual-stack-vars)
            (const_int -4000 [0xfffffffffffff060]))) addroff.c:38 -1

However latest GCC does:

(insn 9 8 10 2 (set (reg:DI 84)
        (plus:DI (reg/f:DI 68 virtual-stack-vars)
            (reg:DI 83))) addroff.c:38 -1
     (nil))
(insn 10 9 11 2 (set (reg:DI 85)
        (plus:DI (reg:DI 84)
            (const_int -4096 [0xfffffffffffff000]))) addroff.c:38 -1
     (nil))

So we need to recognize virtual-stack-vars+offset as a special case rather
than like any other add with immediate.
[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

Wilco changed:
           What    |Removed                     |Added
                 CC|                            |wdijkstr at arm dot com

--- Comment #2 from Wilco ---
(In reply to Jakub Jelinek from comment #1)
> These glibc inlines are just wrong, they should trust the compiler doing the
> right thing at least for contemporary versions.
> 05a910f7b420c2b831f35ba90e61c80f001c0606 should be IMHO reverted, it should
> be approached at the compiler side if the ARM folks see any performance
> issues with what is generated.

It's not about the generated code, it's about whether one calls memcpy or
mempcpy in the case where GCC doesn't inline it. Few targets support an
assembler mempcpy implementation, so by default the best option is to defer to
memcpy. Targets can add an optimized mempcpy and define
_HAVE_STRING_ARCH_mempcpy to get mempcpy (as x86 does).

Is there a mechanism by which GCC is told accurately whether the library it
targets supports an optimized mempcpy?
[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #5 from Wilco ---
(In reply to Jakub Jelinek from comment #3)
> If some arch in glibc implements memcpy.S and does not implement mempcpy.S,
> then obviously the right fix is to add mempcpy.S for that arch, usually it
> is just a matter of #include memcpy.S with some define USE_AS_MEMPCPY, and
> change a couple of instructions in the assembly.  You don't need to remember
> the original value of dest, usually you have the address to which you store
> bytes in some register, so it is just a matter of copying it to the return
> register.

We had a long discussion on this at the time. Very few targets have
implemented mempcpy.S, so it would be a lot of effort to do so. Sharing a
single implementation of memcpy is typically not possible without slowing down
memcpy (which is more important), and thus you end up with 2 separate
implementations which create unnecessary cache pollution.

So overall I believe the best option for most targets is to defer to memcpy
with an extra add instruction to compute the end. We could do that in GCC
rather than in the GLIBC headers, but then it would be harder to actually call
mempcpy in the case that a target supports an optimized implementation.
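The "defer to memcpy" option amounts to something like this (a sketch of the
generic fallback, not the actual glibc header code):

#include <string.h>

/* Generic mempcpy fallback: one memcpy call plus an add to form the return
   value, instead of pulling in a separate mempcpy implementation.  */
static inline void *
mempcpy_via_memcpy (void *dst, const void *src, size_t n)
{
  return (char *) memcpy (dst, src, n) + n;
}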
[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #6 from Wilco ---
(In reply to Jakub Jelinek from comment #4)
> Note the choice of this in a header file is obviously wrong, if you at some
> point fix this up, then apps will still call memcpy rather than mempcpy,
> even when the latter is more efficient (because it doesn't have to save the
> length value in some location where it survives across the call).
> Note if you don't use the result of mempcpy, gcc is able to optimize it into
> memcpy, and tons of other optimizations.

It's highly unlikely that calling mempcpy would ever provide a performance
gain.
[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #9 from Wilco ---
(In reply to H.J. Lu from comment #8)
> Inlining mempcpy uses a callee-saved register:
> ...
>
> Not inlining mempcpy is preferred.

If codesize is the only thing that matters... The cost is not at the caller
side but in requiring a separate mempcpy function which causes extra I-cache
misses. The only case where mempcpy makes sense is if you can use a shared
implementation with zero overhead to memcpy.

Btw those tests do want to use memcpy anyway, but you could use (mempcpy) to
avoid any unwanted header substitution to ensure the test passes.
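For reference, the parenthesized-call form looks like this (a generic C
property, not specific to these tests):

#define _GNU_SOURCE           /* for the mempcpy declaration on glibc */
#include <string.h>

void copy_tail (char *dst, const char *src, size_t n)
{
  /* A function-like macro named mempcpy only expands when the name is
     followed directly by '(' - parenthesizing the name suppresses any
     header substitution and calls the real function.  */
  (mempcpy) (dst, src, n);
}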
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #12 from Wilco ---
(In reply to Jiong Wang from comment #11)
> (In reply to Richard Henderson from comment #10)
> > Created attachment 37890 [details]
> > second patch
> >
> > Still going through full testing, but I wanted to post this
> > before the end of the day.
> >
> > This update includes a virt_or_elim_regno_p, as discussed in #c7/#c8.
> >
> > It also updates aarch64_legitimize_address to treat R0+R1+C as a special
> > case of R0+(R1*S)+C.  All of the arguments wrt scaling apply to unscaled
> > indices as well.
> >
> > As a minor point, doing some of the expansion in a slightly different
> > order results in less garbage rtl being generated in the process.
>
> Richard,
>
> I just recalled the reassociation of constant offset with virtual frame
> pointer will increase register pressure, thus cause bad code generation
> under some situations.  For example, the testcase given at
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62173#c8
>
> void bar(int i)
> {
>   char A[10];
>   char B[10];
>   char C[10];
>   g(A);
>   g(B);
>   g(C);
>   f(A[i]);
>   f(B[i]);
>   f(C[i]);
>   return;
> }
>
> Before your patch we are generating (-O2)
> ===
> bar:
>         stp     x29, x30, [sp, -80]!
>         add     x29, sp, 0
>         add     x1, x29, 80
>         str     x19, [sp, 16]
>         mov     w19, w0
>         add     x0, x29, 32
>         add     x19, x1, x19, sxtw
>         bl      g
>         add     x0, x29, 48
>         bl      g
>         add     x0, x29, 64
>         bl      g
>         ldrb    w0, [x19, -48]
>         bl      f
>         ldrb    w0, [x19, -32]
>         bl      f
>         ldrb    w0, [x19, -16]
>         bl      f
>         ldr     x19, [sp, 16]
>         ldp     x29, x30, [sp], 80
>         ret
>
> After your patch, we are generating:
> ===
> bar:
>         stp     x29, x30, [sp, -96]!
>         add     x29, sp, 0
>         stp     x21, x22, [sp, 32]
>         add     x22, x29, 48
>         stp     x19, x20, [sp, 16]
>         mov     w19, w0
>         mov     x0, x22
>         add     x21, x29, 64
>         add     x20, x29, 80
>         bl      g
>         mov     x0, x21
>         bl      g
>         mov     x0, x20
>         bl      g
>         ldrb    w0, [x22, w19, sxtw]
>         bl      f
>         ldrb    w0, [x21, w19, sxtw]
>         bl      f
>         ldrb    w0, [x20, w19, sxtw]
>         bl      f
>         ldp     x19, x20, [sp, 16]
>         ldp     x21, x22, [sp, 32]
>         ldp     x29, x30, [sp], 96
>         ret
>
> We are using more callee saved registers, thus extra stp/ldp generated.
>
> But we do will benefit from reassociation of constant offset with virtual
> frame pointer if it's inside a loop, because:
>
>   * vfp + const_offset is loop invariant
>   * the virtual reg elimination on vfp will eventually generate one
>     extra instruction if it was not used with const_offset but another reg.
>
> Thus after this reassociation, rtl IVOPT can hoist it out of loop, and we
> will save two instructions in the loop.
>
> A fix was proposed for loop-invariant.c to only do such reshuffling for
> loops, see https://gcc.gnu.org/ml/gcc-patches/2014-12/msg01253.html.  That
> patch finally stopped because the issue PR62173 was fixed on tree level, and
> the pointer re-shuffling was considered to have a hidden overflow risk though
> it will be very rare.

I don't believe this is really worse - if we had say the same example with 3
pointers or 3 global arrays we should get the exact same code (and in fact
generating the same canonicalized form for different bases and scales is
essential). Once you've done that you can try optimizing accesses which differ
by a *small* constant offset.
[Bug middle-end/70140] New: Inefficient expansion of __builtin_mempcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70140

            Bug ID: 70140
           Summary: Inefficient expansion of __builtin_mempcpy
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The expansion of __builtin_mempcpy is inefficient on many targets (eg.
AArch64, ARM, PPC). The issue is due to not using the same expansion options
that memcpy uses in builtins.c. As a result GCC6 produces for
__builtin_mempcpy(x, y, 32):

PPC:
   0:   38 a0 00 20     li      r5,32
   4:   48 00 00 00     b       4
                        4: R_PPC_REL24  mempcpy
   8:   60 00 00 00     nop
   c:   60 42 00 00     ori     r2,r2,0

AArch64:
        mov     x2, 32
        b       mempcpy

A second issue is that GCC always calls mempcpy. mempcpy is not supported or
implemented efficiently in many (if not most) library/target combinations.
GLIBC only has 3 targets which implement an optimized mempcpy, so GLIBC
currently inlines mempcpy into memcpy by default unless a target explicitly
disables this. It seems better to do this in GCC so it works for all
libraries.
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #15 from Wilco ---
(In reply to Richard Biener from comment #14)
> The regression in the original description looks severe enough to warrant
> some fixing even if regressing some other cases.

Agreed, I think the improvement from Richard H's "2nd patch" outweighs the
regression shown in comment 11. Assuming it benchmarks OK, for GCC6 that seems
the best we can do.

It is unlikely that legitimize_address could ever be used to improve
addressing of multiple arrays. The arrays must both be small and allocated
close together for this to be beneficial, and that is something you don't know
in advance, let alone could decide in a callback that only sees one address at
a time. So that requires a different approach.
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #17 from Wilco ---
(In reply to Jiong Wang from comment #16)
> * for the second patch at #c10, if we always do the following no matter
>   op0 is virtual & eliminable or not
>
>     "op1 = force_operand (op1, NULL_RTX);"
>
>   then there is no performance regression as well. Even more, there is
>   significant perf improvement beyond current trunk on one benchmark.

I ran this modified patch through a few benchmarks and there are no
regressions. The codesize of SPEC2006 reduces significantly in several cases
(-0.25%), while there are only 2 minor increases of +0.02%, so the patch
reduces instruction counts quite effectively.

So we should go with the modified patch for GCC6.
[Bug target/70048] [6 Regression][AArch64] Inefficient local array addressing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70048

--- Comment #20 from Wilco ---
(In reply to Richard Henderson from comment #19)
> I wish that message had been a bit more complete with the description
> of the performance issue.  I must guess from this...
>
> > ldr dst1, [reg_base1, reg_index, #lsl 1]
> > ldr dst2, [reg_base2, reg_index, #lsl 1]
> > ldr dst3, [reg_base3, reg_index, #lsl 1]
> >
> > into
> >
> > reg_index = reg_index << 1;
> > ldr dst1, [reg_base1, reg_index]
> > ldr dst2, [reg_base2, reg_index]
> > ldr dst3, [reg_base3, reg_index]
>
> that it must have something to do with the smaller cpus, e.g. exynosm1,
> based on the address cost tables.

Some CPUs emit separate uops to do an address shift by 1. So that would mean 6
uops in the first example vs 4 when doing the shift separately. According to
the cost tables this might actually be worse on exynosm1 as it has a cost for
any indexing.

> I'll note for the record that you cannot hope to solve this with
> the legitimize_address hook alone for the simple reason that it's not
> called for legitimate addresses, of which (base + index * 2) is
> a member.  The hook is only being called for illegitimate addresses.

Would it be possible to disallow expensive addresses initially, let CSE do its
thing and then merge addresses with 1 or 2 uses back into loads/stores?

> To include legitimate addresses, you'd have to force out the address
> components somewhere else.  Perhaps in the mov expanders, since that's
> one of the very few places mem's are allowed.  You'd want to do this
> only if !cse_not_expected.

And presumably only for addresses we would prefer to be CSEd, such as
expensive shifts or indexing.

> OTOH, it's also the sort of thing that one would hope that CSE itself
> would be able to handle.  Looking across various addresses, computing
> sums of costs, and breaking out subexpressions as necessary.

Yes that would be the ideal, but one can dream... Did you post your patch btw?
We should go ahead with that (with Jiong's minor modification) as it looks
significantly better overall.
[Bug middle-end/70801] New: IRA caller-saves does not support rematerialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70801

            Bug ID: 70801
           Summary: IRA caller-saves does not support rematerialization
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

GCC emits the same code for caller-saves in all cases, even if the caller-save
is an immediate which can be trivially rematerialized. The caller-save code
should support rematerialization directly by omitting saves and emitting the
init value for restores. This allows the cost of rematerializable caller-saves
to be lowered and avoids a stack slot.

Alternatively, given that the spill code already supports rematerialization,
it may be possible to stop immediates from being caller-saved by increasing
their caller-save cost significantly.

void g(void);

float bad_remat(float x)
{
  x += 3.0f;
  g();
  x *= 3.0f;
  x *= (3.0f + x);
  g();
  x *= 3.0f;
  x *= (3.0f + x);
  return x;
}

AArch64 gives with -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10
-ffixed-d11 -ffixed-d12 -ffixed-d13 -ffixed-d14 -ffixed-d15:

        fmov    s1, 3.0e+0
        str     x30, [sp, -32]!
        fadd    s0, s0, s1
        stp     s0, s1, [sp, 24]
        bl      g
        ldp     s0, s1, [sp, 24]
        fmul    s0, s0, s1
        fadd    s2, s0, s1
        fmul    s0, s0, s2
        str     s0, [sp, 24]
        bl      g
        ldp     s0, s1, [sp, 24]
        ldr     x30, [sp], 32
        fmul    s0, s0, s1
        fadd    s1, s0, s1
        fmul    s0, s0, s1
        ret

x86_64 with plain -O2:

        subq    $24, %rsp
        movss   .LC0(%rip), %xmm1
        addss   %xmm1, %xmm0
        movss   %xmm1, 12(%rsp)
        movss   %xmm0, 8(%rsp)
        call    g
        movss   8(%rsp), %xmm0
        movaps  %xmm0, %xmm2
        movss   12(%rsp), %xmm1
        mulss   %xmm1, %xmm2
        movaps  %xmm2, %xmm0
        addss   %xmm1, %xmm0
        mulss   %xmm2, %xmm0
        movss   %xmm0, 8(%rsp)
        call    g
        movss   12(%rsp), %xmm1
        movss   8(%rsp), %xmm0
        addq    $24, %rsp
        mulss   %xmm1, %xmm0
        addss   %xmm0, %xmm1
        mulss   %xmm1, %xmm0
        ret
[Bug middle-end/70802] New: IRA memory cost calculation incorrect for immediates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70802

            Bug ID: 70802
           Summary: IRA memory cost calculation incorrect for immediates
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The following code in ira-costs.c tries to improve the memory cost for
rematerializable loads. There are several issues with this though:

1. The memory cost can become negative, forcing a spill, which is known to
   cause incorrect code (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242)
2. The code only handles a subset of immediate loads, not all rematerializable
   values
3. The cost adjustment is not sufficient to make better decisions between
   allocating an immediate to a callee-save register and spilling a variable,
   or allocating the variable and rematerializing the immediate

As an example of (3), if there is only one callee-save register free, IRA will
use it to allocate the immediate rather than the variable:

float bad_alloc(float x)
{
  x += 3.0f;
  g();
  x *= 3.0f;
  return x;
}

With -O2 -fomit-frame-pointer -ffixed-d8 -ffixed-d9 -ffixed-d10 -ffixed-d11
-ffixed-d12 -ffixed-d13 -ffixed-d14:

        str     x30, [sp, -32]!
        str     d15, [sp, 8]
        fmov    s15, 3.0e+0
        fadd    s0, s0, s15
        str     s0, [sp, 28]
        bl      g
        ldr     s0, [sp, 28]
        fmul    s0, s0, s15
        ldr     d15, [sp, 8]
        ldr     x30, [sp], 32
        ret

  a0(r76,l0) costs: CALLER_SAVE_REGS:15000,15000 GENERAL_REGS:15000,15000
    FP_REGS:0,0 ALL_REGS:15000,15000 MEM:12000,12000
  a1(r73,l0) costs: CALLER_SAVE_REGS:1,1 GENERAL_REGS:1,1
    FP_REGS:0,0 ALL_REGS:1,1 MEM:8000,8000

The immediate value r76 is counted as 1 def and 2 uses, so a memory cost of
12000, while r73 has 1 def and 1 use, so a memory cost of 8000. However the
worst-case rematerialization cost of r76 would be 2 moves, one of which
already exists of course, so the memory cost should have been 4000...

ira-costs.c, ~line 1458:

      if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
          && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
          && ((MEM_P (XEXP (note, 0))
               && !side_effects_p (SET_SRC (set)))
              || (CONSTANT_P (XEXP (note, 0))
                  && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                                    XEXP (note, 0))
                  && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
          && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set))))
        {
          enum reg_class cl = GENERAL_REGS;
          rtx reg = SET_DEST (set);
          int num = COST_INDEX (REGNO (reg));

          COSTS (costs, num)->mem_cost
            -= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
          record_address_regs (GET_MODE (SET_SRC (set)),
                               MEM_ADDR_SPACE (SET_SRC (set)),
                               XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
                               frequency * 2);
          counted_mem = true;
        }
[Bug middle-end/70861] New: Improve code generation of switch tables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861

            Bug ID: 70861
           Summary: Improve code generation of switch tables
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

GCC uses a very basic check to determine whether to use a switch table. A
simple example from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11823 still
generates a huge table with mostly default entries with -O2:

int i;

int func(int a)
{
  switch(a)
  {
    case 0: i = 20; break;
    case 1: i = 50; break;
    case 2: i = 29; break;
    case 3: i = 20; break;
    case 4: i = 50; break;
    case 5: i = 29; break;
    case 6: i = 20; break;
    case 7: i = 50; break;
    case 8: i = 29; break;
    case 9: i = 79; break;
    case 110: i = 27; break;
    default: i = 77; break;
  }
  return i;
}

This shows several issues:

1. The density calculation is not adjustable depending on the expected size of
   switch table entries (which depends on the target).
2. A table may contain not only 90% default entries, but they can be
   consecutive as well. To avoid this the maximum number of default cases
   should be limited to say 3x the average gap between real cases.
3. There is no reduction in minimum required density for larger switches - the
   wastage becomes quite significant for larger switches and targets that use
   4 bytes per table entry.
4. Dense regions and outlier values are not split off and handled separately.
[Bug middle-end/70861] Improve code generation of switch tables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861

--- Comment #3 from Wilco ---
(In reply to Andrew Pinski from comment #2)
> Note I think if we had gotos instead of assignment here we should do the
> similar thing for the switch table itself.

Absolutely, that was my point.

> Note also the assignment to i is getting in the way for the switch to
> constant table form.

In SSA form you can sink the assignments into a shared block, so it should be
able to deal with an identical assignment to a global, pointer, array or even
returning directly. Also it's common that the default or a few other cases are
doing something different, so those need to be pulled out separately.

However that's icing on the cake - the key issue is that GCC doesn't even do
the basic stuff yet. Most compilers are able to split off cases with high
profile counts, dense switch tables, ranges that all go to the same block etc.
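As an illustration of the constant-table form (a hand-written sketch, not GCC
output), the example from comment #0 could be lowered to a small data table
plus a range check, with the outlier case 110 and the default peeled off:

int i;

int func_lowered (int a)
{
  /* Values for cases 0..9 of the example; case 110 and default are separate.  */
  static const unsigned char values[10] =
    { 20, 50, 29, 20, 50, 29, 20, 50, 29, 79 };

  if ((unsigned) a < 10)
    i = values[a];
  else if (a == 110)
    i = 27;
  else
    i = 77;
  return i;
}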
[Bug rtl-optimization/70946] New: Bad interaction between IVOpt and loop unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70946

            Bug ID: 70946
           Summary: Bad interaction between IVOpt and loop unrolling
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

IVOpt chooses between using indexing for induction variables or incrementing
pointers. Due to the way loop unrolling works, a decision that is optimal if
unrolling is disabled may become very non-optimal with unrolling. Below are
simple examples that show how the choice to use indexing can become a very bad
idea when unrolled, while using offset addressing leads to very decent code.
To improve this we either need to teach IVOpt about unrolling (eg. prioritise
base+offset addressing) or add a tree unroller that unrolls small inner loops
before IVOpt.

void loop1 (int *p, int *q, int i)
{
  for (i = 0; i < 1000; i++)
    p[i] = q[i] + 1;
}

void loop2 (int *p, int i)
{
  for (i = 0; i < 1000; i++)
    p[i] = p[i] + 1;
}

On AArch64 with -O2 -funroll-loops this gives:

loop1:
        mov     x2, 0
        .p2align 2
.L41:
        ldr     w4, [x1, x2]
        add     x3, x2, 4
        add     x10, x2, 8
        add     x9, x2, 12
        add     w5, w4, 1
        str     w5, [x0, x2]
        add     x8, x2, 16
        add     x7, x2, 20
        add     x6, x2, 24
        add     x11, x2, 28
        ldr     w12, [x1, x3]
        add     x2, x2, 32
        cmp     x2, 4000
        add     w13, w12, 1
        str     w13, [x0, x3]
        ldr     w14, [x1, x10]
        add     w15, w14, 1
        str     w15, [x0, x10]
        ldr     w16, [x1, x9]
        add     w17, w16, 1
        str     w17, [x0, x9]
        ldr     w18, [x1, x8]
        add     w4, w18, 1
        str     w4, [x0, x8]
        ldr     w3, [x1, x7]
        add     w10, w3, 1
        str     w10, [x0, x7]
        ldr     w9, [x1, x6]
        add     w5, w9, 1
        str     w5, [x0, x6]
        ldr     w8, [x1, x11]
        add     w7, w8, 1
        str     w7, [x0, x11]
        bne     .L41
        ret

loop2:
        add     x6, x0, 4000
        .p2align 2
.L51:
        mov     x1, x0
        ldr     w2, [x0]
        add     x0, x0, 32
        add     w3, w2, 1
        cmp     x0, x6
        str     w3, [x1], 4
        ldr     w4, [x0, -28]
        add     w5, w4, 1
        str     w5, [x0, -28]
        ldr     w7, [x1, 4]
        add     w8, w7, 1
        str     w8, [x1, 4]
        ldp     w9, w10, [x0, -20]
        ldp     w11, w12, [x0, -12]
        add     w14, w9, 1
        ldr     w13, [x0, -4]
        add     w15, w10, 1
        add     w16, w11, 1
        add     w17, w12, 1
        add     w18, w13, 1
        stp     w14, w15, [x0, -20]
        stp     w16, w17, [x0, -12]
        str     w18, [x0, -4]
        bne     .L51
        ret
[Bug rtl-optimization/70946] Bad interaction between IVOpt and loop unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70946

--- Comment #1 from Wilco ---
PR36712 seems related to this.
[Bug rtl-optimization/70961] New: Regrename ignores preferred_rename_class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961

            Bug ID: 70961
           Summary: Regrename ignores preferred_rename_class
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

When deciding which register to use, regrename.c calls the target function
preferred_rename_class. However in pass 2 in find_rename_reg it then just
ignores this preference. This results in significantly increased codesize on
targets which prefer a subset of allocatable registers in order to use smaller
instructions.

Also the computed super_class appears to be the union of all uses and defs
instead of the intersection. This should be the intersection as that is the
set of registers that all uses and defs support. If the preferred class
doesn't result in a valid rename then it could search a wider class, but then
it would need to check that the size of the newly selected patterns does not
increase.
[Bug rtl-optimization/70961] Regrename ignores preferred_rename_class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961

--- Comment #3 from Wilco ---
(In reply to Eric Botcazou from comment #2)
> Pass #2 ignores it since the preference simply couldn't be honored.

In which case it should not rename that chain rather than just ignore the
preference (and a preference of NO_REGS should probably also block renaming).

> > The super_class has nothing to do with the class that is searched for
> > renaming registers though, it's just the info passed to the back-end to
> > compute this class.  For example, on the ARM, the preferred_class will be
> > LO_REGS for any chain of GENERAL_REGS.

Actually for a chain of GENERAL_REGS the preferred class should be
GENERAL_REGS - using LO_REGS in that case would be counterproductive. You'd
prefer LO_REGS only when *some* of the patterns in a chain are LO_REGS. But
since the callback only ever sees the union of all classes in a chain, I don't
think it could make the correct decision.

> As a matter of fact, that might not be true, something like
>
> +  if (TARGET_THUMB2 && reg_class_subset_p (rclass, GENERAL_REGS))
>      return LO_REGS;

I tried something similar, as well as always returning LO_REGS or NO_REGS, but
this makes no difference at all due to pass 2 ignoring the preference...
[Bug rtl-optimization/70961] Regrename ignores preferred_rename_class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70961 --- Comment #5 from Wilco --- As for a simple example, Proc_4 in Dhrystone is a good one. With -O2 and -fno-rename-registers I get the following on Thumb-2:

00c8 <Proc_4>:
  c8:   b430            push    {r4, r5}
  ca:   f240 0300       movw    r3, #0
  ce:   f240 0400       movw    r4, #0
  d2:   f2c0 0300       movt    r3, #0
  d6:   f2c0 0400       movt    r4, #0
  da:   f240 0100       movw    r1, #0
  de:   681a            ldr     r2, [r3, #0]
  e0:   f2c0 0100       movt    r1, #0
  e4:   7824            ldrb    r4, [r4, #0]
  e6:   2542            movs    r5, #66         ; 0x42
  e8:   700d            strb    r5, [r1, #0]
  ea:   2c41            cmp     r4, #65         ; 0x41
  ec:   bf08            it      eq
  ee:   f042 0201       orreq.w r2, r2, #1
  f2:   601a            str     r2, [r3, #0]
  f4:   bc30            pop     {r4, r5}
  f6:   4770            bx      lr

With -frename-registers:

00c8 <Proc_4>:
  c8:   b430            push    {r4, r5}
  ca:   f240 0300       movw    r3, #0
  ce:   f240 0400       movw    r4, #0
  d2:   f2c0 0300       movt    r3, #0
  d6:   f2c0 0400       movt    r4, #0
  da:   f240 0100       movw    r1, #0
  de:   681a            ldr     r2, [r3, #0]
  e0:   f2c0 0100       movt    r1, #0
  e4:   f894 c000       ldrb.w  ip, [r4]
  e8:   2542            movs    r5, #66         ; 0x42
  ea:   700d            strb    r5, [r1, #0]
  ec:   f1bc 0f41       cmp.w   ip, #65         ; 0x41
  f0:   bf08            it      eq
  f2:   f042 0201       orreq.w r2, r2, #1
  f6:   601a            str     r2, [r3, #0]
  f8:   bc30            pop     {r4, r5}
  fa:   4770            bx      lr

So here it changed 2 16-bit instructions into 32-bit ones by changing R4 to IP. However there is no benefit in doing so as it doesn't remove a move or create additional scheduling opportunities (the first liverange of R4 has only a single use, so when it goes dead it is actually best to reuse R4 rather than use a different register).
[Bug rtl-optimization/71022] New: GCC prefers register moves over move immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71022 Bug ID: 71022 Summary: GCC prefers register moves over move immediate Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- When assigning the same immediate value to different registers, GCC will always CSE the immediate and emit a register move for subsequent uses. This creates unnecessary register dependencies and increases latency. When the cost of an immediate move is the same as a register move (which should be true for most targets), it should prefer the former. A register move is better only when the immediate requires multiple instructions or is larger with -Os. It's not obvious where this is best done. The various cprop phases before IRA do the right thing, but cse2 (which runs later) then undoes it. And cprop_hardreg doesn't appear to be able to deal with immediates.

int f1(int x)
{
  int y = 1, z = 1;
  while (x--)
    {
      y += z;
      z += x;
    }
  return y + z;
}

void g(float, float);
void f2(void)
{
  g(1.0, 1.0);
  g(3.3, 3.3);
}

On AArch64 I get:

f1:
        sub     w1, w0, #1
        cbz     w0, .L12
        mov     w0, 1
        mov     w2, w0          *** mov w2, 1
        .p2align 2
.L11:
        add     w2, w2, w0
        add     w0, w0, w1
        sub     w1, w1, #1
        cmn     w1, #1
        bne     .L11
        add     w0, w2, w0
        ret
.L12:
        mov     w0, 2
        ret

f2:
        fmov    s1, 1.0e+0
        str     x30, [sp, -16]!
        fmov    s0, s1          *** fmov s0, 1.0
        bl      g
        adrp    x0, .LC1
        ldr     x30, [sp], 16
        ldr     s1, [x0, #:lo12:.LC1]
        fmov    s0, s1          *** ldr s0, [x0, #:lo12:.LC1]
        b       g
[Bug tree-optimization/71026] New: Missing division optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 Bug ID: 71026 Summary: Missing division optimizations Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- With -Ofast GCC doesn't reassociate constant multiplies or negates away from divisors to allow for more reciprocal division optimizations. It is also possible to avoid divisions (or multiplies) involving immediates in comparisons that check for a positive/negative result.

float f1(float x, float y) { return x / (y * y); }      // -> x * (1/y) * (1/y)
float f2(float x, float y) { return x / (y * 3.0f); }   // -> (x/3) / y
float f3(float x, float y) { return x / -y; }           // -> (-x) / y
int f4(float x) { return (1.0f / x) < 0.0f; }           // -> x < 0.0f
int f5(float x) { return (x / 2.0f) <= 0.0f; }          // -> x <= 0.0f

A quick experiment shows the first transformation could remove almost 100 divisions from SPEC2006, the second 50. The first transformation is only useful if there is at least one other division by y, so likely best done in the division reciprocal optimization phase.
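A sketch of why the first transformation only pays off when y divides more than one value (my illustration, not from the report; g1 and g1_opt are invented names): both divisions can then share a single reciprocal, which is valid under -Ofast.

float g1 (float x, float w, float y)
{
  return x / (y * y) + w / y;   /* two divisions involving y */
}

/* After reassociating y away from the divisor and reusing 1/y: */
float g1_opt (float x, float w, float y)
{
  float r = 1.0f / y;
  return x * r * r + w * r;     /* one division, a few multiplies */
}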
[Bug rtl-optimization/71022] GCC prefers register moves over move immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71022 --- Comment #2 from Wilco --- (In reply to Richard Biener from comment #1) > IRA might choose to do this as part of life-range splitting/shortening. Note > that reg-reg moves may be cheaper code-size wise (like on CISC archs with > non-fixed insn lengths). RTL cost should tell you, of course. The cases I spotted are typically conflicting liveranges that just happen to share the same initialization value (0 being the most common). This transformation can be done in a generic phase using RTL costs, the moves even have the correct regnotes.
[Bug middle-end/71443] New: [7 regression] test case gcc.dg/plugin/must-tail-call-2.c reports error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71443 Bug ID: 71443 Summary: [7 regression] test case gcc.dg/plugin/must-tail-call-2.c reports error Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- There are 2 new failures in the tail-call-2.c test on recent trunk builds:

FAIL: gcc.dg/plugin/must-tail-call-2.c -fplugin=./must_tail_call_plugin.so (test for errors, line 32)
FAIL: gcc.dg/plugin/must-tail-call-2.c -fplugin=./must_tail_call_plugin.so (test for excess errors)

The error reported is:

/home/wdijkstr/gcc/gcc/testsuite/gcc.dg/plugin/must-tail-call-2.c: In function 'test_2_caller':
/home/wdijkstr/gcc/gcc/testsuite/gcc.dg/plugin/must-tail-call-2.c:32:10: error: cannot tail-call: argument must be passed by copying

The argument is a large structure which is passed by reference on AArch64. So it looks like the test needs to allow for this possibility. This appears related to PR71293 for Power.
[Bug target/71951] libgcc_s built with -fomit-frame-pointer on aarch64 is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71951 --- Comment #11 from Wilco --- (In reply to Icenowy Zheng from comment #10) > In my environment (glibc 2.25, and both the building scripts of glibc and > gcc have -fomit-frame-pointer automatically enabled), this bug is not fully > resolved yet. > > With GCC upgraded to 6.4.0, GDB debugger started to work correctly. (With > GCC 6.3.0 GDB cannot even work and segfault at unwind code in libgcc). > > However, if I still build GCC with -fomit-frame-pointer in CFLAGS, the > backtrace() function of glibc cannot work, and segfault at line 240 of > libgcc/unwind-dw2.c . > > By reading the source code, I think the unwind code is still trying to get > CFA from the register x29, and when debugging I found that the x29 register > in the unwind context is 0 (because of -fomit-frame-pointer), so line 240 is > dereferencing a NULL pointer, so it segfaulted. > > Maybe the behavior that accessing x29 register to get CFA is not correct? Well if everything built with -fomit-frame-pointer then it is definitely wrong to read x29. Can you give more info similar to comment #3?
[Bug middle-end/78468] [8 regression] libgomp.c/reduction-10.c and many more FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #38 from Wilco --- (In reply to Rainer Orth from comment #37) > This patch > > 2017-09-05 Wilco Dijkstra > > * explow.c (get_dynamic_stack_size): Improve dynamic alignment. > > brought the exact same set of failures back on sparc-sun-solaris2.11. > > Rainer The existing alloca code relies on STACK_BOUNDARY being set correctly. Has the value been fixed already for the OS variants mentioned? If stack alignment can't be guaranteed by OS/runtime/prolog, STACK_BOUNDARY should be 8.
[Bug middle-end/78468] [8 regression] libgomp.c/reduction-10.c and many more FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468 --- Comment #40 from Wilco --- (In reply to Eric Botcazou from comment #39) > > The existing alloca code relies on STACK_BOUNDARY being set correctly. Has > > the value been fixed already for the OS variants mentioned? If stack > > alignment can't be guaranteed by OS/runtime/prolog, STACK_BOUNDARY should be > > 8. > > No, no, no, please read the ??? note I put in emit-rtl.c. STACK_BOUNDARY is > correct, the problem is the declared alignment of the virtual registers. If you cannot guarantee the alignment of the pointers to STACK_BOUNDARY then STACK_BOUNDARY is incorrect. GCC uses the STACK_BOUNDARY guarantee in optimizations so it is essential to get this right if you want correct code generation.
[Bug middle-end/77484] [6/7 Regression] Static branch predictor causes ~6-8% regression of SPEC2000 GAP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484 --- Comment #29 from Wilco --- (In reply to Jan Hubicka from comment #28) > > On SPEC2000 the latest changes look good, compared to the old predictor gap > > improved by 10% and INT/FP by 0.8%/0.6%. I'll run SPEC2006 tonight. > > It is rather surprising you are seeing such large changes for one branch > predictor > change. Is most of it really comming just from the bb-reorder changes? On > x86 the > effect is mostly within noise and on Itanium Gap improve by 2-3%. > It may be interesting to experiment with reorderin and prediction more on > this target. When I looked at gap at the time, the main change was the reordering of a few if statements in several hot functions. Incorrect block frequencies also change register allocation in a bad way, but I didn't notice anything obvious in gap. And many optimizations are being disabled on blocks with an incorrect frequency - this happens all over the place and is the issue causing the huge Coremark regression. I could do some experiments but I believe the key underlying problem is that GCC treats the block frequencies as accurate when they are really very vague estimates (often incorrect) and so should only be used to break ties. In fact I would claim that even modelling if-statements as a balanced 50/50 is incorrect. It suggests that a block that is guarded by multiple if-statements handling exceptional cases is much less important than the very same block that isn't, even if they are both always executed. Without profile data providing actual frequencies we should not optimize the outer block for speed and the inner block for size.
[Bug middle-end/77484] [6/7 Regression] Static branch predictor causes ~6-8% regression of SPEC2000 GAP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484 --- Comment #31 from Wilco --- (In reply to Jan Hubicka from comment #30) > > > > When I looked at gap at the time, the main change was the reordering of a > > few > > if statements in several hot functions. Incorrect block frequencies also > > change > > register allocation in a bad way, but I didn't notice anything obvious in > > gap. > > And many optimizations are being disabled on blocks with an incorrect > > frequency > > - this happens all over the place and is the issue causing the huge Coremark > > regression. > > This is the issue with jump threading code no longer sanely updating profile, > right? I will try to find time to look into it this week. I don't know the exact details but James proved that the blocks are incorrectly assumed cold so part of the optimization doesn't trigger as expected. I'm not sure whether that is because the frequencies got too low, set incorrectly or not set at all. > > I could do some experiments but I believe the key underlying problem is that > > GCC treats the block frequencies as accurate when they are really very vague > > estimates (often incorrect) and so should only be used to break ties. > > > > In fact I would claim that even modelling if-statements as a balanced 50/50 > > is > > incorrect. It suggests that a block that is guarded by multiple > > if-statements > > handling exceptional cases is much less important than the very same block > > that > > isn't, even if they are both always executed. Without profile data providing > > actual frequencies we should not optimize the outer block for speed and the > > inner block for size. > > There are --param options to control this. They was originally tuned based on > Spec2000 and x86_64 scores (in GCC 3.x timeframe). if you can get resonable > data that they are not working very well anymore (or for ARM), we could try > to > tune them better. > > I have WIP patches to get the propagation bit more fine grained and > propagate i.e. > info if BB is reachable only bo known to be cold path (such that one that has > EH edge on it). This may make the logic bit more reliable. I'll have a look, but I think the key is to think in terms of block importance (from cold to hot). Apart from highly skewed cases (eg. exception edges or loops), most blocks should be equally important to optimize.
[Bug target/71951] libgcc_s built with -fomit-frame-pointer on aarch64 is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71951 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #8 from Wilco --- Likely related to PR77455 which was backported to GCC6. Unwinding most definitely doesn't require a frame pointer (on AArch64 the frame pointer is completely unused unless you use eg. alloca).
[Bug target/81357] Extra mov for zero extend of add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81357 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #7 from Wilco --- (In reply to Qing Zhao from comment #5)
> Not sure why when -fschedule-insns is ON, the destination of the add insns
> becomes W1?

        add     w1, w0, 1
        mov     w0, w1
        uxtw    x1, w1

That's because the move assigns the return value which overwrites w0. However the uxtw still needs to read the result of the addition. So the register allocator is forced to use w1 as w0 cannot be used. I don't think there is an easy fix for this example. The compiler believes there are 2 distinct values so it uses 2 registers irrespective of the order of the mov and uxtw.
[Bug target/82439] Missing (x | y) == x simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82439 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #2 from Wilco --- (In reply to Andrew Pinski from comment #1) > "(x | y) == x" is simpler than "(y & ~x) == 0" on the tree level. 2 gimple > vs 3. > So converting it on the gimple level is not going to happen. Indeed, so not match.pd, but maybe target.pd could work. There is no BIC tree/RTL operator, neither is there a target independent way to ask whether a target supports BIC.
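For reference, these are the two forms being compared (illustration only; the function names are made up, and whether the second form wins depends on the target having an and-with-complement instruction such as BIC):

int cmp_or  (unsigned x, unsigned y) { return (x | y) == x; }   /* 2 gimple operations */
int cmp_bic (unsigned x, unsigned y) { return (y & ~x) == 0; }  /* 3 gimple operations, but maps onto
                                                                   a BIC-style instruction where available */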
[Bug middle-end/82479] missing popcount builtin detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82479 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #6 from Wilco --- (In reply to Andrew Pinski from comment #5) > Was added to LLVM back in 2012: > http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20121119/156272.html > > Again I don't know how useful it is compared to the compile time that it > would take. I'd be more worried that the LLVM sequence isn't better in common usage. Given it has a fixed high latency, you need quite a few set bits before it's faster.
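For context, this is the classic loop idiom such detection rewrites into __builtin_popcount (a common illustration; I am assuming it matches what the referenced LLVM patch handles). Its cost is proportional to the number of set bits, which is the point of the comparison above against a fixed-latency popcount sequence:

int popcount_loop (unsigned x)
{
  int count = 0;
  while (x)
    {
      x &= x - 1;   /* clears the lowest set bit */
      count++;
    }
  return count;
}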
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #4 from Wilco --- (In reply to Qing Zhao from comment #3)
> with the latest upstream GCC, we got the following assembly for the testing
> case with -O2 on aarch64:
>
> t1:
>         adrp    x1, .LC0
>         add     x1, x1, :lo12:.LC0
>         b       strcmp
>
> t2:
>         adrp    x1, .LC0
>         add     x1, x1, :lo12:.LC0
>         b       strcmp

Yes the inlining was recently removed from GLIBC since the goal is to inline in GCC.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #8 from Wilco --- > /home/qinzhao/Install/latest/bin/gcc -O2 t_p_1.c t_p.c > non-inlined version > 20.84user 0.00system 0:20.83elapsed 100%CPU (0avgtext+0avgdata > 360maxresident)k > 0inputs+0outputs (0major+135minor)pagefaults 0swaps > > From the data, we can see the inlined version of strcmp (by glibc) is much > slower than the direct call to strcmp. (this is for size 2) > I am using GCC farm machine gcc116: This result doesn't make sense - it looks like GCC is moving the strcmp call in the 2nd case as a loop invariant, so you're just measuring a loop with just a subtract and orr instruction...
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #9 from Wilco --- (In reply to Qing Zhao from comment #7) str(n)cmp with a constant string can be changed into memcmp if the string has a known alignment or is an array of known size. We should check the common cases are implemented. Given it takes just 3 instructions per character on almost all targets, it is reasonable to inline strcmp with a small string if it cannot be changed into memcmp. This will be faster than calling strcmp due to the overhead of the call, the plt redirection and alignment checks. Note also that many strcmp implementations will do a byte-by-byte check if the strings are not aligned.
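A rough C-level sketch of the kind of inline expansion meant here, for strcmp (s, "ab") != 0 when it cannot be turned into memcmp (my illustration, not GCC output; the helper name is invented). Per character this is roughly a byte load, a compare and a branch, matching the "3 instructions per character" estimate above:

static inline int strcmp_ab_ne (const char *s)
{
  if (s[0] != 'a') return 1;    /* load, compare, branch */
  if (s[1] != 'b') return 1;
  return s[2] != '\0';          /* the null terminator must match too */
}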
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #16 from Wilco --- (In reply to Qing Zhao from comment #15)
> (In reply to Wilco from comment 14)
> > The only reason we have to do a character by character comparison is
> > because we cannot read beyond the end of a string. However when we know
> > the size or alignment we can safely process a string one word at a time.
>
> is it possible that “NULL_terminator” is in the middle of the string even
> though we know the size or alignment? for example:
>
> const char s[8] = “abcd\0abc”; // null byte in the middle of the string
> int f2(void) { return __builtin_strcmp(s, "abc") != 0; }
> int f3(void) { return __builtin_strcmp(s, “abc”); }
>
> can either of the above f2 or f3 been optimized to memcmp? seems not.

You never get to the null byte as the memcmp only compares strlen("abc") + 1 characters. However do you mean an input string which is shorter than the constant string? That's fine as this will compare not-equal in the memcmp.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #18 from Wilco --- (In reply to Qing Zhao from comment #17)
> (In reply to Wilco from comment #16)
>
> >> const char s[8] = “abcd\0abc”; // null byte in the middle of the string
> >> int f2(void) { return __builtin_strcmp(s, "abc") != 0; }
> >> int f3(void) { return __builtin_strcmp(s, “abc”); }
> >>
> >> can either of the above f2 or f3 been optimized to memcmp? seems not.
> >
> > You never get to the null byte as the memcmp only compares
> > strlen("abc") + 1 characters.
>
> Yes, this is correct for memcmp, but not for strcmp. strcmp will get to the
> null byte. as a result,
>
> const char s[8] = “abcd\0abc”;
> strcmp (s, “abc”) != 0.     // s = “abcd", which is != “abc"
> strncmp (s, “abc”, 3) == 0
> memcmp(s, “abc”, 3) == 0
>
> So, strcmp cannot optimized to memcmp

No, that should be:

strcmp (s, “abc”) != 0
strncmp (s, “abc”, 4) != 0
memcmp (s, “abc”, 4) != 0

You need to compare the null terminator as well.

> > However do you mean an input string which is shorter than the
> > constant string? That's fine as this will compare not-equal in the memcmp.
>
> for the input string is shorter than the constant string, for example:
>
> const char s[8] = “ab\0\0abcd”;
> strcmp (s, “abc”) != 0
> strncmp (s, “abc”, 3) != 0
> memcmp (s, “abc”, 3) != 0
>
> In a summary, since it's NOT easy for the compiler to know where is the
> “Null_terminator” in the string, strcmp is NOT reasonable to be optimized
> to memcmp whenever its result is used to compare with zero or not.

The compiler knows where the null terminator is in the constant string so it can easily figure out when it is legal as well as faster than doing a byte by byte expansion of strcmp:

strcmp (s, STR) -> memcmp (s, STR, strlen (STR) + 1)
  iff max(sizeof_array(s), alignment(s)) > strlen (STR).

> But for strncmp, if the result is to compare with zero, it might be
> reasonable to optimized it to the corresponding memcmp, i.e
>
> strncmp (s, “abc”, 3) != 0
>
> could be optimized to
>
> memcmp (s, “abc”, 3) != 0

If the strncmp size is smaller than the strlen of the constant string (and alignment is right), yes. But strncmp (s, "abc", C) is equivalent to strcmp (s, "abc") if C >= 4.
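To make the rule above concrete (a hedged sketch, not GCC code; buf and the function names are invented for the example): with an 8-byte array and the literal "abc", sizeof_array(buf) = 8 > strlen("abc") = 3, so the comparison may read strlen("abc") + 1 = 4 bytes, which includes the null terminator:

const char buf[8] = "abcd";

int cmp_strcmp (void)
{
  return __builtin_strcmp (buf, "abc") != 0;
}

/* Legal and faster equivalent under the condition above: compare 4 bytes,
   i.e. "abc" plus its terminator.  */
int cmp_memcmp (void)
{
  return __builtin_memcmp (buf, "abc", 4) != 0;
}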
[Bug target/77455] New: [AArch64] eh_return implementation fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77455 Bug ID: 77455 Summary: [AArch64] eh_return implementation fails Product: gcc Version: 4.8.4 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- The __builtin_eh_return implementation on AArch64 generates incorrect code for many cases due to using an incorrect offset/pointer when writing the new return address to the stack. Also optimizations may remove the write due to a missing scheduling barrier. As a result in most cases eh_return does not work properly.
[Bug target/77455] [AArch64] eh_return implementation fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77455 Wilco changed: What|Removed |Added Target||AArch64 Known to fail||4.8.4 --- Comment #1 from Wilco --- See https://gcc.gnu.org/ml/gcc-patches/2016-09/msg00077.html for more details, a simpler reimplementation and a testcase.
[Bug tree-optimization/66946] Spurious uninitialized warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66946 Wilco changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |INVALID --- Comment #6 from Wilco --- (In reply to Manuel López-Ibáñez from comment #5) > Still valid? No, I can't trigger it anymore, so marking as resolved.
[Bug middle-end/77484] New: Static branch predictor causes ~6-8% regression of SPEC2000 GAP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77484 Bug ID: 77484 Summary: Static branch predictor causes ~6-8% regression of SPEC2000 GAP Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- Changes in the static branch predictor (around August last year) caused regressions on SPEC2000. The PRED_CALL predictor causes GAP to regress by 6-8% on AArch64, and this hasn't been fixed on trunk. With this predictor turned off, INT is 0.6% faster and FP 0.4%. The reason is that the predictor causes calls that are guarded by if-statements to be placed at the end of the function. For Gap this is bad as it often executes several such statements in a row, resulting in 2 extra taken branches and additional I-cache misses per if-statement. So it seems that on average this prediction makes things worse. Overall the static prediction and -freorder-blocks provide a benefit. However does the gain of each static prediction being correct outweigh the cost of the prediction being incorrect? Has this been measured for each of the static predictors across multiple targets?
[Bug tree-optimization/65068] Improve rewriting for address type induction variables in IVOPT
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65068 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #3 from Wilco --- Note gcc.target/aarch64/ldp_vec_64_1.c shows the same issue. Same story there, most cores produce inefficient code, only -mcpu=vulcan gets close:

foo:
        add     x1, x1, 8
        add     x2, x0, 8192
        .p2align 3
.L2:
        ldr     d0, [x1, -8]
        ldr     d1, [x1], 16
        add     v0.2s, v0.2s, v1.2s
        str     d0, [x0], 16
        cmp     x2, x0
        bne     .L2
        ret

That still shows an odd pre-increment in the loop header, which blocks the use of ldp. Also in all cases aarch64_legitimize_address is called with offset 32760 on V2SI - this offset does not occur anywhere in the source, so it should not matter how we split it. However what we do for that case affects IVOpt, which is clearly a bug.
[Bug middle-end/77568] New: [7 regression] CSE/PRE/Hoisting blocks common instruction contractions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568 Bug ID: 77568 Summary: [7 regression] CSE/PRE/Hoisting blocks common instruction contractions Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- The recently introduced code hoisting aggressively moves common subexpressions that might otherwise be mergeable with other operations. This caused a large regression in one benchmark. A simple reduced test shows the issue:

float f(float x, float y, float z, int a)
{
  if (a > 100)
    x += y * z;
  else
    x -= y * z;
  return x;
}

This now produces on AArch64:

f:
        fmul    s2, s1, s2
        cmp     w0, 100
        fadd    s1, s0, s2
        fsub    s0, s0, s2
        fcsel   s0, s0, s1, le
        ret

Note the issue is not limited to hoisting, CSE/PRE cause similar issues:

void g(int, int);
int f2(int x)
{
  g(x, x+1);
  g(x, x+1);
  return x+1;
}

f2:
        stp     x29, x30, [sp, -32]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        add     w19, w0, 1
        mov     w20, w0
        mov     w1, w19
        bl      g
        mov     w1, w19
        mov     w0, w20
        bl      g
        mov     w0, w19
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret

Given x+1 is used as a function argument, there is no benefit in making it available as a CSE after each call - repeating the addition is cheaper than using an extra callee-save and copying it several times.

This shows a similar issue for bit tests. Most targets support ANDS or bit test as a single instruction (or even bit test+branch), so CSEing the (x & C) actually makes things worse:

void f3(char *p, int x)
{
  if (x & 1) p[0] = 0;
  if (x & 2) p[1] = 0;
  if (x & 4) p[2] = 0;
  if (x & 8) p[2] = 0;
  g(0,0);
  if (x & 1) p[3] = 0;
  if (x & 2) p[4] = 0;
  if (x & 4) p[5] = 0;
  if (x & 8) p[6] = 0;
}

This uses 4 callee-saves to hold the (x & C) CSEs. Doing several such bit tests in a more complex function means you quickly run out of registers... Given it would be much harder to undo these CSEs at RTL level (combine does most contractions but can only do it intra-block), would it be reasonable to block CSEs for these special cases?
[Bug middle-end/77568] [7 regression] CSE/PRE/Hoisting blocks common instruction contractions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568 --- Comment #3 from Wilco --- (In reply to Andrew Pinski from comment #1) > I think this is just a pass ordering issue. We create fmas after PRE. > Maybe we should do it both before and after ... > Or enhance the pass which produces FMA to walk through to another bb ... FMAs are not created in the tree, Expand can do simple cases, and Combine finds other cases. Given more and more targets support FMA, there is certainly an argument for adding an FMA tree operator.
[Bug middle-end/77568] [7 regression] CSE/PRE/Hoisting blocks common instruction contractions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568 --- Comment #5 from Wilco --- (In reply to Andrew Pinski from comment #2) > Note there are two different issues here. Well they are 3 examples of the same underlying issue - don't do a CSE when it's not profitable. How they are resolved might be different of course. > One for the FMA issue which can/should be solved at the tree level (and that > is a regression). > > The other is the CSE, we should be able to uncse (rematization) during > register allocation to reduce register pressure; I think this has been filed > already. A more general rematerialization pass would certainly be very useful for lots of cases. However it wouldn't improve any of my examples as they are not about spilling (when you run out of registers it is more efficient to do more work to recompute values rather than spill and reload them).
[Bug middle-end/77580] New: Improve devirtualization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77580 Bug ID: 77580 Summary: Improve devirtualization Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wdijkstr at arm dot com Target Milestone: --- A commonly used benchmark contains a hot loop which calls one of 2 virtual functions via a static variable which is set just before. A reduced example is:

int f1(int x) { return x + 1; }
int f2(int x) { return x + 2; }

static int (*virt)(int);

int f(int *p, int n, int x)
{
  int i;
  if (x > 10)
    virt = f1;
  else
    virt = f2;
  for (i = 0; i < n; i++)
    p[i] = virt (i);
}

This is obviously very stupid code, however GCC could do better. virt is not address-taken, neither f1 nor f2 make a call, so virt is effectively a local variable. So the loop could be transformed to p[i] = (x > 10) ? f1 (i) : f2 (i); to enable inlining after which the if can be lifted:

int f_opt(int *p, int n, int x)
{
  int i;
  if (x > 10)
    virt = f1;
  else
    virt = f2;
  if (x > 10)
    for (i = 0; i < n; i++)
      p[i] = f1 (i);
  else
    for (i = 0; i < n; i++)
      p[i] = f2 (i);
}
[Bug tree-optimization/32650] Convert p+strlen(p) to strchr(p, '\0') if profitable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=32650 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #2 from Wilco --- This is a codesize optimization, as strchr is significantly slower than strlen - see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61056 which changes strchr into strlen.
[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #2 from Wilco --- (In reply to Bernd Edlinger from comment #1) > some background about this bug can be found here: > > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html The di3_neon pattern does not constrain the input not to overlap with the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code does not handle partial overlaps. However it is feasible to fix that by swapping the order for the various cases.
[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041 --- Comment #4 from Wilco --- (In reply to Bernd Edlinger from comment #3)
> (In reply to Wilco from comment #2)
> > (In reply to Bernd Edlinger from comment #1)
> > > some background about this bug can be found here:
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> >
> > The di3_neon pattern does not constrain the input not to overlap with
> > the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> > does not handle partial overlaps. However it is feasible to fix that by
> > swapping the order for the various cases.
>
> from di3_neon:
>
>   if (INTVAL (operands[2]) < 1)
>     {
>       emit_insn (gen_movdi (operands[0], operands[1]));
>       DONE;
>     }
>
> Will the movdi pattern work with partial overlaps?
> It does basically this:
>
>   emit_move_insn (gen_lowpart (SImode, operands[0]),
>                   gen_lowpart (SImode, operands[1]));
>   emit_move_insn (gen_highpart (SImode, operands[0]),
>                   gen_highpart (SImode, operands[1]));

I think it's OK - that code only triggers if a movdi has a physical register that is not a valid DI register which is not the case we're dealing with. movdi has a split that does check for partial overlap around line 5900 in arm.md:

  /* Handle a partial overlap.  */
  if (rtx_equal_p (operands[0], operands[3]))
    ...

However dealing with partial overlaps is complex so maybe the best option would be to add alternatives to di3_neon to either allow full overlap "r 0 X X X" or no overlap "&r r X X X". The shift code works with full overlap.
[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041 --- Comment #8 from Wilco --- (In reply to Bernd Edlinger from comment #7) > (In reply to Richard Earnshaw from comment #6) > > (In reply to Bernd Edlinger from comment #5) > > > (In reply to Wilco from comment #4) > > > > However dealing with partial overlaps is complex so maybe the best > > > > option > > > > would be to add alternatives to di3_neon to either allow full > > > > overlap > > > > "r 0 X X X" or no overlap "&r r X X X". The shift code works with full > > > > overlap. > > > > > > That sounds like a good idea. > > > > > > Then this condition in di3_neon could go away too: > > > > > > && (!reg_overlap_mentioned_p (operands[0], operands[1]) > > > || REGNO (operands[0]) == REGNO (operands[1]))) > > > > Note that we don't want to restrict complete overlaps, only partial > > overlaps. Restricting complete overlaps leads to significant increase in > > register pressure and a lot of redundant copying. > > Yes. > > That is Wilco's idea: instead of =r 0r X X X > use =r 0 X X X and =&r r X X X, that should ensure that > no partial overlap happens, just full overlap or nothing. > > That's what arm_emit_coreregs_64bit_shift > and arm_ashldi3_1bit can handle. > > Who will do it? I've got a patch that fixes it, it's being tested. While looking at how DI mode operations get expanded, I noticed there is a CQ issue with your shift change. Shifts that are expanded early now use extra registers due to the DI mode write of zero. Given all other DI mode operations are expanded after reload, it may be better to do the same for shifts too.
[Bug target/77308] surprisingly large stack usage for sha512 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308 --- Comment #12 from Wilco --- It looks like we need a different approach, I've seen the extra SETs use up more registers in some cases, and in other cases being optimized away early on... Doing shift expansion at the same time as all other DI mode operations should result in the same stack size as -mfpu=neon. However that's still well behind Thumb-1, and I would expect ARM/Thumb-2 to beat Thumb-1 easily with 6 extra registers.

The spill code for Thumb-2 seems incorrect:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
        (plus:SI (reg/f:SI 13 sp)
            (const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
     (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
        (mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64])) sha512.c:147 170 {*arm_movdi}
     (nil))

LDRD has a range of 1020 on Thumb-2 so I would expect this to be a single instruction.
[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041 --- Comment #11 from Wilco --- (In reply to ktkachov from comment #10) > Confirmed then. Wilco, if you're working on this can you please assign it to > yourself? Unfortunately the form doesn't allow me to do anything with the headers...
[Bug target/77308] surprisingly large stack usage for sha512 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308 --- Comment #14 from Wilco --- (In reply to Bernd Edlinger from comment #13) > I am still trying to understand why thumb1 seems to outperform thumb2. > > Obviously thumb1 does not have the shiftdi3 pattern, > but even if I remove these from thumb2, the result is still > not par with thumb2. Apparently other patterns still produce di > values that are not enabled with thumb1, they are > xordi3 and anddi3, these are often used. Then there is > adddi3 that is enabled in thumb1 and thumb2, I also disabled > this one, and now the sha512 gets down to inclredible 1152 > bytes frame (-Os -march=armv7 -mthumb -float-abi=soft): > > I know this is a hack, but 1K stack is what we should expect... > > --- arm.md 2016-10-25 19:54:16.425736721 +0200 > +++ arm.md.orig 2016-10-17 19:46:59.0 +0200 > @@ -448,7 +448,7 @@ > (plus:DI (match_operand:DI 1 "s_register_operand" "") >(match_operand:DI 2 "arm_adddi_operand" ""))) > (clobber (reg:CC CC_REGNUM))])] > - "TARGET_EITHER && !TARGET_THUMB2" > + "TARGET_EITHER" So you're actually turning the these instructions off for Thumb-2? What does it do instead then? Do the number of instructions go down? I noticed that with or without -mfpu=neon, using -marm is significantly smaller than -mthumb. Most of the extra instructions appear to be moves, which means something is wrong (I would expect Thumb-2 to do better as it supports LDRD with larger offsets than ARM).
[Bug target/77308] surprisingly large stack usage for sha512 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308 --- Comment #32 from Wilco --- (In reply to Bernd Edlinger from comment #31) > Sure, combine cant help, especially because it runs before split1. > > But I wondered why this peephole2 is not enabled: > > (define_peephole2 ; ldrd > [(set (match_operand:SI 0 "arm_general_register_operand" "") > (match_operand:SI 2 "memory_operand" "")) >(set (match_operand:SI 1 "arm_general_register_operand" "") > (match_operand:SI 3 "memory_operand" ""))] > "TARGET_LDRD > && current_tune->prefer_ldrd_strd > && !optimize_function_for_size_p (cfun)" > [(const_int 0)] > > > I have -march=armv7-a / -mcpu=cortex-a9 and thus for me > current_tune-> prefer_ldrd_strd is FALSE. > > Furthermore, if I want to do -Os the third condition is FALSE too. > But one ldrd must be shorter than two ldr ? > > That seems wrong... Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM should only be tried on Thumb-1. Emitting LDRD from a peephole when the offset is in range will never increase code size so should always be enabled.
[Bug rtl-optimization/65862] [MIPS] IRA/LRA issue: integers spilled to floating-point registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65862 --- Comment #4 from Wilco --- (In reply to Vladimir Makarov from comment #3)
> But I can not just revert the patch making ALL_REGS available
> to make coloring heuristic more fotunate for your particular case, as it
> reopens the old PR for which the patch was created and i have no other
> solutions for the old PR.

I tried reverting the ALL_REGS patch and I don't see any regressions - in fact allocations are slightly better (fewer registers with ALL_REGS preference which is what we need - a strong decision to allocate to either FP or int regs). So what was the motivation for it?

Note that it would be trivial to prefer ALL_REGS for some operands if necessary. The only case I can imagine is load and store which on some targets are quite orthogonal. I tried doing m=r#w and m=w#r on AArch64 and that works fine (this tells the preferencing code that ALL_REGS is best but it still keeps a clear INT/FP separation in the patterns which you may need for disassembly etc). IMHO that is a better solution than to automatically change the patterns r=r+r; w=w+w into rw=rw+rw and assume that ALL_REGS preference on all operands has zero cost even though the cost calculation explicitly states otherwise.
[Bug rtl-optimization/65862] [MIPS] IRA/LRA issue: integers spilled to floating-point registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65862 --- Comment #13 from Wilco --- (In reply to Vladimir Makarov from comment #9) > Created attachment 35503 [details] > ira-hook.patch > > Here is the patch. Could you try it and give me your opinion about it. > Thanks. I tried it out and when forcing ALL_REGS to either GENERAL_REGS or FP_REGS based on mode it generates significantly smaller spillcode, especially on some of the high register pressure SPECFP2006 benchmarks like gamess. It is better than avoiding ALL_REGS if the cost is higher (like the PPC patch mentioned earlier) - this indicates that the case for preferring ALL_REGS for generic loads/stores is pretty thin and likely not beneficial overall. I'm glad with this we're moving towards a more conventional allocation scheme where the decision of which register class to use is made early rather than independently for each operand during allocation. I didn't do a full performance regression test but early results show identical performance with smaller codesize. Note that this doesn't solve the lra-constraints issue where it ignores the allocno class during spilling and just chooses the first variant that matches.
[Bug target/63304] Aarch64 pc-relative load offset out of range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63304 --- Comment #33 from Wilco --- (In reply to Evandro from comment #32) > (In reply to Ramana Radhakrishnan from comment #31) > > (In reply to Evandro from comment #30) > > > The performance impact of always referring to constants as if they were > > > far > > > away is significant on targets which do not fuse ADRP and LDR together. > > > > What happens if you split them up and schedule them appropriately ? I didn't > > see any significant impact in my benchmarking on implementations that did > > not implement such fusion. Where people want performance in these cases they > > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there > > already. > > Because of side effects of the Haiffa scheduler, the loads now pile up, and > the ADRPs may affect the load issue rate rather badly if not fused. At leas > on our processor. ADRP latency to load-address should be zero on any OoO core - ADRP is basically a move-immediate, so can execute early and hide any latency. > Which brings another point, shouldn't there be just one ADRP per BB or, > ideally, per function? Or am I missing something? That's not possible in this case as the section is mergeable. An alternative implementation using anchors may be feasible, but GCC is extremely bad at using anchors efficiently - functions using several global variables also end up with a large number of ADRPs when you'd expect a single ADRP.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #6 from Wilco --- I ran the assembler examples on A57 hardware with identical input. The FMADD code is ~20% faster irrespectively of the size of the input. This is not a surprise given that the FMADD latency is lower than the FADD and FMUL latency. The alignment of the loop or scheduling don't matter at all as the FMADD latency dominates by far - with serious optimization this code could run 4-5 times as fast and would only be limited by memory bandwidth on datasets larger than L2. So this particular example shows issues in LLVM, not in GCC.
[Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503 --- Comment #10 from Wilco --- The loops shown are not the correct inner loops for those options - with -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the question is why GCC doesn't unroll vectorized loops like LLVM?

GCC:
.L24:
        ldr     q3, [x13, x5]
        add     x6, x6, 1
        ldr     q2, [x16, x5]
        cmp     x6, x12
        add     x5, x5, 16
        fmla    v1.2d, v3.2d, v2.2d
        bcc     .L24

LLVM:
.LBB2_12:
        ldur    q2, [x8, #-16]
        ldr     q3, [x8], #32
        ldur    q4, [x21, #-16]
        ldr     q5, [x21], #32
        fmla    v1.2d, v2.2d, v4.2d
        fmla    v0.2d, v3.2d, v5.2d
        sub     x30, x30, #4            // =4
        cbnz    x30, .LBB2_12