[Bug tree-optimization/91246] vectorization failure for a small loop to search array element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91246 --- Comment #5 from avieira at gcc dot gnu.org --- I have posted a prototype on the mailing list https://gcc.gnu.org/pipermail/gcc-patches/2020-March/541908.html This is really just a prototype to investigate code-gen impact. I don't expect to commit it as is, and I am not yet sure whether it makes sense to do something like this.
[Bug target/94445] gcc.target/arm/cmse/cmse-15.c fails for cortex-m33
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94445 avieira at gcc dot gnu.org changed: What|Removed |Added CC||avieira at gcc dot gnu.org Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2020-04-02 --- Comment #1 from avieira at gcc dot gnu.org --- Hi Christophe, This looks to me like an issue of not building distinct types for the ns_foo_t and s_bar_t function types. When I first wrote this code I tested for this and it was working, so I am wondering whether changes have been made in the way we create types in the c-frontend. I am trying to find out how all this works again, it's been a while...
[Bug target/94445] gcc.target/arm/cmse/cmse-15.c fails for cortex-m33
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94445 --- Comment #2 from avieira at gcc dot gnu.org --- start_decl seems to be doing the right thing, investigation continues...
[Bug target/94445] gcc.target/arm/cmse/cmse-15.c fails for cortex-m33
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94445 --- Comment #4 from avieira at gcc dot gnu.org --- Yeah... So far I have checked that 'gimplify_call_expr' creates the right gimple, and up until 'gimplify_modify_expr' I can verify it does by using gimple_call_fntype . Though at expansion time, the 'gimple_call_fntype (stmt)' of '_5 = s_bar_p_2(D) (); [tail call]' now has the attribute ... So it must go wrong somewhere between gimplification and expansion, but that's a big window and dump files won't help us :(
[Bug target/94445] gcc.target/arm/cmse/cmse-15.c fails for cortex-m33
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94445 --- Comment #6 from avieira at gcc dot gnu.org --- I have also identified that this only goes wrong at -O2 or higher, and that it happens sometime between tailcall optimization passes 1 and 2. But there are a lot of passes in between.
[Bug target/94814] [8 Regression] ICE: RTL check: expected code 'const_int', have 'reg' in output_3367, at config/aarch64/atomics.md:755
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94814 avieira at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED CC||avieira at gcc dot gnu.org --- Comment #2 from avieira at gcc dot gnu.org --- I believe this is fixed with the above backport.
[Bug target/95646] arm-none-eabi function attribute 'cmse_nonsecure_entry' wipes register values with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95646 avieira at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2020-06-15 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #1 from avieira at gcc dot gnu.org --- Reproduced and confirmed. This is because we treat HI_REGS specially in Thumb-1 when optimizing for size. I have a fix ready, just doing some testing.
[Bug target/96795] New: MVE: issue with polymorphism and integer promotion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96795 Bug ID: 96795 Summary: MVE: issue with polymorphism and integer promotion Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- An example of this issue can be observed when trying to compile: #include uint16x8_t foo (uint16x8_t a, int16_t b) { return vaddq (a, (b<<3)); } This will lead to an __ARM_undef being selected. I believe this is because __ARM_mve_coerce only accepts one type for scalar parameters and should have accepted the same range of types for scalar as is done in __ARM_mve_typeid. A workaround for this is to cast (b<<3) to uint16_t.
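The integer promotion behind this report can be observed on any C11 host, without MVE. The sketch below is not the arm_mve.h implementation; IS_INT, IS_UINT16 and the three functions are hypothetical names, used only to show through _Generic that 'b << 3' has type int rather than int16_t. That is presumably why a selector accepting a single scalar type does not match it. It assumes int is wider than 16 bits.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of arm_mve.h: evaluate to 1 when the
   expression has the named type, 0 otherwise.  _Generic does not
   evaluate its controlling expression, only inspects its type.  */
#define IS_INT(x)    _Generic((x), int: 1, default: 0)
#define IS_UINT16(x) _Generic((x), uint16_t: 1, default: 0)

/* b is promoted to int before the shift, so the result is an int.  */
int shift_is_int (int16_t b)
{
  return IS_INT (b << 3);
}

/* Consequently the shifted value is not a uint16_t.  */
int shift_is_uint16 (int16_t b)
{
  return IS_UINT16 (b << 3);
}

/* The workaround from the report: cast the result to uint16_t.  */
int cast_is_uint16 (int16_t b)
{
  return IS_UINT16 ((uint16_t) (b << 3));
}
```

With these definitions shift_is_int and cast_is_uint16 return 1 and shift_is_uint16 returns 0, which matches the workaround described in the report.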
[Bug tree-optimization/88915] Try smaller vectorisation factors in scalar fallback
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88915 --- Comment #5 from avieira at gcc dot gnu.org --- Author: avieira Date: Tue Oct 29 13:15:46 2019 New Revision: 277569 URL: https://gcc.gnu.org/viewcvs?rev=277569&root=gcc&view=rev Log: [vect]PR 88915: Vectorize epilogues when versioning loops gcc/ChangeLog: 2019-10-29 Andre Vieira PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. 
Otherwise analyse epilogue separately. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-loop-niter.c trunk/gcc/tree-ssa-loop-niter.h trunk/gcc/tree-ssa-sccvn.c trunk/gcc/tree-vect-loop-manip.c trunk/gcc/tree-vect-loop.c trunk/gcc/tree-vectorizer.c trunk/gcc/tree-vectorizer.h
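As a rough illustration of what "vectorizing the epilogue" means, the scalar C sketch below mimics the shape of the resulting code: a main loop at one vectorization factor, an epilogue at a smaller one, and a scalar tail. The factors 8 and 4 and the function name are assumptions made for the example, and the fixed-trip inner loops merely stand in for single vector operations; this is not what the vectorizer emits.

```c
#include <stddef.h>

/* Hypothetical vectorization factors for the main loop and its
   epilogue; real values depend on the target and the data type.  */
enum { MAIN_VF = 8, EPILOGUE_VF = 4 };

/* Sum a[0..n) in three stages that mirror main vector loop,
   vectorized epilogue and scalar tail.  */
int sum_with_epilogue (const int *a, size_t n)
{
  size_t i = 0;
  int sum = 0;

  /* Main loop: consumes MAIN_VF elements per iteration.  */
  for (; i + MAIN_VF <= n; i += MAIN_VF)
    for (size_t j = 0; j < MAIN_VF; j++)
      sum += a[i + j];

  /* Epilogue: handles leftovers with a smaller factor instead of
     dropping straight to scalar code.  */
  for (; i + EPILOGUE_VF <= n; i += EPILOGUE_VF)
    for (size_t j = 0; j < EPILOGUE_VF; j++)
      sum += a[i + j];

  /* Scalar tail: fewer than EPILOGUE_VF elements remain.  */
  for (; i < n; i++)
    sum += a[i];

  return sum;
}
```

For n = 15 the main loop runs once (8 elements), the epilogue once (4 elements) and the scalar tail three times.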
[Bug tree-optimization/92317] [10 Regression] ICE in slpeel_duplicate_current_defs_from_edges, at tree-vect-loop-manip.c:960 since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92317 --- Comment #1 from avieira at gcc dot gnu.org --- Confirmed. It seems get_loop_copy is returning NULL. I'm looking into it.
[Bug tree-optimization/92317] [10 Regression] ICE in slpeel_duplicate_current_defs_from_edges, at tree-vect-loop-manip.c:960 since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92317 --- Comment #2 from avieira at gcc dot gnu.org --- Actually, upon a second look it has nothing to do with that; that get_loop_body doesn't make much sense there anyway. I believe that should have just been 'loop', as slpeel_tree_duplicate_loop_to_edge_cfg creates a copy of LOOP from LOOP if LOOP == SCALAR_LOOP. The problem here lies with using SCALAR_LOOP for an epilogue... not quite sure what is wrong though.
[Bug tree-optimization/92317] [10 Regression] ICE in slpeel_duplicate_current_defs_from_edges, at tree-vect-loop-manip.c:960 since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92317 --- Comment #3 from avieira at gcc dot gnu.org --- Author: avieira Date: Wed Nov 6 11:22:35 2019 New Revision: 277877 URL: https://gcc.gnu.org/viewcvs?rev=277877&root=gcc&view=rev Log: [vect] PR92317: fix skip_epilogue creation for epilogues gcc/ChangeLog: 2019-11-06 Andre Vieira PR tree-optimization/92317 * tree-vect-loop-manip.c (slpeel_update_phi_nodes_for_guard2): Also update phi's with constant phi arguments. gcc/testsuite/ChangeLog: 2019-11-06 Andre Vieira PR tree-optimization/92317 * gcc/testsuite/g++.dg/opt/pr92317.C: New test. Added: trunk/gcc/testsuite/g++.dg/opt/pr92317.C Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-vect-loop-manip.c
[Bug tree-optimization/92317] [10 Regression] ICE in slpeel_duplicate_current_defs_from_edges, at tree-vect-loop-manip.c:960 since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92317 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from avieira at gcc dot gnu.org --- I believe that patch fixes the issue.
[Bug tree-optimization/92351] [10 Regression] Wrong code with -O3 -march=skylake since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92351 --- Comment #3 from avieira at gcc dot gnu.org --- Author: avieira Date: Fri Nov 8 13:52:56 2019 New Revision: 277974 URL: https://gcc.gnu.org/viewcvs?rev=277974&root=gcc&view=rev Log: [vect] PR 92351: When peeling for alignment make alignment of epilogues unknown gcc/ChangeLog: 2019-11-08 Andre Vieira PR tree-optimization/92351 * tree-vect-data-refs.c (vect_compute_data_ref_alignment): When we are peeling the main loop for alignment, make sure to set the misalignment of the epilogue's data references to DR_MISALIGNMENT_UNKNOWN. gcc/testsuite/ChangeLog: 2019-11-08 Andre Vieira PR tree-optimization/92351 * gcc.dg/vect/vect-peel-2.c: Disable epilogue vectorization and split the source of this test to... * gcc.dg/vect/vect-peel-2-src.c: ... This. * gcc.dg/vect/vect-peel-2-epilogues.c: New test. Added: trunk/gcc/testsuite/gcc.dg/vect/vect-peel-2-epilogues.c trunk/gcc/testsuite/gcc.dg/vect/vect-peel-2-src.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/vect-peel-2.c trunk/gcc/tree-vect-data-refs.c
[Bug tree-optimization/92351] [10 Regression] Wrong code with -O3 -march=skylake since r277569
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92351 avieira at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from avieira at gcc dot gnu.org --- I believe the committed patch fixes this.
[Bug tree-optimization/92429] [10 Regression] ICE in vect_transform_stmt, at tree-vect-stmts.c:10918
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92429 --- Comment #2 from avieira at gcc dot gnu.org --- So I had a look at this. The ICE occurs because 'vectorizable_condition' does not know how to handle a constant cond_expr. The reason this cond_expr is constant in the epilogue is that 'simplify_replace_tree' folds the replacement, and the replacement in this case is: "_34 < 0" where "_34 = _33 * _33", and fold-const is able to assert that _34 is therefore always positive or zero and can fold the check to false. The question now is, why was the original 'cond_expr' that we copied over not folded? I suspect it's because of the -fno-tree-fre. If we want this to work I suggest we either: 1) teach 'vectorizable_condition' how to deal with constant cond_expr's 2) change 'simplify_replace_tree' to optionally fold. I don't like 2) much because this doesn't guarantee we don't fold elsewhere. If we want the vectorizer to accept loop code in sub-optimal format I suggest we do 1).
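The folding described above can be pictured with a small C function. This is only a sketch assuming _33 is a signed int; the function name is hypothetical and the original testcase is not reproduced here. Because signed overflow is undefined in C, a compiler may assume the product of a value with itself is never negative, so the comparison can be folded to false.

```c
/* Models "_34 = _33 * _33; _34 < 0" from the comment above.  For any
   x whose square fits in an int the result is 0, and a compiler is
   allowed to fold the comparison to 0 outright.  */
int square_is_negative (int x)
{
  int sq = x * x;   /* plays the role of _34 */
  return sq < 0;    /* the condition that becomes constant false */
}
```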
[Bug tree-optimization/92347] [10 Regression] ICE in vect_get_vec_def_for_operand_1, at tree-vect-stmts.c:1537
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92347 --- Comment #3 from avieira at gcc dot gnu.org --- I had a look at the first testcase. I think the problem is I was setting the epilogue's safelen to the loop's safelen, after the loop->safelen had been cleared, as we do this after vectorization. Removing that update and letting epilogue keep the original safelen seems to solve the first ICE. The second is something different, looking at that now.
[Bug tree-optimization/92347] [10 Regression] ICE in vect_get_vec_def_for_operand_1, at tree-vect-stmts.c:1537
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92347 --- Comment #4 from avieira at gcc dot gnu.org --- The second case seems to be because vectorizable_simd_clone_call seems to be inserting values and phi-nodes on the epilogue's preheader edge which uses a value defined in the main loop's preheader edge (created by the main loop's call to vectorizable_simd_clone_call). However this definition does not dominate the use, as the main loop may have been skipped. Not entirely sure what the best action is here, I didn't get enough time to figure out what these values represent.
[Bug tree-optimization/92347] [10 Regression] ICE in vect_get_vec_def_for_operand_1, at tree-vect-stmts.c:1537
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92347 --- Comment #5 from avieira at gcc dot gnu.org --- Not quite sure the third case has anything to do with epilogue vectorization though... It still manifests itself with it turned off. Seems to be a lack of "folding" again. I think it would be useful to split testcases 2 and 3 into two new PR's as they are unrelated issues to 1.
[Bug tree-optimization/92347] [10 Regression] ICE in vect_get_vec_def_for_operand_1, at tree-vect-stmts.c:1537
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92347 --- Comment #7 from avieira at gcc dot gnu.org --- Thank you!
[Bug tree-optimization/92460] [10 Regression] ICE: verify_ssa failed (error: definition in block 13 does not dominate use in block 22)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92460 avieira at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-11-11 Ever confirmed|0 |1 --- Comment #1 from avieira at gcc dot gnu.org --- The ICE seems to be because vectorizable_simd_clone_call is inserting values and phi-nodes on the epilogue's preheader edge which uses a value defined in the main loop's preheader edge (created by the main loop's call to vectorizable_simd_clone_call). However this definition does not dominate the use, as the main loop may have been skipped. Not entirely sure what the best action is here, I didn't get enough time to figure out what these values represent. I suspect this is not because of my changes though, but it was a latent issue that now shows up because I turned on epilogue vectorization.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 91573, which changed state. Bug 91573 Summary: Vectorization failure for a loop to do multiply-add because SLP loads unnecessarily require permutation https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/91573] Vectorization failure for a loop to do multiply-add because SLP loads unnecessarily require permutation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED CC||avieira at gcc dot gnu.org Resolution|--- |FIXED --- Comment #7 from avieira at gcc dot gnu.org --- This now vectorizes for aarch64 and x86_64 with avx2 and avx512. Closing this ticket.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 91573, which changed state. Bug 91573 Summary: Vectorization failure for a loop to do multiply-add because SLP loads unnecessarily require permutation https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/92429] [10 Regression] ICE in vect_transform_stmt, at tree-vect-stmts.c:10918
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92429 --- Comment #5 from avieira at gcc dot gnu.org --- Hi Martin, Sorry about that, forgot to check it after I got back from holidays. I wrote up a patch, actually going with solution 2) (fixes both issues locally). Just running more tests now to make sure I didn't break anything else.
[Bug tree-optimization/92429] [10 Regression] ICE in vect_transform_stmt, at tree-vect-stmts.c:10918
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92429 avieira at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from avieira at gcc dot gnu.org --- I believe this is fixed, closing.
[Bug target/86487] [8 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #15 from avieira at gcc dot gnu.org --- Jeff seems to have backported this to gcc-8 already, so I guess we can close this?
[Bug target/88224] Wrong Cortex-R7 and Cortex-R8 FPU configuration
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88224 --- Comment #3 from avieira at gcc dot gnu.org --- Author: avieira Date: Fri Dec 14 09:04:24 2018 New Revision: 267124 URL: https://gcc.gnu.org/viewcvs?rev=267124&root=gcc&view=rev Log: PR target/88224: Fix FPU configuration of Cortex-R7 and Cortex-R8 gcc/ 2018-12-14 Andre Vieira Backport from mainline PR target/88224 * config/arm/arm-cpus.in (armv7-r): Add FP16conv configurations. (cortex-r7, cortex-r8): Update fpu and add new configuration. * doc/invoke.texi (armv7-r): Add two new vfp options. (nofp.dp): Add cortex-r7 and cortex-r8 to the list of targets that support this option. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/config/arm/arm-cpus.in branches/gcc-8-branch/gcc/doc/invoke.texi
[Bug target/88224] Wrong Cortex-R7 and Cortex-R8 FPU configuration
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88224 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from avieira at gcc dot gnu.org --- Fixed on trunk and gcc-8.
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #7 from avieira at gcc dot gnu.org --- Hi, This one sort of fell through the cracks on me. With help from Vlad and Richard S. I managed to track the issue to uses_hard_regs_p and the way it handles paradoxical subregs (or fails to). I have a patch for this, which I will rebase and test. I'll give your new testcase a whirl, Oliver, thanks! Cheers, Andre
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #8 from avieira at gcc dot gnu.org --- Oliver, Your new example doesn't seem to be hitting the same issue as the first one. The first failure was being caused by paradoxical subregs, the second one doesn't have paradoxical subregs. I'll try to investigate it.
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #10 from avieira at gcc dot gnu.org --- Hi Vlad, I don't think it is a duplicate. I believe this PR is caused by an issue with 'uses_hard_regs_p' and paradoxical subregs. I proposed a patch in https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00307.html , though it has a mistake: I forgot to add '|| SUBREG_P (x)' to the 'if (REG_P (x))' line, since x can now be a subreg. I haven't had much time lately, but I am now running the last bootstrap; I have done arm and aarch64 and am now doing x86. I can't reproduce this on GCC 9, but I can on 8 and earlier, and the latent bug is still there on 9. So I believe we should fix it regardless. Once the bootstrap is done I'll post the fixed patch + testcase (really only useful for gcc-8) on the mailing list. Cheers, Andre
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |ASSIGNED CC||avieira at gcc dot gnu.org Assignee|unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 --- Comment #5 from avieira at gcc dot gnu.org --- I have been looking at this and the problem does indeed lie with the register not being a hard reg, because aarch64_mem_pair_lanes_operand invokes aarch64_legitimate_address_p with 1 for the strict_p argument. Changing that to a 0 yields the desired results for this testcase. It is also worth noting that this is not only an ilp32 issue: because of this we would also miss cases where the argument hard-register would not be successfully combined into the load/store, for instance if the argument in the test function were a pointer to the pointer we are addressing. I will proceed to run tests now. If someone knows why that "strict_p" was being set to 1 please let me know; I am unfamiliar with this code and fear this might be too big a hammer.
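The difference between a strict and a non-strict address check can be sketched with a toy model. Nothing below is taken from the aarch64 back end: FIRST_PSEUDO and base_reg_ok_p are hypothetical, and the real aarch64_legitimate_address_p checks far more than this. The point is only that a strict check rejects pseudo registers, which are what addresses still use before register allocation, whereas a non-strict check lets them through.

```c
/* Hypothetical boundary for this sketch only: register numbers below
   it are hard registers, numbers at or above it are pseudos.  */
enum { FIRST_PSEUDO = 32 };

/* Toy base-register check.  With strict_p set, only hard registers
   are accepted; without it, pseudos are accepted as well and the
   register allocator is trusted to pick a valid hard register later.  */
int base_reg_ok_p (int regno, int strict_p)
{
  if (regno < FIRST_PSEUDO)
    return 1;           /* hard register: acceptable in this model */
  return !strict_p;     /* pseudo: acceptable only when not strict */
}
```

In this model a pseudo such as register 100 fails the check with strict_p set to 1 and passes with strict_p set to 0, which is the behaviour change the comment describes.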
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 --- Comment #7 from avieira at gcc dot gnu.org --- Bootstrap and regression testing look good. I'll put the patch up on the ML when we enter stage 1 again.
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 --- Comment #8 from avieira at gcc dot gnu.org --- Author: avieira Date: Thu May 24 08:53:39 2018 New Revision: 260635 URL: https://gcc.gnu.org/viewcvs?rev=260635&root=gcc&view=rev Log: PR target/83009: Relax strict address checking for store pair lanes The operand constraint for the memory address of store/load pair lanes was enforcing strictly hardware registers be allowed as memory addresses. We want to relax that such that these patterns can be used by combine. During register allocation the register constraint will enforce the correct register is chosen. gcc 2018-05-24 Andre Vieira PR target/83009 * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make address check not strict. gcc/testsuite 2018-05-24 Andre Vieira PR target/83009 * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/predicates.md trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 avieira at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from avieira at gcc dot gnu.org --- I believe my patch fixes this.
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 --- Comment #10 from avieira at gcc dot gnu.org --- Author: avieira Date: Wed May 30 15:59:14 2018 New Revision: 260957 URL: https://gcc.gnu.org/viewcvs?rev=260957&root=gcc&view=rev Log: Reverting r260635 gcc 2018-05-30 Andre Vieira 2018-05-24 Andre Vieira PR target/83009 Revert: * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make address check not strict. gcc/testsuite 2018-05-30 Andre Vieira 2018-05-24 Andre Vieira Revert PR target/83009 * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@260635 138bc75d-0d04-0410-961f-82ee72b054a4 Modified: trunk/gcc/config/aarch64/predicates.md trunk/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 91460, which changed state. Bug 91460 Summary: gcc -mpreferred-vector-width=256 is slower than -mpreferred-vector-width=128 for some loops https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE
[Bug tree-optimization/88915] Try smaller vectorisation factors in scalar fallback
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88915 avieira at gcc dot gnu.org changed: What|Removed |Added CC||skpgkp2 at gmail dot com --- Comment #4 from avieira at gcc dot gnu.org --- *** Bug 91460 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/91460] gcc -mpreferred-vector-width=256 is slower than -mpreferred-vector-width=128 for some loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED CC||avieira at gcc dot gnu.org Resolution|--- |DUPLICATE --- Comment #4 from avieira at gcc dot gnu.org --- Yes this looks like a duplicate of PR 88915. I'll mark it as such. *** This bug has been marked as a duplicate of bug 88915 ***
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #12 from avieira at gcc dot gnu.org --- Author: avieira Date: Wed Feb 20 14:11:43 2019 New Revision: 269039 URL: https://gcc.gnu.org/viewcvs?rev=269039&root=gcc&view=rev Log: [GCC] PR target/86487: fix the way 'uses_hard_regs_p' handles paradoxical subregs gcc/ChangeLog: 2019-02-20 Andre Vieira PR target/86487 * lra-constraints.c(uses_hard_regs_p): Fix handling of paradoxical SUBREGS. gcc/testsuite/ChangeLog: 2019-02-20 Andre Vieira PR target/86487 * gcc.target/arm/pr86487.c: New. Added: trunk/gcc/testsuite/gcc.target/arm/pr86487.c Modified: trunk/gcc/ChangeLog trunk/gcc/lra-constraints.c trunk/gcc/testsuite/ChangeLog
[Bug lto/86366] [9 regression] gcc.dg/profile-dir-3.c fails starting with r262251
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86366 avieira at gcc dot gnu.org changed: What|Removed |Added CC||avieira at gcc dot gnu.org --- Comment #2 from avieira at gcc dot gnu.org --- Hi Martin, We have also seen profile-dir-1.gcda fail on aarch64-none-linux-gnu and arm-none-linux-gnueabihf, as well as profile-dir-3.gcda, recently. I am assuming this is all related. Cheers, Andre
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 avieira at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2018-07-16 CC||avieira at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from avieira at gcc dot gnu.org --- Confirmed with a local build.
[Bug target/83009] gcc.target/aarch64/store_v2vec_lanes.c fails with -mabi=ilp32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83009 --- Comment #11 from avieira at gcc dot gnu.org --- Author: avieira Date: Thu Jul 19 14:03:21 2018 New Revision: 262881 URL: https://gcc.gnu.org/viewcvs?rev=262881&root=gcc&view=rev Log: [AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes gcc/ChangeLog 2018-07-19 Andre Vieira PR target/83009 * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Make address check not strict. gcc/testsuite/ChangeLog 2018-07-19 Andre Vieira PR target/83009 * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/predicates.md trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/aarch64/store_v2vec_lanes.c
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #2 from avieira at gcc dot gnu.org --- I am having quite a lot of trouble understanding what is going wrong, or maybe I should say, what parts are going right. I believe it tries to match the fifth alternative for anddi3_insn here which is: '&r' 'r' 'De' This fails because of the early clobber, rightfully so because: (insn 13 11 14 2 (set (reg:DI 0 r0 [125]) (and:DI (reg:DI 1 r1 [+-4 ]) (const_int 1 [0x1]))) "../t.c":3 79 {*anddi3_insn} (nil)) DI r0 overlaps with DI r1, seeing you need two consecutive GPRs to contain a DImode. I decided to debug reload to find out why it had picked r1 and I find 'get_hard_regno' first picks r2 for (subreg:DI (SI 122)) in the same instruction. If we go up we see: (insn 10 9 11 2 (set (reg:SI 2 r2 [122]) (xor:SI (reg:SI 0 r0 [orig:123 a ] [123]) (const_int 1 [0x1]))) "../t.c":3 111 {*arm_xorsi3} (nil)) Then in 'get_hard_regno' it invokes 'subreg_regno_offset', that returns 'nregs_xmode - nregs_ymode' as offset in big endian for paradoxical subregs with offset 0, where, xmode is inner and ymode is outer. That is '-1' in our case (and always negative). So I believe reload is now seeing 'r1-r2' as the register pair for that first 'and' operand and 'r0-r1' as the destination operand. At first I was thinking this was a middle-end issue, specifically for paradoxical subregs. However, I also saw a bit of Aarch64 big endian assembly that used 'odd' registers to represent DI register pairs (V2DI). Given the comment in 'subreg_regno_offset': /* If this is a big endian paradoxical subreg, which uses more actual hard registers than the original register, we must return a negative offset so that we find the proper highpart of the register. We assume that the ordering of registers within a multi-register value has a consistent endianness: if bytes and register words have different endianness, the hard registers that make up a multi-register value must be at least word-sized. 
*/ It made me start to think that GCC expects register pairs in big endian to be "called" by their Least Significant Register (LSR) and to be counted back from there. So '[r1, r0]' to be called (DI r1). I am not entirely sure about this though... I tried changing the arm back-end to only accept DI mode register pairs if the register is odd. That fixed this case but broke a lot of other things. I am thinking another way to fix it is to adapt Arm's 's_register_operand' to not accept paradoxical subregs in big endian, but I would first like to understand how the middle end expects/sees/generates register pairs if 'REG_WORDS_BIG_ENDIAN' is true.
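The register arithmetic in this comment can be worked through with a small model. The two functions below are hypothetical and are not GCC's subreg_regno_offset; they only encode the formula quoted above (inner register count minus outer register count) and a plain range-overlap test.

```c
/* Offset described above for a big-endian paradoxical subreg at
   offset 0: registers needed by the inner mode minus registers needed
   by the outer mode.  The outer mode is wider, so this is negative.  */
int paradoxical_be_offset (int nregs_inner, int nregs_outer)
{
  return nregs_inner - nregs_outer;
}

/* Return 1 if the register ranges [a, a + na) and [b, b + nb) share
   at least one register, 0 otherwise.  */
int regs_overlap (int a, int na, int b, int nb)
{
  return a < b + nb && b < a + na;
}
```

With SI 122 in r2 (one register) viewed as a DI (two registers), the offset is 1 - 2 = -1, so the source operand starts at r2 - 1 = r1 and spans r1-r2. The destination DI in r0 spans r0-r1, and regs_overlap (0, 2, 1, 2) returns 1: both operands contain r1, which is the early-clobber conflict described in the comment.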
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #3 from avieira at gcc dot gnu.org --- @Vlad: I added you to this ticket to see if you can shed some light on how GCC's register allocator deals with register pairs in big endian. I am struggling to figure out how all of this works together; see the previous comment. Thanks in advance!
[Bug fortran/25829] [F03] Asynchronous IO support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25829 --- Comment #46 from avieira at gcc dot gnu.org --- Author: avieira Date: Tue Jul 31 08:42:21 2018 New Revision: 263082 URL: https://gcc.gnu.org/viewcvs?rev=263082&root=gcc&view=rev Log: Reverting 'AsyncI/O patch committed' as it is breaking bare-metal builds. 2018-07-31 Andre Vieira Revert 'AsyncI/O patch committed' 2018-07-25 Nicolas Koenig Thomas Koenig PR fortran/25829 * gfortran.texi: Add description of asynchronous I/O. * trans-decl.c (gfc_finish_var_decl): Treat asynchronous variables as volatile. * trans-io.c (gfc_build_io_library_fndecls): Rename st_wait to st_wait_async and change argument spec from ".X" to ".w". (gfc_trans_wait): Pass ID argument via reference. 2018-07-31 Andre Vieira Revert 'AsyncI/O patch committed' 2018-07-25 Nicolas Koenig Thomas Koenig PR fortran/25829 * gfortran.dg/f2003_inquire_1.f03: Add write statement. * gfortran.dg/f2003_io_1.f03: Add wait statement. 2018-07-31 Andre Vieira Revert 'AsyncI/O patch committed' 2018-07-25 Nicolas Koenig Thomas Koenig PR fortran/25829 * Makefile.am: Add async.c to gfor_io_src. Add async.h to gfor_io_headers. * Makefile.in: Regenerated. * gfortran.map: Add _gfortran_st_wait_async. * io/async.c: New file. * io/async.h: New file. * io/close.c: Include async.h. (st_close): Call async_wait for an asynchronous unit. * io/file_pos.c (st_backspace): Likewise. (st_endfile): Likewise. (st_rewind): Likewise. (st_flush): Likewise. * io/inquire.c: Add handling for asynchronous PENDING and ID arguments. * io/io.h (st_parameter_dt): Add async bit. (st_parameter_wait): Correct. (gfc_unit): Add au pointer. (st_wait_async): Add prototype. (transfer_array_inner): Likewise. (st_write_done_worker): Likewise. * io/open.c: Include async.h. (new_unit): Initialize asynchronous unit. * io/transfer.c (async_opt): New struct. (wrap_scalar_transfer): New function. (transfer_integer): Call wrap_scalar_transfer to do the work. (transfer_real): Likewise. 
(transfer_real_write): Likewise. (transfer_character): Likewise. (transfer_character_wide): Likewise. (transfer_complex): Likewise. (transfer_array_inner): New function. (transfer_array): Call transfer_array_inner. (transfer_derived): Call wrap_scalar_transfer. (data_transfer_init): Check for asynchronous I/O. Perform a wait operation on any pending asynchronous I/O if the data transfer is synchronous. Copy PDT and enqueue thread for data transfer. (st_read_done_worker): New function. (st_read_done): Enqueue transfer or call st_read_done_worker. (st_write_done_worker): New function. (st_write_done): Enqueue transfer or call st_read_done_worker. (st_wait): Document as no-op for compatibility reasons. (st_wait_async): New function. * io/unit.c (insert_unit): Use macros LOCK, UNLOCK and TRYLOCK; add NOTE where necessary. (get_gfc_unit): Likewise. (init_units): Likewise. (close_unit_1): Likewise. Call async_close if asynchronous. (close_unit): Use macros LOCK and UNLOCK. (finish_last_advance_record): Likewise. (newunit_alloc): Likewise. * io/unix.c (find_file): Likewise. (flush_all_units_1): Likewise. (flush_all_units): Likewise. * libgfortran.h (generate_error_common): Add prototype. * runtime/error.c: Include io.h and async.h. (generate_error_common): New function. 2018-07-31 Andre Vieira Revert 'AsyncI/O patch committed'. 2018-07-25 Nicolas Koenig Thomas Koenig PR fortran/25829 * testsuite/libgomp.fortran/async_io_1.f90: New test. * testsuite/libgomp.fortran/async_io_2.f90: New test. * testsuite/libgomp.fortran/async_io_3.f90: New test. * testsuite/libgomp.fortran/async_io_4.f90: New test. * testsuite/libgomp.fortran/async_io_5.f90: New test. * testsuite/libgomp.fortran/async_io_6.f90: New test. * testsuite/libgomp.fortran/async_io_7.f90: New test. 
Removed: trunk/libgfortran/io/async.c trunk/libgfortran/io/async.h trunk/libgomp/testsuite/libgomp.fortran/async_io_1.f90 trunk/libgomp/testsuite/libgomp.fortran/async_io_2.f90 trunk/libgomp/testsuite/libgomp.fortran/async_io_3.f90 trunk/libgomp/testsuite/libgomp.fortran/async_io_4.f90 trunk/libgomp/testsuite/libgomp.fortran/async_io_5.f90 trun
[Bug target/86487] [7/8/9 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487 --- Comment #5 from avieira at gcc dot gnu.org --- I can confirm the ICE no longer occurs, but I am not entirely convinced the issue was "fixed" by this. I fear the underlying fault is still there, it is simply hidden now.
[Bug target/88224] New: Wrong Cortex-R7 and Cortex-R8 FPU configuration
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88224 Bug ID: 88224 Summary: Wrong Cortex-R7 and Cortex-R8 FPU configuration Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- The Cortex-R7 and Cortex-R8 TRMs* indicate that both CPUs can be configured with one of the following FPU options: 1) No FPU 2) Single-precision-only VFPv3, with 16 double-precision registers and with the FP16 conversion instructions extension 3) Single- and double-precision VFPv3, with 16 double-precision registers and with the FP16 conversion instructions extension. Currently GCC configures R7 and R8 without FP16 conversion instructions when using -mcpu=cortex-r7/cortex-r8, and it does not offer the single-precision-only configuration (i.e. no +nofp.dp). *) https://static.docs.arm.com/ddi0458/c/DDI0458C_cortex_r7_r0p1_trm.pdf https://static.docs.arm.com/100400/0001/arm_cortexr8_mpcore_processor_trm_100400_0001_03_en.pdf
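One way to see which FPU configuration the compiler has selected for a given -mcpu is to inspect the ACLE `__ARM_FP` macro, a bitmap in which bits 1, 2 and 3 indicate hardware support for half, single and double precision. The following is only a sketch, assuming an ACLE-conforming compiler; the helper names (`fp_has_half`, `fp_is_single_only`, `print_target_fp`, `check_report_configs`) are made up for illustration and are not part of GCC.

```c
#include <assert.h>
#include <stdio.h>

/* __ARM_FP is only defined when targeting Arm with an FPU enabled.
   Fall back to 0 ("no FPU") so this also compiles on other hosts.  */
#ifdef __ARM_FP
#define TARGET_FP_MASK ((unsigned) __ARM_FP)
#else
#define TARGET_FP_MASK 0u
#endif

enum { FP_HALF = 1 << 1, FP_SINGLE = 1 << 2, FP_DOUBLE = 1 << 3 };

/* Hypothetical helper: does the bitmap advertise half-precision support?  */
int
fp_has_half (unsigned mask)
{
  return (mask & FP_HALF) != 0;
}

/* Hypothetical helper: single precision present, double precision absent.  */
int
fp_is_single_only (unsigned mask)
{
  return (mask & FP_SINGLE) != 0 && (mask & FP_DOUBLE) == 0;
}

/* Print what the current compilation target advertises.  */
void
print_target_fp (void)
{
  unsigned mask = TARGET_FP_MASK;
  printf ("__ARM_FP=0x%x half=%d single-only=%d\n",
          mask, fp_has_half (mask), fp_is_single_only (mask));
}

/* The three options from the report, written as bitmaps:
   0x0 = no FPU, 0x6 = single-only with FP16, 0xE = single+double with FP16.  */
void
check_report_configs (void)
{
  assert (!fp_has_half (0x0) && !fp_is_single_only (0x0));
  assert (fp_has_half (0x6) && fp_is_single_only (0x6));
  assert (fp_has_half (0xE) && !fp_is_single_only (0xE));
}
```

Calling `print_target_fp ()` from a file compiled with the -mcpu option under test shows whether the half-precision bit is set for that configuration.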
[Bug target/88224] Wrong Cortex-R7 and Cortex-R8 FPU configuration
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88224 --- Comment #2 from avieira at gcc dot gnu.org --- Author: avieira Date: Thu Nov 29 10:20:13 2018 New Revision: 266612 URL: https://gcc.gnu.org/viewcvs?rev=266612&root=gcc&view=rev Log: [PATCH] [Arm] Fix fpu configurations for Cortex-R7 and Cortex-R8 gcc/ChangeLog: 2018-11-29 Andre Vieira PR target/88224 * config/arm/arm-cpus.in (armv7-r): Add FP16conv configurations. (cortex-r7, cortex-r8): Update default and add new configuration. * doc/invoke.texi (armv7-r): Add two new vfp options. (nofp.dp): Add cortex-r7 and cortex-r8 to the list of targets that support this option. Modified: trunk/gcc/ChangeLog trunk/gcc/config/arm/arm-cpus.in trunk/gcc/doc/invoke.texi
[Bug target/61578] [4.9 regression] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61578 --- Comment #43 from avieira at gcc dot gnu.org --- Author: avieira Date: Mon Jun 13 09:58:34 2016 New Revision: 237369 URL: https://gcc.gnu.org/viewcvs?rev=237369&root=gcc&view=rev Log: Backport from Mainline 2015-09-01 Vladimir Makarov PR target/61578 * lra-lives.c (process_bb_lives): Process move pseudos with the same value for copies and preferences * lra-constraints.c (match_reload): Create match reload pseudo with the same value from single dying input pseudo. Modified: branches/ARM/embedded-5-branch/gcc/ChangeLog.arm branches/ARM/embedded-5-branch/gcc/lra-constraints.c branches/ARM/embedded-5-branch/gcc/lra-lives.c
[Bug target/61578] [4.9 regression] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61578 --- Comment #44 from avieira at gcc dot gnu.org --- Author: avieira Date: Mon Jun 13 10:03:30 2016 New Revision: 237371 URL: https://gcc.gnu.org/viewcvs?rev=237371&root=gcc&view=rev Log: Backport from Mainline 2015-09-25 Vladimir Makarov PR target/61578 * lra-constraints.c (match_reload): Check presence of the input pseudo in the output pseudo. Modified: branches/ARM/embedded-5-branch/gcc/ChangeLog.arm branches/ARM/embedded-5-branch/gcc/lra-constraints.c
[Bug target/78255] [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 avieira at gcc dot gnu.org changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #4 from avieira at gcc dot gnu.org --- OK so after some extra debugging and digging I found that the postreload pass is basically turning the direct sibcall into an indirect sibcall. It takes cost into consideration, but does this only looking at the operands of the call, i.e. the cost of a symbolref vs the cost of a register. It does not take into consideration that it is doing a call. This doesn't seem like a good idea to me. Apart from that, I am now looking into letting arm_get_frame_offsets recalculate the offsets and registers to push and pop past reload_completed. I am not convinced this is entirely safe yet...
[Bug target/71607] [5/6/7 Regression] [ARM] ice due to forbidden enabled attribute dependency on instruction operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71607 --- Comment #8 from avieira at gcc dot gnu.org --- Author: avieira Date: Mon Dec 5 17:36:03 2016 New Revision: 243266 URL: https://gcc.gnu.org/viewcvs?rev=243266&root=gcc&view=rev Log: [ARM] PR71607: New approach to arm_disable_literal_pool gcc/ChangeLog.arm: 2016-12-05 Andre Vieira PR target/71607 * config/arm/arm.md (use_literal_pool): Removes. (64-bit immediate split): No longer takes cost into consideration if 'arm_disable_literal_pool' is enabled. * config/arm/arm.c (arm_use_blocks_for_constant_p): New. (TARGET_USE_BLOCKS_FOR_CONSTANT_P): Define. (arm_max_const_double_inline_cost): Remove use of arm_disable_literal_pool. * config/arm/vfp.md (no_literal_pool_df_immediate): New. (no_literal_pool_sf_immediate): New. gcc/testsuite/ChangeLog.arm: 2016-12-05 Andre Vieira Thomas Preud'homme PR target/71607 * gcc.target/arm/thumb2-slow-flash-data.c: Renamed to ... * gcc.target/arm/thumb2-slow-flash-data-1.c: ... this. * gcc.target/arm/thumb2-slow-flash-data-2.c: New. * gcc.target/arm/thumb2-slow-flash-data-3.c: New. * gcc.target/arm/thumb2-slow-flash-data-4.c: New. * gcc.target/arm/thumb2-slow-flash-data-5.c: New. Added: branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data-1.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data-2.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data-3.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data-4.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data-5.c Removed: branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/thumb2-slow-flash-data.c Modified: branches/ARM/embedded-6-branch/gcc/ChangeLog.arm branches/ARM/embedded-6-branch/gcc/config/arm/arm.c branches/ARM/embedded-6-branch/gcc/config/arm/arm.md branches/ARM/embedded-6-branch/gcc/config/arm/vfp.md branches/ARM/embedded-6-branch/gcc/testsuite/ChangeLog.arm
[Bug rtl-optimization/78255] [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #12 from avieira at gcc dot gnu.org --- Author: avieira Date: Fri Dec 9 16:46:42 2016 New Revision: 243494 URL: https://gcc.gnu.org/viewcvs?rev=243494&root=gcc&view=rev Log: PR78255: Make postreload aware of NO_FUNCTION_CSE gcc/ChangeLog: 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc/postreload.c (reload_cse_simplify): Do not CSE a function if NO_FUNCTION_CSE is true. gcc/testsuite/ChangeLog: 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc.target/aarch64/pr78255.c: New. * gcc.target/arm/pr78255-1.c: New. * gcc.target/arm/pr78255-2.c: New. Added: trunk/gcc/testsuite/gcc.target/aarch64/pr78255.c trunk/gcc/testsuite/gcc.target/arm/pr78255-1.c trunk/gcc/testsuite/gcc.target/arm/pr78255-2.c Modified: trunk/gcc/ChangeLog trunk/gcc/postreload.c trunk/gcc/testsuite/ChangeLog
[Bug rtl-optimization/78255] [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #13 from avieira at gcc dot gnu.org --- Author: avieira Date: Fri Dec 9 17:22:20 2016 New Revision: 243496 URL: https://gcc.gnu.org/viewcvs?rev=243496&root=gcc&view=rev Log: PR78255: Make postreload aware of NO_FUNCTION_CSE gcc/ChangeLog.arm: 2016-12-09 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc/postreload.c (reload_cse_simplify): Do not CSE a function if NO_FUNCTION_CSE is true. gcc/testsuite/ChangeLog.arm: 2016-12-09 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc.target/aarch64/pr78255.c: New. * gcc.target/arm/pr78255-1.c: New. * gcc.target/arm/pr78255-2.c: New. Added: branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/aarch64/pr78255.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/pr78255-1.c branches/ARM/embedded-6-branch/gcc/testsuite/gcc.target/arm/pr78255-2.c Modified: branches/ARM/embedded-6-branch/gcc/ChangeLog.arm branches/ARM/embedded-6-branch/gcc/postreload.c branches/ARM/embedded-6-branch/gcc/testsuite/ChangeLog.arm
[Bug rtl-optimization/78255] [5/6 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #14 from avieira at gcc dot gnu.org --- Author: avieira Date: Mon Jan 9 09:58:54 2017 New Revision: 244220 URL: https://gcc.gnu.org/viewcvs?rev=244220&root=gcc&view=rev Log: PR78255: Make postreload aware of NO_FUNCTION_CSE gcc/ChangeLog: 2017-01-09 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc/postreload.c (reload_cse_simplify): Do not CSE a function if NO_FUNCTION_CSE is true. gcc/testsuite/ChangeLog: 2017-01-09 Andre Vieira Backport from mainline 2016-12-20 Andre Vieira * gcc.target/arm/pr78255-2.c: Fix to work for targets that do not optimize for tailcall. 2017-01-09 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc.target/aarch64/pr78255.c: New. * gcc.target/arm/pr78255-1.c: New. * gcc.target/arm/pr78255-2.c: New. Added: branches/gcc-6-branch/gcc/testsuite/gcc.target/aarch64/pr78255.c branches/gcc-6-branch/gcc/testsuite/gcc.target/arm/pr78255-1.c branches/gcc-6-branch/gcc/testsuite/gcc.target/arm/pr78255-2.c Modified: branches/gcc-6-branch/gcc/ChangeLog branches/gcc-6-branch/gcc/postreload.c branches/gcc-6-branch/gcc/testsuite/ChangeLog
[Bug rtl-optimization/78255] [5/6 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #15 from avieira at gcc dot gnu.org --- Author: avieira Date: Wed Jan 11 15:08:25 2017 New Revision: 244319 URL: https://gcc.gnu.org/viewcvs?rev=244319&root=gcc&view=rev Log: PR78255: Make postreload aware of NO_FUNCTION_CSE gcc/ChangeLog: 2017-01-11 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc/postreload.c (reload_cse_simplify): Do not CSE a function if NO_FUNCTION_CSE is true. gcc/testsuite/ChangeLog: 2017-01-11 Andre Vieira Backport from mainline 2016-12-20 Andre Vieira * gcc.target/arm/pr78255-2.c: Fix to work for targets that do not optimize for tailcall. 2017-01-11 Andre Vieira Backport from mainline 2016-12-09 Andre Vieira PR rtl-optimization/78255 * gcc.target/aarch64/pr78255.c: New. * gcc.target/arm/pr78255-1.c: New. * gcc.target/arm/pr78255-2.c: New. Added: branches/gcc-5-branch/gcc/testsuite/gcc.target/aarch64/pr78255.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/pr78255-1.c branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/pr78255-2.c Modified: branches/gcc-5-branch/gcc/ChangeLog branches/gcc-5-branch/gcc/postreload.c branches/gcc-5-branch/gcc/testsuite/ChangeLog
[Bug target/79237] [5/6/7 Regression] ARMv7-M ICE in extract_constrain_insn, insn does not satisfy its constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79237 --- Comment #3 from avieira at gcc dot gnu.org --- Hi, My outstanding patch for PR71607 fixes this ICE too. I am currently retesting it after some comments upstream and should be posting a new version soon. Cheers, Andre
[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498 avieira at gcc dot gnu.org changed: What|Removed |Added Target||arm-none-eabi CC||avieira at gcc dot gnu.org --- Comment #2 from avieira at gcc dot gnu.org --- I am observing some regressions for arm-none-eabi on a Cortex-M0+ for a popular embedded benchmark following this patch. I believe register pressure might also be the root cause of this given the significant increase of loads and registers from and to the stack. Though I need to have a better look. Passing the option -fno-code-hoisting brings the performance numbers back up.
[Bug rtl-optimization/77499] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 avieira at gcc dot gnu.org changed: What|Removed |Added CC||rguenth at gcc dot gnu.org, ||segher at gcc dot gnu.org --- Comment #1 from avieira at gcc dot gnu.org --- Adding Richard, since this was exposed after Richard's code-hoisting patch and Segher because I believe the root of the problem might be related to the way combine works.
[Bug rtl-optimization/77499] New: Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 Bug ID: 77499 Summary: Regression after code-hoisting, due to combine pass failing to evaluate known value range Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Hello, We are seeing a regression for arm-none-eabi on a Cortex-M7. This regression was observed after Biener's and Bosscher's GIMPLE code-hoisting patch (https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00360.html). The example below illustrates the regression: $ cat t.c unsigned short foo (unsigned short x, int c, int d, int e) { unsigned short y; while (c > d) { if (c % 3) { x = (x >> 1) ^ 0xB121U; } else x = x >> 1; c -= e; } return x; } Comparing: $ arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb -O2 -S t.c vs $ arm-none-eabi-gcc -mcpu=cortex-m7 -mthumb -O2 -S t.c -fno-code-hoisting shows that the code-hoisting version has an extra zero_extend of HImode to SImode. After some digging I found out that during the combine phase, the no-code-hoisting version is able to recognize that the 'last_set_nonzero_bits' are 0x7fff, whereas the code-hoisted version seems to have lost this knowledge. Looking at the graph dump for the no-code-hoisting t.c.246r.ud_dce.dot I see the following insns: 23: r125:SI=r113:SI 0>>0x1 24: r111:SI=zero_extend(r125:SI#0) 27: r128:SI=r111:SI^r131:SI 28: r113:SI=zero_extend(r128:SI#0) These are all in the same basic block. For the code-hoisting version we have: BB A: ... 12: r116:SI=r112:SI 0>>0x1 13: r112:SI=zero_extend(r116:SI#0) ... BB B: 27: r127:SI=r112:SI^r129:SI 28: r112:SI=zero_extend(r127:SI#0) From what I have observed by debugging the combine pass, combine will first combine instructions 23 and 24.
Here combine is able to optimize away the zero_extend, because in 'reg_nonzero_bits_for_combine' the reg_stat[113] has its 'last_set_value' set to 'r0' (the unsigned short argument) and its corresponding 'last_set_nonzero_bits' set to 0xffff, which means the zero_extend is pointless. The code-hoisting version also combines 12 and 13, optimizing away the zero_extend. However, the next set of instructions is where things get tricky. In the no-code-hoisting case it will end up combining all 4 instructions one by one from the top down and it will end up figuring out that the last zero_extend is also not required. For the code-hoisting case, when it tries to combine 28 with 27 (remember they are not in the same basic block as 12 and 13, so it will never try to combine all 4), it will eventually try to evaluate the nonzero bits based on r112 and see that the last_set_value for r112 is: (lshiftrt:SI (clobber:SI (const_int 0 [0])) (const_int 1 [0x1])) The last_set_nonzero_bits will be 0x7fffffff, instead of the expected 0x7fff. This looks like the result of the code in 'record_value_for_reg' in combine.c that sits below the comment: /* The value being assigned might refer to X (like in "x++;"). In that case, we must replace it with (clobber (const_int 0)) to prevent infinite loops. */ Given that 12 and 13 were combined into: r112:SI=r112:SI 0>>0x1 This seems to be invalidating the last_set_value and thus leaving us with a weaker 'last_set_nonzero_bits', which isn't enough to optimize away the zero_extend. Any clue on how to "fix" this? Cheers, Andre PS: I am not sure I completely understand the way the last_set_value stuff works for pseudos in combine, but it looks to me like each instruction is visited in a top-down order per basic block, and the basic blocks again in a top-down order. Each pseudo will have its 'last_set_value' according to the last time that register was seen being set, without any regard to loops or proper dataflow analysis. Can anyone explain to me how this doesn't go horribly wrong?
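The nonzero-bits reasoning described in this report can be illustrated with a small model. This is a simplified sketch, not the code in combine.c: a mask has a 1 in every bit position that may be nonzero, and the function names (`nz_lshiftrt`, `nz_xor`, `zero_extend_hi_is_redundant`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Possibly-nonzero bits after a logical right shift by N.  */
uint32_t
nz_lshiftrt (uint32_t nz, unsigned n)
{
  return nz >> n;
}

/* Possibly-nonzero bits after an xor: any bit that may be set in
   either operand may be set in the result.  */
uint32_t
nz_xor (uint32_t a, uint32_t b)
{
  return a | b;
}

/* A zero_extend from HImode to SImode does nothing when no bit above
   bit 15 can be nonzero.  */
int
zero_extend_hi_is_redundant (uint32_t nz)
{
  return (nz & ~UINT32_C (0xffff)) == 0;
}

/* Known origin: the value comes from an unsigned short (mask 0xffff).
   The shift gives 0x7fff and the xor with 0xB121 stays within 16 bits.  */
int
known_origin_case (void)
{
  uint32_t shifted = nz_lshiftrt (UINT32_C (0xffff), 1);
  return zero_extend_hi_is_redundant (nz_xor (shifted, UINT32_C (0xB121)));
}

/* Unknown origin: the operand of the shift is a clobber, so all 32 bits
   may be set.  The shift gives 0x7fffffff and the extend must stay.  */
int
clobbered_origin_case (void)
{
  uint32_t shifted = nz_lshiftrt (UINT32_C (0xffffffff), 1);
  return zero_extend_hi_is_redundant (nz_xor (shifted, UINT32_C (0xB121)));
}
```

In this model `known_origin_case ()` reports the extend as redundant while `clobbered_origin_case ()` does not, which mirrors the difference between the two versions.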
[Bug rtl-optimization/77499] [7 Regression] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 --- Comment #6 from avieira at gcc dot gnu.org --- > so we are talking about the uxthne insn (I don't know arm / thumb very well). Yes, the uxthne is the "zero_extend" that is otherwise optimized away if you turn off code-hoisting. This is because the way the code gets transformed leads to: r112:SI=r112:SI 0>>0x1, which is the combination of instructions 12 and 13 in my example earlier. r112 is also the first operand of the xor instruction, and because of the way combine does its "nonzero bit analysis" it always looks at the last set value for each pseudo. For r112 here, that value refers to r112 itself, which would be an infinite loop, so combine is not able to recognize that r112 originated from r0, thus losing the information that it is at most an unsigned short. This leads to the decision not to get rid of the zero_extend. I'll have a look at if-convert.
[Bug rtl-optimization/77499] [7 Regression] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 --- Comment #7 from avieira at gcc dot gnu.org --- if-convert is a no go here, for the reason Andrew pointed out, sorry I missed that comment! So I don't know... The only thing I can think of is better "value-range"-like analysis for combine, but that might be too costly? The fact is that for the code-hoisting to work here, the pseudo for r112 has to be shared among both code paths, so the only alternative I see is to add an extra move: BB 0: r112:SI = r0:SI BB 1: ... r116:SI=r112:SI 0>>0x1 rNEW:SI=zero_extend(r116:SI#0) ... if CC goto BB 2 else BB EXTRA BB 2: r127:SI=rNEW:SI^r129:SI r112:SI=zero_extend(r127:SI#0) if LOOP: goto BB 1 else BB exit BB EXTRA: r112:SI=rNEW:SI if LOOP: goto BB 1 else BB exit You then end up with an extra move rather than a zero_extend. But maybe the move can be optimized away in later stages? Or maybe put in the same conditional execution block as the XOR...
[Bug rtl-optimization/77499] [7 Regression] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 --- Comment #9 from avieira at gcc dot gnu.org --- > > So I dont know... Only thing I can think of is better "value-range"-like > > analysis for combine, but that might be too costly? > > So we are not really looking for combine to combine the shift stmt > with the xor stmt? Because combine doesn't consider that because of > the multi-use. AFAIK, combine will not combine the shift and xor because they are in different basic blocks. The multi-use prevents it from tracking the origin of r112 back to a point where it knows that its higher bits are all 0. > > > > And you end up with an extra move rather than a zero_extend. But maybe the > > move > > can be optimized away in later stages? Or maybe put in the same conditional > > execution block as the XOR... > > Well, we run into a general issue of the RTL combiner -- fwprop and > ree are other passes that are supposed to remove extensions in some > cases. > > Really, the user could have written the code in a way CSEing the > shift himself -- it's unfortunate that we now fail to optimize the > non-CSEd source but that can only be a reason to enhance downstream > passes. True, if, say, the unused 'y' I left in there for some odd reason were used to CSE (x >> 1) outside the if-then-else, then you would end up with the zero_extend with both -fcode-hoisting and -fno-code-hoisting.
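For reference, the hand-CSEd source discussed in this comment could look like the sketch below. `foo` is the loop from the original report with the unused 'y' dropped; `foo_cse` is an illustrative rewrite (the name is made up) that uses 'y' to compute (x >> 1) once, outside the if-then-else. Both return the same values; only the shape of the source differs.

```c
#include <assert.h>

/* Loop from the original report, without the unused 'y'.  */
unsigned short
foo (unsigned short x, int c, int d, int e)
{
  while (c > d)
    {
      if (c % 3)
        x = (x >> 1) ^ 0xB121U;
      else
        x = x >> 1;
      c -= e;
    }
  return x;
}

/* Same loop with the shift shared by both branches through 'y'.  */
unsigned short
foo_cse (unsigned short x, int c, int d, int e)
{
  unsigned short y;
  while (c > d)
    {
      y = x >> 1;
      if (c % 3)
        x = y ^ 0xB121U;
      else
        x = y;
      c -= e;
    }
  return x;
}
```

Compiling `foo_cse` with and without -fno-code-hoisting is one way to check the claim that the zero_extend then shows up in both cases. Note that both functions only terminate when `e` is positive or the loop condition is false on entry.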
[Bug rtl-optimization/77499] [7 Regression] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 --- Comment #11 from avieira at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #10) > That is what nonzero_bits etc. is about. We could do much better nowadays > with the generic DF framework. > I am not familiar with the generic DF framework, could you point me to it? > Is code hoisting making the code better at all here? (At RTL level) Not as is, but I was hoping that if the zero_extend gets removed, we could end up with: movw r6, #45345 .L4: smull r5, r4, r7, r1 lsrs r0, r0, #1 sub r4, r4, r1, asr #31 -eor r5, r0, r6 add r4, r4, r4, lsl #1 cmp r1, r4 sub r1, r1, r3 it ne -uxthne r0, r5 +eorne r0, r0, r6 cmp r2, r1 blt .L4 So compared to the no-code-hoisting case it would realize it needs to do the same shift in both cases and only do it once.
[Bug rtl-optimization/77499] [7 Regression] Regression after code-hoisting, due to combine pass failing to evaluate known value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77499 avieira at gcc dot gnu.org changed: What|Removed |Added CC||kugan at gcc dot gnu.org --- Comment #12 from avieira at gcc dot gnu.org --- I heard Kugan was working on getting rid of superfluous zero_extends. Adding him to the watch list. @Kugan: Could your work help this case? And when do you plan to have it submitted?
[Bug bootstrap/77695] [7 Regression] bootstrap failure due to undeclared hook_uint_uintp_false
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77695 avieira at gcc dot gnu.org changed: What|Removed |Added CC||avieira at gcc dot gnu.org --- Comment #4 from avieira at gcc dot gnu.org --- Sorry about that and thank you for the fix. I'm curious as to why my aarch64 bootstrap didn't pick this up. It was with an earlier version (2 months ago), but I don't see why that would make a difference in this case. Anyhow, again sorry for breaking the world. Cheers, Andre
[Bug debug/77773] New: [7/6 regression] Segfault when compiling __simd64_float16_t using arm-none-eabi-g++ with debug information
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77773 Bug ID: 77773 Summary: [7/6 regression] Segfault when compiling __simd64_float16_t using arm-none-eabi-g++ with debug information Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: debug Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Hello, When compiling the following: $ cat t.c typedef __simd64_float16_t float16x4_t; with: $ arm-none-eabi-g++ -S t.c -mfloat-abi=hard -march=armv7-a -g t.c:1:28: internal compiler error: Segmentation fault 0xd33a1f crash_signal src/gcc/gcc/toplev.c:336 0x881b8f tree_class_check src/gcc/gcc/tree.h:3148 0x881b8f c_pretty_printer::simple_type_specifier(tree_node*) src/gcc/gcc/c-family/c-pretty-print.c:351 0x7ce46e cxx_pretty_printer::simple_type_specifier(tree_node*) src/gcc/gcc/cp/cxx-pretty-print.c:1324 0x884dec pp_c_specifier_qualifier_list(c_pretty_printer*, tree_node*) src/gcc/gcc/c-family/c-pretty-print.c:478 0x884dde pp_c_specifier_qualifier_list(c_pretty_printer*, tree_node*) src/gcc/gcc/c-family/c-pretty-print.c:474 0x7ccbe2 pp_cxx_type_specifier_seq src/gcc/gcc/cp/cxx-pretty-print.c:1379 0x6b4cd4 dump_type src/gcc/gcc/cp/error.c:467 0x6be905 dump_type_prefix src/gcc/gcc/cp/error.c:811 0x6b26b2 dump_simple_decl src/gcc/gcc/cp/error.c:970 0x6b2e00 dump_decl src/gcc/gcc/cp/error.c:1057 0x6beaf1 decl_as_string(tree_node*, int) src/gcc/gcc/cp/error.c:2882 0x6beb1f decl_as_dwarf_string(tree_node*, int) src/gcc/gcc/cp/error.c:2871 0x59a171 cxx_dwarf_name src/gcc/gcc/cp/cp-lang.c:119 0x97f8be type_tag src/gcc/gcc/dwarf2out.c:19191 0x9a1369 gen_array_type_die src/gcc/gcc/dwarf2out.c:19367 0x9a1369 gen_type_die_with_usage src/gcc/gcc/dwarf2out.c:23080 0x9a1c8b gen_type_die src/gcc/gcc/dwarf2out.c:23142 0x9ab9d7 modified_type_die src/gcc/gcc/dwarf2out.c:11469 0x9abf9c add_type_attribute src/gcc/gcc/dwarf2out.c:19123 Removing -g makes it compile without errors.
[Bug debug/77773] [7/6 regression] Segfault when compiling __simd64_float16_t using arm-none-eabi-g++ with debug information
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77773 --- Comment #1 from avieira at gcc dot gnu.org --- When I said without errors I meant without segfaulting. It will print out the following error for version 5 if you don't include '-mfpu=neon': t.c:1:9: error: '__simd64_float16_t' does not name a type typedef __simd64_float16_t float16x4_t;
[Bug debug/77773] [6/7 regression] Segfault when compiling __simd64_float16_t using arm-none-eabi-g++ with debug information
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77773 --- Comment #3 from avieira at gcc dot gnu.org --- Just to make it clear: The command I showed without the '-g' used to error on gcc-5, but it doesn't on 6 and 7: $ gcc-5/arm-none-eabi-g++ -S t.c -mfloat-abi=hard -march=armv7-a t.c:1:9: error: '__simd64_float16_t' does not name a type typedef __simd64_float16_t float16x4_t; $ gcc-6/arm-none-eabi-g++ -S t.c -mfloat-abi=hard -march=armv7-a $ gcc-7/arm-none-eabi-g++ -S t.c -mfloat-abi=hard -march=armv7-a Adding -mfpu=neon to gcc-5 gets rid of the error: $ gcc-5/arm-none-eabi-g++ -S t.c -mfloat-abi=hard -march=armv7-a -mfpu=neon Adding -mfpu=neon to either gcc-6 or 7 makes no difference to either compilation, with or without '-g'.
[Bug target/71607] [5/6/7 Regression] [ARM] ice due to forbidden enabled attribute dependency on instruction operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71607 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #7 from avieira at gcc dot gnu.org --- Got a patch up for review on gcc-patches that fixes this, see https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00377.html
[Bug target/78255] New: [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 Bug ID: 78255 Summary: [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- As first reported by Andrew on https://bugs.launchpad.net/gcc-arm-embedded/+bug/1616992 To reproduce on trunk: $ cat test.c #include <string.h> struct table_s { void (*fun0) ( void ); void (*fun1) ( void ); void (*fun2) ( void ); void (*fun3) ( void ); void (*fun4) ( void ); void (*fun5) ( void ); void (*fun6) ( void ); void (*fun7) ( void ); } table; void callback0(){__asm("mov r0, r0 \n\t");} void callback1(){__asm("mov r0, r0 \n\t");} void callback2(){__asm("mov r0, r0 \n\t");} void callback3(){__asm("mov r0, r0 \n\t");} void callback4(){__asm("mov r0, r0 \n\t");} void test(void) { memset(&table, 0, sizeof table); asm volatile ("" : : : "r3"); table.fun0 = callback0; table.fun1 = callback1; table.fun2 = callback2; table.fun3 = callback3; table.fun4 = callback4; table.fun0(); } $ arm-none-eabi-gcc -S -O2 -mthumb -mcpu=cortex-m3 test.c $ cat test.s ... ldr r5, .L8+4 ldr r3, .L8+8 ldr r0, .L8+12 ldr r1, .L8+16 ldr r2, .L8+20 str r5, [r4] str r0, [r4, #4] str r1, [r4, #8] str r2, [r4, #12] str r3, [r4, #16] pop {r3, r4, r5, lr} bx r3 @ indirect register sibling call ... As reported, we see that r3 is "restored" before being used to do the sibling call. So it will no longer contain the address of the call. I believe this is because 'arm_get_frame_offsets' is called to determine whether we can safely use 'r3' to align the stack using the function 'any_sibcall_could_use_r3'. This is done before the address of the sibcall is assigned a hard register, so 'any_sibcall_could_use_r3' returns 'false' and we push and pop 'r3' in the pro- and epilogue.
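To check at run time that the call through the table reaches the right callback, a portable variant of the reproducer can be used. This is only a sketch: the Arm inline assembly and the r3 clobber are replaced by counters, so it does not trigger the bug by itself, and the names `calls0`, `calls1` and `test_table` are made up for illustration.

```c
#include <assert.h>
#include <string.h>

struct table_s
{
  void (*fun0) (void);
  void (*fun1) (void);
};

struct table_s table;

/* Counters recording which callback actually ran.  */
int calls0;
int calls1;

void callback0 (void) { ++calls0; }
void callback1 (void) { ++calls1; }

void
test_table (void)
{
  memset (&table, 0, sizeof table);
  table.fun0 = callback0;
  table.fun1 = callback1;
  /* The call is the last action of the function and goes through a
     pointer, so the compiler may emit it as an indirect sibling call.  */
  table.fun0 ();
}
```

After one call to `test_table ()`, `calls0` should be 1 and `calls1` should be 0. With the wrong code described in this report, the branch would instead go to whatever value had been popped into r3.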
[Bug target/78255] [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #1 from avieira at gcc dot gnu.org --- OK I think I assigned the blame to the wrong function, I think it is the responsibility of 'is_indirect_tailcall_p' to catch this. Though I believe the last time it is called during the postreload pass, the call rtx still has a symbolref in it and only later in the pass is it replaced with a register. Too late for this function to catch it and after that 'reload_completed' is set to true and 'arm_get_frame_offsets' only returns the precomputed offsets. I have a workaround where I add a use clause to the sibling patterns, which seems to work, but I am not entirely sure why it works and I am not sure it is the right approach either.
[Bug target/69538] gcc.dg/torture/stackalign/builtin-apply-4.c fails with flto for aarch32 targets with single precision FPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69538 --- Comment #6 from avieira at gcc dot gnu.org --- I had a look at this and after some digging I found the bug is not due to LTO, but rather to "local" functions. If you make bar static you will end up with the same faulty behavior. After some more digging I found myself going through the 'untyped_call' code in arm.md. Here I found both 'untyped_call' and 'untyped_return' had not been adjusted to be able to cope with hard-float ABIs. I wrote a patch to mend this. It needs a bit more work, but I think it's correct and I might put it on gcc-patches at a later time. However, when I started thinking of how I was going to "fix" this wrong-code generation, I realized that due to the way untyped_calls and untyped_returns are constructed and the nature of '__builtin_return' and '__builtin_apply', you do not know which registers are actually used to return the values, you only know it might be 'r0-r4' and 'd0-d7'. So even though I know the call site would expect a return of type 'double' in 'r0-r1', because this is a local function (aka 'ARM_PCS_AAPCS_LOCAL') and the target does not support double precision, there is no way for me to know in which of the registers the function is actually returning, so I don't know what registers to move to 'r0-r1'. So I don't think we can get this builtin to work for single-precision VFPs without compromising on the way we use local function returns.
[Bug target/78255] [5/6/7 regression] Indirect sibling call causing wrong code generation for ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78255 --- Comment #2 from avieira at gcc dot gnu.org --- The approach I had doesn't work, it ICEs elsewhere... At this time I am not sure how to fix this without disabling indirect tail calls for the current function if any sibcall is done within it. This might be too big a hammer... If anyone has any tips they are very welcome.
[Bug rtl-optimization/98791] [11 Regression] ICE in paradoxical_subreg_p (in ira) with SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791
avieira at gcc dot gnu.org changed:
           What|Removed |Added
  Known to work|10.2.1  |
  Known to fail|        |10.2.1
         Status|RESOLVED|REOPENED
     Resolution|FIXED   |---
--- Comment #6 from avieira at gcc dot gnu.org --- Hi Jeffrey, I was leaving this open to remind me to backport the fix to gcc-10. I see the ticket falsely claims it works for gcc-10. Reopening for backport.
[Bug rtl-optimization/98791] [10 Regression] ICE in paradoxical_subreg_p (in ira) with SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791 --- Comment #8 from avieira at gcc dot gnu.org --- Aye my bad there, Thanks for the change.
[Bug rtl-optimization/98791] [10 Regression] ICE in paradoxical_subreg_p (in ira) with SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791
avieira at gcc dot gnu.org changed:
           What|Removed |Added
         Status|REOPENED|RESOLVED
     Resolution|---     |FIXED
--- Comment #10 from avieira at gcc dot gnu.org --- Closing now as backport is done.
[Bug target/97327] -mcpu=cortex-m55 warns without -mfloat-abi=hard or -march=armv8.1-m.main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97327
avieira at gcc dot gnu.org changed:
           What|Removed |Added
             CC|        |avieira at gcc dot gnu.org
--- Comment #2 from avieira at gcc dot gnu.org --- The last two should conflict though, right? I never quite understood this warning, to be fair. My personal preference would be to warn for any invocation where both -mcpu and -march are passed, but I understand that for legacy reasons that might be undesirable. Though yeah, -mcpu=cortex-m55 with -mfloat-abi=soft should obviously not warn about anything.
[Bug target/97327] -mcpu=cortex-m55 warns without -mfloat-abi=hard or -march=armv8.1-m.main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97327 --- Comment #4 from avieira at gcc dot gnu.org --- -mcpu=cortex-m55+nomve should be equivalent to -march=armv8.1-m.main+dsp
[Bug target/97327] -mcpu=cortex-m55 warns without -mfloat-abi=hard or -march=armv8.1-m.main
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97327 --- Comment #5 from avieira at gcc dot gnu.org ---
Your other one:
    -mcpu=cortex-m55+nomve -march=armv8.1-m.main+mve -mfloat-abi=softfp
This has cpu without mve and arch with mve.
Another fun caveat to look at is in:
    -mcpu=cortex-m55 -mfloat-abi=soft
float-abi=soft disables vector instructions, so it makes sense to remove mve.fp and fp.dp/fp. However, we must make sure that +mve is still passed to the assembler because +mve enables new scalar shift instructions.
If we want to be in sync with legacy, though, I don't think we even need to look at all these complicated cases, since it seems that in the past we ignored FP extensions. Take for instance:
    arm-none-eabi-gcc -mcpu=cortex-m7 -march=armv7e-m -mfloat-abi=hard
    arm-none-eabi-gcc -mcpu=cortex-m7 -march=armv7e-m+fp -mfloat-abi=hard
    arm-none-eabi-gcc -mcpu=cortex-m7+nofp -march=armv7e-m -mfloat-abi=soft
    arm-none-eabi-gcc -mcpu=cortex-m7+nofp -march=armv7e-m+fp
None of these give the warning, so maybe the solution is to ignore MVE as well as the FP extension when checking for this? There is a bit in the warning code that says:
    /* And if the target ISA lacks floating point, ignore any extensions that depend on that. */
    if (!bitmap_bit_p (target->isa, isa_bit_vfpv2))
      bitmap_and_compl (isa_delta, isa_delta, isa_all_fpbits);
Maybe we need to 'ignore any extension that depends on mve'? But I don't quite understand how this works with the case where we do have isa_bit_vfpv2...
For Srinath's sake it would be good to agree on what the behaviour should be and then work towards that. I personally don't have a strong feeling about this other than: passing '-mcpu=cortex-m55' shouldn't give warnings ... since well that's insane :P
[Bug target/96914] missing MVE intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96914
avieira at gcc dot gnu.org changed:
           What|Removed |Added
             CC|        |avieira at gcc dot gnu.org
--- Comment #5 from avieira at gcc dot gnu.org --- Hi Christophe, The docs are right and so are you: those instructions should only have a signed variant, as the hardware instructions also only support .S suffixes or, in the case of vmlaldavax, do not support the cross 'X' variant with unsigned datatypes.
[Bug target/93053] [9 Regression] libgcc build failure with old binutils on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93053
avieira at gcc dot gnu.org changed:
           What|Removed |Added
             CC|        |avieira at gcc dot gnu.org
         Status|NEW     |RESOLVED
     Resolution|---     |FIXED
--- Comment #17 from avieira at gcc dot gnu.org --- I believe this has been fixed on all relevant branches.
[Bug target/95646] [GCC 9/10] arm-none-eabi function attribute 'cmse_nonsecure_entry' wipes register values with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95646
avieira at gcc dot gnu.org changed:
           What|Removed |Added
        Summary|arm-none-eabi function attribute 'cmse_nonsecure_entry' wipes register values with -Os |[GCC 9/10] arm-none-eabi function attribute 'cmse_nonsecure_entry' wipes register values with -Os
--- Comment #4 from avieira at gcc dot gnu.org --- Changed title to reflect that this still needs backports to GCC 9 and 10.
[Bug target/97528] [9/10 Regression] ICE in decompose_automod_address, at rtlanal.c:6298 (arm-linux-gnueabihf)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97528
avieira at gcc dot gnu.org changed:
           What|Removed |Added
             CC|        |avieira at gcc dot gnu.org
--- Comment #7 from avieira at gcc dot gnu.org --- Hi, I am seeing this same fault cause a wrong-code gen on gcc-9 with the code below:
    void foo(uint16_t *dest, uint16x8_t a, unsigned long long stride)
    {
      int i = 3;
      stride >>= 1;
      do {
        vst1_u16(dest, vget_low_u16(a));
        dest += stride;
        i = i - 1;
      } while (i != 0);
    }
leading to:
    foo:
            vst1.16 {d0}, [r0], r0
            vst1.16 {d0}, [r0], r0
            vst1.16 {d0}, [r0]
            bx lr
which is obviously wrong. Can we backport this to gcc-9?
[Bug target/97528] [9/10 Regression] ICE in decompose_automod_address, at rtlanal.c:6298 (arm-linux-gnueabihf)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97528 --- Comment #12 from avieira at gcc dot gnu.org --- @jakub: backported to gcc-8 and gcc-9. OK to close this?
[Bug middle-end/98974] New: ICE in vectorizable_condition after STMT_VINFO_VEC_STMTS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98974
            Bug ID: 98974
           Summary: ICE in vectorizable_condition after STMT_VINFO_VEC_STMTS
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---
Hi, After https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b05d5563f4be13b4a0d0951375a82adf483973c0 we found vectorizable_condition to ICE when autovectorizing for SVE. The reduced Fortran testcase is an example of this:
    $ cat foo.F90
    module module_foobar
      integer,parameter :: fp_kind = selected_real_kind(15)
    contains
      subroutine foobar( foo, ix ,jx ,kx,iy,ky)
        real, dimension( ix, kx, jx ) :: foo
        real(fp_kind), dimension( iy, ky, 3 ) :: bar, baz
        j_loop: do j=jts,enddo
          do k=0,ky
            do i=0,iy
              if ( baz(i,k,1) > 0. ) then
                bar(i,k,1) = 0
              endif
              foo(i,nk,j) = baz0 * bar(i,k,1)
            enddo
          enddo
        enddo j_loop
      end
    end
And the following command will cause it to ICE:
    $ gfortran -Ofast -mcpu=neoverse-v1 foo.F90 -S
I have debugged this and I believe the issue is that before Richi's change vectorizable_condition used to set vec_oprnds0 to vec_cond_lhs for each copy. Now it is collected for all copies at the same time. However, when calling vect_get_loop_mask we pass vec_num * ncopies as the nvectors parameter, where vec_num has been set to the length of vec_oprnds0. I believe that because we are now doing all ncopies at the same time we no longer need to multiply it by ncopies. I'll be posting a patch for this soon.
[Bug middle-end/98974] ICE in vectorizable_condition after STMT_VINFO_VEC_STMTS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98974 --- Comment #1 from avieira at gcc dot gnu.org --- The testcase above issues a warning, around
    do j=jts,enddo
To use it as a testcase in my patch I'd like to get rid of it, so if someone proficient in Fortran knows a way to get rid of it that'd be great!
[Bug middle-end/98974] [11 Regression] ICE in vectorizable_condition after STMT_VINFO_VEC_STMTS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98974
avieira at gcc dot gnu.org changed:
           What|Removed |Added
         Status|NEW     |RESOLVED
     Resolution|---     |FIXED
--- Comment #5 from avieira at gcc dot gnu.org --- That should fix it.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 98974, which changed state.
Bug 98974 Summary: [11 Regression] ICE in vectorizable_condition after STMT_VINFO_VEC_STMTS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98974
           What|Removed |Added
         Status|NEW     |RESOLVED
     Resolution|---     |FIXED
[Bug target/98657] [11 Regression] SVE: ICE (unrecognizable insn) with shift at -O3 -msve-vector-bits=256
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98657
avieira at gcc dot gnu.org changed:
           What|Removed |Added
     Resolution|---     |FIXED
         Status|NEW     |RESOLVED
--- Comment #4 from avieira at gcc dot gnu.org --- That should have fixed it. Closing.
[Bug tree-optimization/98726] [10/11 Regression] SVE: tree check: expected integer_cst, have poly_int_cst in to_wide, at tree.h:5984
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98726 --- Comment #7 from avieira at gcc dot gnu.org ---
I'm looking at this and I have a feeling there is a disconnect between how some passes define VECTOR_CST and how the expand pass handles it. So the problem here seems to lie with the V4SImode VECTOR_CST at expand time:
    { POLY_INT_CST [24, 16], POLY_INT_CST [25, 16], POLY_INT_CST [26, 16], POLY_INT_CST [27, 16] }
The problem seems to be that const_vector_from_tree only adds the first VECTOR_CST_NPATTERNS * VECTOR_CST_NELTS_PER_PATTERN elements, and this has:
    VECTOR_CST_NPATTERNS: 1
    VECTOR_CST_NELTS_PER_PATTERN: 3
The mode however dictates 4 elements (constant-sized V4SImode). So rtx_vector_builder::build adds the first three and then tries to derive the fourth (even though it is right there). At this point it fails, as it uses wi::sub and that doesn't seem to work for POLY_INTs.
This is where I started investigating how it should work. I looked at cases of actual patterns involving POLY_INTs, like:
    { POLY_INT_CST [8, 8], POLY_INT_CST [9, 8], POLY_INT_CST [10, 8], ... }
These have a VLA mode, so because there is no constant element number rtx_vector_builder::build uses the 'encoded_nelts', which are again the VECTOR_CST_NPATTERNS * VECTOR_CST_NELTS_PER_PATTERN elements, and never needs to derive a step.
I also looked at how a VECTOR_CST with N random integers is built, and there it seems VECTOR_CST_NPATTERNS * VECTOR_CST_NELTS_PER_PATTERN describes the full length of the VECTOR_CST. At this point I don't know whether the construction of the VECTOR_CST is wrong, or whether the building is; I just know there seems to be a disconnect.
There are a variety of things that we could do:
1) Change how the VECTOR_CST is being created so that VECTOR_CST_NPATTERNS * VECTOR_CST_NELTS_PER_PATTERN == GET_MODE_NUNITS (m_mode).is_constant (&nelts) for constant-sized modes.
2) Change const_vector_from_tree to check whether a POLY_INT VECTOR_CST has a constant-sized mode, construct the RTVEC_ELT itself and use rtx_vector_builder::build(rtvec v)
3) Teach rtx_vector_builder::step and apply_step how to deal with POLY_INTs
Of these, 2 is my favourite, though we should aim to look at 1 too, because I have seen a big discrepancy in how these VECTOR_CSTs are formed. I've also seen:
    {1, 1, 1, 1, 1, 1, 1, 1}
being described using:
    VECTOR_CST_NPATTERNS: 1
    VECTOR_CST_NELTS_PER_PATTERN: 3
Which is unnecessary... {1, ...} would have sufficed with both NPATTERNS and NELTS_PER_PATTERN set to 1 for instance, or make it so they multiply to 8. Unless we want this flexibility?
[Bug tree-optimization/98726] [10/11 Regression] SVE: tree check: expected integer_cst, have poly_int_cst in to_wide, at tree.h:5984
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98726 --- Comment #8 from avieira at gcc dot gnu.org --- Also at some point we should figure out why the vectorizer is generating this much code for this example, where I think it should be able to optimize it to: a = 22; b &= c;
[Bug target/86487] [8 Regression] insn does not satisfy its constraints on arm big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86487
avieira at gcc dot gnu.org changed:
           What|Removed |Added
         Status|NEW     |RESOLVED
     Resolution|---     |FIXED
--- Comment #17 from avieira at gcc dot gnu.org --- Closing as it has been backported to 8 and 7 is closed.
[Bug tree-optimization/100981] ICE in info_for_reduction, at tree-vect-loop.c:4897
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100981 --- Comment #5 from avieira at gcc dot gnu.org --- Yeah that works. Ran it as is, no abort, ran it with s/ne/eq/ and it aborts.
[Bug tree-optimization/100981] ICE in info_for_reduction, at tree-vect-loop.c:4897
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100981 --- Comment #6 from avieira at gcc dot gnu.org --- FYI Tamar asked me to make sure the instructions were being generated. I checked and they were, but they were not being used, as it decides to inline MAIN__ and inlining seems to break the complex optimization (as in, it is not applied: a missed opportunity). So for this specific test I'd use -fno-inline; it executes the fcmla instructions that way and it runs fine.
[Bug target/108442] New: arm: MVE's vld1* and vst1* do not work when __ARM_MVE_PRESERVE_USER_NAMESPACE is defined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108442
            Bug ID: 108442
           Summary: arm: MVE's vld1* and vst1* do not work when __ARM_MVE_PRESERVE_USER_NAMESPACE is defined
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---
When compiling:
    $ cat t.c
    #include <arm_mve.h>
    uint32x4_t foo (uint32_t *p)
    {
      return __arm_vld1q_u32 (p);
    }
with:
    $ arm-none-eabi-gcc -march=armv8.1-m.main+mve -mfloat-abi=hard -D__ARM_MVE_PRESERVE_USER_NAMESPACE
it will fail to compile, as __arm_vld1q_u32 is defined in arm_mve.h as calling vldrwq_u32, which will not exist when __ARM_MVE_PRESERVE_USER_NAMESPACE is defined.