[Bug tree-optimization/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #10 from alalaw01 at gcc dot gnu.org --- The stores are getting optimized out because equal_mem_array_ref_p considers as equal pairs of MEM_REFs like fmcom.x[_168] and fmcom.x[_208]. That is, an ARRAY_REF whose first operand is a COMPONENT_REF fmcom.x (of a VAR_DECL and a FIELD_DECL), and whose second operand is an SSA_NAME, _168 or _208. (I don't see anything obvious to suggest that they should be equal.) get_ref_base_and_extent then returns base=fmcom, size=64, max_size=64 (so not a variable-sized access), and offset 0 :-(.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #20 from alalaw01 at gcc dot gnu.org --- Hmmm, hang on. In unport.fppized.f, shouldn't we be using the 'F2C/GCC COMPILER ON PC RUNNING UNIX (LINUX,BSD386,ETC)' version? In which case X has size (1) everywhere?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Resolution|DUPLICATE |FIXED --- Comment #23 from alalaw01 at gcc dot gnu.org --- Well, this one is not fixed by -fno-aggressive-loop-optimizations.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Resolution|INVALID |FIXED --- Comment #27 from alalaw01 at gcc dot gnu.org ---
(In reply to Richard Biener from comment #25)
> (In reply to alalaw01 from comment #23)
> > Well, this one is not fixed by -fno-aggressive-loop-optimizations.
>
> No, that just disabled one symptom of the issue at that point in time.
> Fixing the issue also fixes this occurrence (well, I hope so ;))

So by "fixing the issue" - we mean, making --std=legacy prevent this (as although against the SPEC, colleagues with more FORTRAN knowledge than I suggest this is common)? SPEC seem to be saying they will not change the source: https://www.spec.org/cpu2006/Docs/faq.html#Run.05

As Jakub suggested in comment #13:
> So, perhaps we want some flag on the Fortran COMMON decls that would be set on
> COMMON that ends with an array and would tell get_ref_base_and_extent (and
> other spots?) that accesses can be beyond end of the decl?

but only if --std=legacy ? ? ?

Should I raise a new bug for this, as both this and 53068 are CLOSED?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #32 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
>
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c (revision 233172)
> +++ gcc/tree-dfa.c (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>        if (maxsize == -1
>            && DECL_SIZE (exp)
>            && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -        maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +        {
> +          maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +          if (maxsize == size)
> +            maxsize = -1;
> +        }
>      }
>    else if (CONSTANT_CLASS_P (exp))
>      {

Maybe if we only did that for DECL_COMMONs if -std=legacy was in force? Tho as you say:

> but that wouldn't fix the aggressive-loop optimization issue as that is
> _not_ looking at DECL_SIZE but at the array types domain.

I wonder if we can't get both places looking at the same thing (DECL_SIZE or array type domain), but I haven't looked into that at all.
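The effect of the quoted hunk can be modelled outside GCC. The following is only a sketch in plain C: it uses 64-bit integers in place of GCC's offset_int, and the name model_maxsize and its parameters are hypothetical, not part of tree-dfa.c.

```c
#include <stdint.h>

/* Hypothetical stand-alone model of the quoted hunk; not GCC code.
   All quantities are in bits.  -1 means "unknown", as it does for
   max_size in get_ref_base_and_extent.  A negative decl_size models
   a decl whose DECL_SIZE is not an INTEGER_CST.  */
int64_t
model_maxsize (int64_t maxsize, int64_t decl_size, int64_t bit_offset,
               int64_t size)
{
  if (maxsize == -1 && decl_size >= 0)
    {
      /* Bound the access by what is left of the decl ...  */
      maxsize = decl_size - bit_offset;
      /* ... but if that leaves room for exactly one element, distrust
         the bound and report the extent as unknown again.  */
      if (maxsize == size)
        maxsize = -1;
    }
  return maxsize;
}
```

Under these assumptions, a COMMON member declared as X(1) of 64-bit values gives model_maxsize (-1, 64, 0, 64) == -1, so an access x[i] would no longer be treated as x[0].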
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED |--- --- Comment #33 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #31)
> Thus a "fix" for the case where treating a[i] as a[0] is the issue
> would be
>
> Index: gcc/tree-dfa.c
> ===
> --- gcc/tree-dfa.c (revision 233172)
> +++ gcc/tree-dfa.c (working copy)
> @@ -617,7 +617,11 @@ get_ref_base_and_extent (tree exp, HOST_
>        if (maxsize == -1
>            && DECL_SIZE (exp)
>            && TREE_CODE (DECL_SIZE (exp)) == INTEGER_CST)
> -        maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +        {
> +          maxsize = wi::to_offset (DECL_SIZE (exp)) - bit_offset;
> +          if (maxsize == size)
> +            maxsize = -1;
> +        }
>      }
>    else if (CONSTANT_CLASS_P (exp))
>      {

So is there a case where we want this for C? If I declare a struct with a VLA, and access it through a pointer - GCC recognizes the VLA idiom and keeps the accesses. If I access it from a decl, yes, we optimize away the out-of-bounds accesses (in FRE, long before we reach the tree-ssa-scopedtables changes). So OK, if I access it from an extern or __attribute__((weak)) decl, which I then get the linker to replace with a bigger decl, then I get "wrong" code (it ignores the extra elements in the bigger decl) - but I'd say that was invalid code.

So if this is Fortran-only, we probably have to hook off --std=legacy, right?
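For reference, a minimal sketch of the pointer-accessed trailing-array idiom mentioned above, written here as a C99 flexible array member (an assumption about what "VLA" refers to in the comment). The type and function names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical type for illustration: a header followed by a C99
   flexible array member whose real length is chosen at allocation.  */
struct vec
{
  size_t n;
  double x[];
};

/* Allocate room for N trailing elements and fill them with 0..N-1.
   Returns NULL if the allocation fails.  */
struct vec *
vec_new (size_t n)
{
  struct vec *v = malloc (sizeof *v + n * sizeof v->x[0]);
  if (v)
    {
      v->n = n;
      for (size_t i = 0; i < n; i++)
        v->x[i] = (double) i;
    }
  return v;
}

/* Accesses go through a pointer, so the declared type does not
   bound the index and every element up to v->n is read.  */
double
vec_sum (const struct vec *v)
{
  double s = 0.0;
  for (size_t i = 0; i < v->n; i++)
    s += v->x[i];
  return s;
}
```

This form is valid C, unlike the replaced-decl case described in the comment, which relies on the linker substituting a larger object for the one the compiler saw.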
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #37 from alalaw01 at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #36)
> As Richard said, you can do similar (invalid too) stuff in C too, say:
> struct S { int a[1]; } s;
> in one TU and
> struct S { int a[1]; } s;
>
> int
> foo (int x)
> {
>   return s.a[x];
> }
>
> int
> bar (int x)
> {
>   return s.a[1 + x] + s.a[0] + s.a[x];
> }
>
> GCC 5 would compile it to what the author might have meant, while GCC 6 will
> optimize bar into s.a[0] * 3;

Yes, this was what I meant in comment #33. The question is, do we care? (Or, do we only care in the FORTRAN case?) If so, then we presumably want a -fbroken-common-blocks (or something!) that is not FE-specific.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #39 from alalaw01 at gcc dot gnu.org --- Created attachment 37726 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37726&action=edit Proposed patch (without flag).

Here's a prototype patch that sets TYPE_SIZE to NULL_TREE but leaves DECL_SIZE intact. For the moment I'm applying this universally, rather than gating it under a flag, to ease testing with check-fortran. Only gfortran.dg/gomp/appendix-a/a.24.1.f90 fails; in practice I think it's OK just not to use the new code in conjunction with -fopenmp.

On AArch64, it fixes the 416.gamess issue, and allows compiling 416.gamess without the -fno-aggressive-loop-optimizations previously required. It also bootstraps and passes check-gcc, check-fortran and check-g++ on aarch64 and x86_64, except as noted above. I expect to add a Fortran-only flag to gate the trans-common.c changes before taking this to gcc-patches@.

The worry is that, while many places in the mid-end were happy with a null TYPE_SIZE, I still had to patch up a couple, so I might not have got them all. (Indeed, omp-low.c had too many!) I'm not sure this is any worse than adding a new flag to the decl (indicating that the DECL_SIZE is not to be trusted) and then trying to find all the cases where the DECL_SIZE is wrongly relied upon - with the latter approach, the compiler would generate invalid code, rather than "failing fast".

Thoughts welcome!
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #43 from alalaw01 at gcc dot gnu.org --- Yeah, I plan to add a Fortran-specific option for this, it's easy enough, but I can't run the gfortran testsuite with that, because there are lots of C files in there too, for which the compiler doesn't accept the option...

I'm having trouble writing a testcase though. My subroutine with

      IMPLICIT DOUBLE PRECISION (X)
      COMMON /MYCOMMON / X(1)

produces "mycommon.x", a COMPONENT_REF, but with "mycommon" being a MEM_REF, which requires only the hunk to tree-dfa.c to handle correctly; whereas in SPEC2006, what looks to me to be equivalent FORTRAN ends up with "mycommon" being a VAR_DECL, which requires the much bigger patch to the Fortran FE... I've very little Fortran experience here, any tips?

Thanks, Alan
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #53 from alalaw01 at gcc dot gnu.org ---
(In reply to Thomas Koenig from comment #44)
> I don't have access to SPEC, so I can only guess... Is there maybe an
> equivalence involved, something like

Turns out the COMMON is accessed via a MEM_REF in a loop, or as a VAR_DECL inside. Go figure! :)

(In reply to Dominique d'Humieres from comment #49)
> I don't see the point to add yet another option just because "SPEC does not
> want to change the invalid Fortran". I think SPEC should be run with the
> option(s) causing the problem disabled.

Anecdotally I hear from Fortran-using colleagues this may occur in other places too. Moreover, the list of phases using get_ref_base_and_extent is long; we could end up compiling with an ever-growing -fno-this -fno-that as more and more phases make use of the "bad" analysis results (that is correct by the language spec after all). In this case, there are a few other equivalences found due to the tree-ssa-scopedtables.c changes, that we'd lose with -fno-tree-dominator-opts, too.

(In reply to H.J. Lu from comment #52)
> So, there is nothing to fix in GCC? Why isn't this bug closed as invalid?

Not everyone wants to patch SPEC sources.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #77 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #72)
> Patch as posted passed bootstrap & regtest. Adjusted according to
> comments but not tested otherwise - please somebody throw at
> unpatched 416.gamess.

Still miscompares on aarch64, I'm afraid. (Both with and without -fno-aggressive-loop-optimizations.)

Also where Jakub wrote:
> If you want to go this way, I'd at least key it off DECL_COMMON on the decl.
> And instead of multiplying max_size by 2 perhaps just add BITS_PER_UNIT?

I wonder why you prefer setting such an arbitrary guess at max_size rather than going with -1 which is defined as "unknown" ?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #79 from alalaw01 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #78)
> That would pessimize it too much IMHO.

I'm not sure how to evaluate the pessimization, given it's thought to be a widespread pseudo-FORTRAN construct; so I probably have to defer to your judgement here. However... Given maxsize of an array as two elements, say, would the compiler not be entitled to optimize an index selection down to, say, computing only the LSBit of the actual index? Whereas 'unknown' means, well, exactly what is the case. So I fear this is storing problems up for the future.

Is the concern that we can't hide this behind an option, as that would "drive people away from gfortran" ? If that's the case, can we hide it behind an option that defaults to pessimization (?? at least for fortran)??
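The concern about a two-element max_size can be illustrated with a small sketch. This is not something GCC is known to do; it only shows the kind of transformation the comment fears, and both function names are hypothetical.

```c
/* Hypothetical illustration, not GCC code.  If an array is assumed
   to have exactly two elements, every in-bounds index is 0 or 1, so
   replacing idx with (idx & 1) preserves behavior for valid code.  */
double
load_assuming_two (const double *x, unsigned long idx)
{
  return x[idx & 1];   /* what an optimizer would be entitled to emit */
}

/* The access as the source wrote it.  */
double
load_as_written (const double *x, unsigned long idx)
{
  return x[idx];
}
```

The two functions agree for indices 0 and 1 and diverge as soon as the real object is larger than the assumed bound, which is the situation with the over-indexed COMMON block.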
[Bug middle-end/66877] [6 Regression] FAIL: gcc.dg/vect/vect-over-widen-3-big-array.c -flto -ffat-lto-objects scan-tree-dump-times vect "vect_recog_over_widening_pattern: detected" 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66877 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #8 from alalaw01 at gcc dot gnu.org --- Fix committed r232720.
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from alalaw01 at gcc dot gnu.org --- Can I class this as fixed?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #82 from alalaw01 at gcc dot gnu.org --- For those who haven't seen it, I've put forward this patch on the mailing list: https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01746.html based on a suggestion from Jakub. (Unlike Richi's comment72 patch, this fixes 416.gamess on AArch64.)
[Bug bootstrap/60632] ICE in regcprop.c (copyprop_hardreg_forward_1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60632 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |RESOLVED CC||alalaw01 at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #2 from alalaw01 at gcc dot gnu.org --- Sorry, no idea...
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #84 from alalaw01 at gcc dot gnu.org --- Bah. Do you normally use -fno-aggressive-loop-optimizations? With -funknown-commons, did you try with/out aggressive loop opts? Powerpc{,64}{be,le} ? The unknown-commons testcase I included in that patch looks to pass on powerpc64le-unknown-linux-gnu. Does HJ Lu's spec source-patching work on powerpc following r232559? I am not a lawyer... but I don't think the SPEC2006 license allows me to upload onto the GCC Compile Farm and runspec. So if you could narrow down to an object file that's broken with a recent compiler and -funknown-commons, with the rest compiled with a gcc prior to r232508, that'd be very helpful - then I could see what assembly I'm changing (and what expressions equal_mem_array_ref_p is falsely declaring equivalent)...?
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 --- Comment #87 from alalaw01 at gcc dot gnu.org --- Great, many thanks for the tests, I was worried if we had hit another distinct issue. (Of course this would be better on gcc-patches!)
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #4 from alalaw01 at gcc dot gnu.org --- Hmmm. First thing I notice is that the type of d (struct S0[2]) is not scalarizable_type_p, but passes type_internals_preclude_sra_p. Changing the latter to bail out on DECL_BIT_FIELD (as the former does) fixes the ICE, but I'm not yet sure we want to do that.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #5 from alalaw01 at gcc dot gnu.org --- Prior to SRA, we have

  d = *.LC0;
  d$0$f0_7 = MEM[(struct S0[2] *)&*.LC0].f0;
  e$f0_9 = MEM[(struct S0[2] *)&d + 3B].f0;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);

sra_modify_assign for d=*.LC0 ends up in load_assign_lhs_subreplacements, where d has two children; the second is grp_to_be_replaced, but because we did not completely_scalarize LC0, there is an access to only the first half of *.LC0, and no corresponding RHS for the second half of d ('racc = find_access_in_subtree (sad->top_racc, offset, lacc->size)' returns null). So we generate the bad

  d$3$f0_14 = MEM[(struct S0[2] *)&d + 3B].f0;

that is, initializing the scalar replacement for the second half of d, with a value read from the first half of d.
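For orientation, here is a sketch of C source that could plausibly produce a dump of this shape. It is an assumed reconstruction, not the testcase attached to the bug: the 17-bit field width is chosen only so that the packed struct should occupy 3 bytes, matching the "+ 3B" offset, and the function name is hypothetical.

```c
/* Assumed reconstruction, not the attached testcase.  With 1-byte
   packing, a struct holding a single 17-bit bit-field should be
   3 bytes wide, so the second array element starts at offset 3.  */
#pragma pack(1)
struct S0
{
  unsigned f0 : 17;
};
#pragma pack()

int c;

unsigned
second_element (void)
{
  /* Initializer that a target may place in the constant pool (*.LC0).  */
  struct S0 d[] = { { 1 }, { 2 } };
  struct S0 e = d[1];   /* copy of the second element of d */
  c = d[0].f0;          /* read of the first element of d */
  return e.f0;          /* correct code returns 2 */
}
```

With the miscompilation described above, the value read for the second element would not come from the initializer, so a check that second_element () returns 2 would be the natural regression test.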
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #6 from alalaw01 at gcc dot gnu.org --- Ugh, initializing the scalar replacement for the first half of d, with a value read from the first half of d (should be from the first half of *.LC0).
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #7 from alalaw01 at gcc dot gnu.org --- *second* half, sorry. grp_to_be_replaced is here true, but grp_unscalarized_data is false, so handle_unscalarized_data_in_subtree sets sad->refreshed=UDH_LEFT and we build the access to the LHS. (Then, load_assign_lhs_subreplacements exits, and the caller sees UDH_LEFT and removes the original block move statement.) In contrast, on a similar testcase using a parameter rather than *.LC0, grp_unscalarized_data is true, handle_unscalarized_data_in_subtree sets sad->refreshed=UDH_RIGHT and we build an access to the RHS, which is OK; and leave the block move statement in place, hence correctness.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #9 from alalaw01 at gcc dot gnu.org --- In analyze_access_subtree (since r147980, "New implementation of SRA", 2009): else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL) root->grp_unscalarized_data = 1; /* not covered and written to */ adding a case for constant_decl_p alongside the PARM_DECL case, fixes the ICE; AArch64 bootstrap in progress.
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #3 from alalaw01 at gcc dot gnu.org --- So in the not-vectorized case (-DFOO=1), we get for the inner loop: : # i_27 = PHI _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto ; else goto ; : goto ; : # i_32 = PHI the loop exit phi, i_32=PHI, makes i_16=i_27+1 relevant (vec_stmt_relevant_p: used out of loop.), so we go through that on the worklist and then i_27=PHI, marking the phi as STMT_VINFO_LIVE_P, and hence "not vectorized: value used after loop". Kind of as expected, FORNOW. In the -DFOO=0 case, a bunch of loop peeling, header-copying, and other transforms, end up with this input to vectorization: : //header of inner loop # i_2 = PHI _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto ; else goto ; : goto ; : //bb 5 is only predecessor _19 = (unsigned int) i_25; _18 = (unsigned int) max_7; _17 = (unsigned int) i_25; _5 = _18 - _17; _4 = _5 + _19; _3 = _4 + 1; i_21 = (int) _3; : # i_23 = PHI //tests outer loop note bb7 use i_25, not i_2; so neither i_15 nor i_2 escape the loop, and we don't have the problem from above. (Yes bb7 is taking i_25 away from max_7 and then adding it back on again, before adding 1, to give the value of i after the inner loop.) This arrangement of multiple i's live at the same time, is not present in 107t.ch2. 130t.loopinit introduces i_21, computed by an exit phi on leaving the inner loop. 135t.sccp then changes this to the max_7-i_25+i_25 sequence which removes the dependency on i_15 and allows vectorization.
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #4 from alalaw01 at gcc dot gnu.org --- loopinit introduces the exit phi in much the same way for both -DFOO=0 and -DFOO=1, so the difference is in sccp. In the -DFOO=0 case, sccp does this (removing TODO_cleanup_cfg from pass_data_scev_cprop to make the diff easier, still vectorizes): ;; Function addlog2 (addlog2, funcdef_no=0, decl_uid=2749, cgraph_uid=0, symbol_order=0) + +final value replacement: + i_21 = PHI + with + i_21 = (int) _3; + ...[snip]... : - # i_21 = PHI + _19 = (unsigned int) i_25; + _18 = (unsigned int) max_7; + _17 = (unsigned int) i_25; + _5 = _18 - _17; + _4 = _5 + _19; + _3 = _4 + 1; + i_21 = (int) _3; In the -DFOO=1 case, sccp doesn't do anything; and adding -fno-tree-scev-cprop prevents vectorization of the -DFOO=0 case.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #10 from alalaw01 at gcc dot gnu.org --- Hmmm, so this fixes the ICE, generating:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  MEM[(struct S0[2] *)&*.LC0].f0 = SR.5_12;
  d = *.LC0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  d$0$f0_7 = SR.5_12;
  e$f0_9 = d$3$f0_14;
  _3 = (int) d$0$f0_7;
  c = _3;
  _5 = (int) e$f0_9;
  __builtin_printf ("%x\n", _5);
  d ={v} {CLOBBER};
  return 0;

which in -fdump-tree-optimized (at -O1) looks like:

  SR.5_12 = MEM[(struct S0[2] *)&*.LC0].f0;
  d$3$f0_14 = MEM[(struct S0[2] *)&*.LC0 + 3B].f0;
  _3 = (int) SR.5_12;
  c = _3;
  _5 = (int) d$3$f0_14;
  __builtin_printf ("%x\n", _5);
  return 0;

which is much saner. But I don't really understand why the PARM_DECL case that I'm adding to here is that way (since r147980 "New implementation of SRA" in 2009, https://gcc.gnu.org/ml/gcc-patches/2009-04/msg02218.html)...

Bootstrapped+regtest on AArch64 (c,c++) and ARM (c,c++,ada), no regressions. (Constants don't get pushed into the pool on x86.)

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 72157edd02e3235e57b786bbf460c94b0c52b2c5..24eac6ae7c4dcd41358b1a020047076afe1a8106 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -2427,7 +2427,8 @@ analyze_access_subtree (struct access *root, struct access *parent,
   if (!hole || root->grp_total_scalarization)
     root->grp_covered = 1;
-  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL)
+  else if (root->grp_write || TREE_CODE (root->base) == PARM_DECL
+           || constant_decl_p (root->base))
     root->grp_unscalarized_data = 1; /* not covered and written to */
   return sth_created;
 }
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #5 from alalaw01 at gcc dot gnu.org --- In the -DFOO=0 case, we have peeled an extra copy of the inner loop condition, i <= max_7, above the loop. scalar evolution (final_value_replacement_loop) works, because it sees the inner loop goes round niter = (unsigned int) max_7 - (unsigned int) i_25 iterations, and compute_overall_effect_of_inner_loop gives us (int) (((unsigned int) i_25 + ((unsigned int) max_7 - (unsigned int) i_25)) + 1) which is not expression_expensive_p, so we do it. Hence the add/subtract above. When -DFOO=1, we have not done that peeling, so niter = i_22 <= max_24 ? (unsigned int) max_24 - (unsigned int) i_22 : 0, and compute_overall_effect_of_inner_loop gives us (i_22 + 1) + (i_22 <= max_24 ? (int) ((unsigned int) max_24 - (unsigned int) i_22) : 0) which is expression_expensive_p, so we don't do the final value replacement.
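The two final-value expressions above can be checked against a direct loop. The sketch below is a hypothetical model in plain C, not the testcase from the bug; it assumes the inner loop has the do-while shape implied by the niter expressions (the body runs once, then repeats while i <= max).

```c
/* Hypothetical model of the inner loop's induction variable: the
   body runs once unconditionally, then repeats while i <= max.  */
int
final_i_by_loop (int i, int max)
{
  do
    i = i + 1;
  while (i <= max);
  return i;
}

/* -DFOO=0 shape: the peeled header test guarantees i <= max on
   entry, so niter needs no guard and the expression stays cheap.  */
int
final_i_peeled (int i, int max)
{
  unsigned niter = (unsigned) max - (unsigned) i;
  return (int) (((unsigned) i + niter) + 1);
}

/* -DFOO=1 shape: without the peeled test, niter carries a
   conditional, which is what makes the expression "expensive".  */
int
final_i_guarded (int i, int max)
{
  return (i + 1) + (i <= max ? (int) ((unsigned) max - (unsigned) i) : 0);
}
```

final_i_peeled only matches the loop when i <= max holds on entry, whereas final_i_guarded matches it for any starting value; that difference is the information the dominating test provides in the -DFOO=0 case.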
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #7 from alalaw01 at gcc dot gnu.org --- Looking at where the peeling happens. In both -DFOO=0 and -DFOO=1 cases, 107.ch2 peels the inner loop header, so there is an i<=max test in the outer loop before the inner loop. However, in the -DFOO=1 case, this is dominated by the extra i>max test (that breaks out of the outer loop), so 110.dom2 removes the peeled i<=max. Thus, just before sccp, in the -DFOO=0 case, we have: : # i_25 = PHI # j_26 = PHI max_7 = 1 << j_26; if (max_7 >= i_25) goto ; else goto ; //skip inner loop : //inner loop header # i_2 = PHI _8 = (long unsigned int) i_2; _9 = _8 * 4; _11 = data_10(D) + _9; _12 = *_11; _13 = _12 + j_26; *_11 = _13; i_15 = i_2 + 1; if (max_7 >= i_15) goto ; //cleaned, actually via latch else goto ; note the inner loop exits if !(max_7 >= i_15), and when we hit the inner loop, we know that (max_7 >= i_25). Whereas in the -DFOO=1 case: : goto ; : //in outer loop max_7 = 1 << j_17; if (max_7 < i_32) goto ; else goto ; : //outer loop header # max_24 = PHI # i_22 = PHI # j_23 = PHI : //inner loop header # i_27 = PHI _8 = (long unsigned int) i_27; _9 = _8 * 4; _11 = data_10(D) + _9; _13 = *_11; _14 = _13 + j_23; *_11 = _14; i_16 = i_27 + 1; if (i_16 <= max_24) goto ; //cleaned, actually via latch else goto ; the inner loop exits if !(max_24 >= i_16), but max_24 is defined as PHI, and we only have that max_7max) break" out of the loop, such that the outer loop now executes "if (i>max) break" after the inner loop (rather than testing "if (i>max) break" before the inner loop, as it still did following 107.ch2). So as an alternative, possibly tweaking the jump-threading/loop-peeling heuristics might help (?).
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #8 from alalaw01 at gcc dot gnu.org --- Indeed, the -DFOO=1 case vectorizes with -fno-tree-dominator-opts.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #12 from alalaw01 at gcc dot gnu.org --- Thanks, Martin - yes, I see. Patch posted at https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00680.html after full regtest.
[Bug tree-optimization/70013] [6 Regression] packed structure tree-sra loses initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70013 --- Comment #13 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Fri Mar 11 12:08:01 2016 New Revision: 234138 URL: https://gcc.gnu.org/viewcvs?rev=234138&root=gcc&view=rev Log: Fix PR/70013 gcc: PR tree-optimization/70013 * tree-sra.c (analyze_access_subtree): Also set grp_unscalarized_data for constant-pool entries. gcc/testsuite: * gcc.dg/tree-ssa/sra-20.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-20.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug middle-end/70189] New: Combine constant-pool logic from gimplify + SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70189

            Bug ID: 70189
           Summary: Combine constant-pool logic from gimplify + SRA
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---

Following PR/63679 (r232506), gimplify.c (gimplify_init_constructor) uses lots of heuristics to choose between pushing initializers out to the constant pool (by calling tree_output_constant_def) or outputting many elementwise statements.

Then, in tree-sra.c (analyze_all_variable_accesses), we use more heuristics to decide which constant-pool loads to completely_scalarize, turning those back into elementwise statements. (These get pulled back in from the constant pool and the constant-pool entry deleted.)

Both of these sets of heuristics are platform dependent (gimplify.c uses can_move_by_pieces, CLEAR_RATIO; tree-sra.c uses get_move_ratio).

Instead we should put all this logic in one place; this would make it clearer, and we'd probably get better overall decisions. The suggestion is for gimplify.c to always push out to the constant pool, as this makes the initial tree the same on all platforms, and for all the logic/heuristics to go into SRA (as, being later, we then have more information available to maybe make better decisions in the future).
[Bug target/63679] [5/6 Regression][AArch64] Failure to constant fold.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63679 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #43 from alalaw01 at gcc dot gnu.org --- I think this can be closed now? I've raised PR/70189 for the followup enhancement.
[Bug fortran/69368] [6 Regression] spec2006 test case 416.gamess fails with the g++ 6.0 compiler starting with r232508
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69368 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Assignee|alalaw01 at gcc dot gnu.org|unassigned at gcc dot gnu.org --- Comment #88 from alalaw01 at gcc dot gnu.org --- Can this now be closed, or should I leave open for possible Fortran FE warnings?
[Bug target/60825] [AArch64] int64x1_t, uint64x1_t and float64x1_t are not treated as vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60825 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jun 23 12:46:52 2014 New Revision: 211892 URL: https://gcc.gnu.org/viewcvs?rev=211892&root=gcc&view=rev Log: PR/60825 Make float64x1_t in arm_neon.h a proper vector type gcc/ChangeLog: PR target/60825 * config/aarch64/aarch64.c (aarch64_simd_mangle_map): Add entry for V1DFmode. * config/aarch64/aarch64-builtins.c (aarch64_simd_builtin_type_mode): add V1DFmode (BUILTIN_VD1): New. (BUILTIN_VD_RE): Remove. (aarch64_init_simd_builtins): Add V1DF to modes/modenames. (aarch64_fold_builtin): Update reinterpret patterns, df becomes v1df. * config/aarch64/aarch64-simd-builtins.def (create): Make a v1df variant but not df. (vreinterpretv1df*, vreinterpret*v1df): New. (vreinterpretdf*, vreinterpret*df): Remove. * config/aarch64/aarch64-simd.md (aarch64_create, aarch64_reinterpret*): Generate V1DFmode pattern not DFmode. * config/aarch64/iterators.md (VD_RE): Include V1DF, remove DF. (VD1): New. * config/aarch64/arm_neon.h (float64x1_t): typedef with gcc extensions. (vcreate_f64): Remove cast, use v1df builtin. (vcombine_f64): Remove cast, get elements with gcc vector extensions. (vget_low_f64, vabs_f64, vceq_f64, vceqz_f64, vcge_f64, vgfez_f64, vcgt_f64, vcgtz_f64, vcle_f64, vclez_f64, vclt_f64, vcltz_f64, vdup_n_f64, vdupq_lane_f64, vld1_f64, vld2_f64, vld3_f64, vld4_f64, vmov_n_f64, vst1_f64): Use gcc vector extensions. (vget_lane_f64, vdupd_lane_f64, vmulq_lane_f64, ): Use gcc extensions, add range check using __builtin_aarch64_im_lane_boundsi. (vfma_lane_f64, vfmad_lane_f64, vfma_laneq_f64, vfmaq_lane_f64, vfms_lane_f64, vfmsd_lane_f64, vfms_laneq_f64, vfmsq_lane_f64): Fix type signature, use gcc vector extensions. 
(vreinterpret_p8_f64, vreinterpret_p16_f64, vreinterpret_f32_f64, vreinterpret_f64_f32, vreinterpret_f64_p8, vreinterpret_f64_p16, vreinterpret_f64_s8, vreinterpret_f64_s16, vreinterpret_f64_s32, vreinterpret_f64_s64, vreinterpret_f64_u8, vreinterpret_f64_u16, vreinterpret_f64_u32, vreinterpret_f64_u64, vreinterpret_s8_f64, vreinterpret_s16_f64, vreinterpret_s32_f64, vreinterpret_s64_f64, vreinterpret_u8_f64, vreinterpret_u16_f64, vreinterpret_u32_f64, vreinterpret_u64_f64): Use v1df builtin not df. gcc/testsuite/ChangeLog: * g++.dg/abi/mangle-neon-aarch64.C: Also test mangling of float64x1_t. * gcc.target/aarch64/aapcs/test_64x1_1.c: New test. * gcc.target/aarch64/aapcs/func-ret-64x1_1.c: New test. * gcc.target/aarch64/simd/ext_f64_1.c (main): Compare vector elements. * gcc.target/aarch64/vadd_f64.c: Rewrite with macro to use vector types. * gcc.target/aarch64/vsub_f64.c: Likewise. * gcc.target/aarch64/vdiv_f.c (INDEX*, RUN_TEST): Remove indexing scheme as now the same for all variants. * gcc.target/aarch64/vrnd_f64_1.c (compare_f64): Return float64_t not float64x1_t. Added: trunk/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c trunk/gcc/testsuite/gcc.target/aarch64/aapcs64/test_64x1_1.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64-builtins.c trunk/gcc/config/aarch64/aarch64-simd-builtins.def trunk/gcc/config/aarch64/aarch64-simd.md trunk/gcc/config/aarch64/aarch64.c trunk/gcc/config/aarch64/arm_neon.h trunk/gcc/config/aarch64/iterators.md trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/abi/mangle-neon-aarch64.C trunk/gcc/testsuite/gcc.target/aarch64/simd/ext_f64_1.c trunk/gcc/testsuite/gcc.target/aarch64/vadd_f64.c trunk/gcc/testsuite/gcc.target/aarch64/vdiv_f.c trunk/gcc/testsuite/gcc.target/aarch64/vrnd_f64_1.c trunk/gcc/testsuite/gcc.target/aarch64/vsub_f64.c
[Bug target/60825] [AArch64] int64x1_t, uint64x1_t and float64x1_t are not treated as vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60825 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Jun 23 14:07:42 2014 New Revision: 211894 URL: https://gcc.gnu.org/viewcvs?rev=211894&root=gcc&view=rev Log: PR/60825 Make {int,uint}64x1_t in arm_neon.h a proper vector type gcc/ChangeLog: PR target/60825 * config/aarch64/aarch64-builtins.c (aarch64_types_unop_qualifiers): Ignore third operand if present by marking qualifier_internal. * config/aarch64/aarch64-simd-builtins.def (abs): Comment. * config/aarch64/arm_neon.h (int64x1_t, uint64x1_t): Typedef to GCC vector extension. (aarch64_vget_lane_s64, aarch64_vdup_lane_s64, arch64_vdupq_lane_s64, aarch64_vdupq_lane_u64): Remove macro. (vqadd_s64, vqadd_u64, vqsub_s64, vqsub_u64, vqneg_s64, vqabs_s64, vcreate_s64, vcreate_u64, vreinterpret_s64_f64, vreinterpret_u64_f64, vcombine_u64, vbsl_s64, vbsl_u64, vceq_s64, vceq_u64, vceqz_s64, vceqz_u64, vcge_s64, vcge_u64, vcgez_s64, vcgt_s64, vcgt_u64, vcgtz_s64, vcle_s64, vcle_u64, vclez_s64, vclt_s64, vclt_u64, vcltz_s64, vdup_n_s64, vdup_n_u64, vld1_s64, vld1_u64, vmov_n_s64, vmov_n_u64, vqdmlals_lane_s32, vqdmlsls_lane_s32, vqdmulls_lane_s32, vqrshl_s64, vqrshl_u64, vqrshl_u64, vqshl_s64, vqshl_u64, vqshl_n_s64, vqshl_n_u64, vqshl_n_s64, vqshl_n_u64, vqshlu_n_s64, vrshl_s64, vrshl_u64, vrshr_n_s64, vrshr_n_u64, vrsra_n_s64, vrsra_n_u64, vshl_n_s64, vshl_n_u64, vshl_s64, vshl_u64, vshr_n_s64, vshr_n_u64, vsli_n_s64, vsli_n_u64, vsqadd_u64, vsra_n_s64, vsra_n_u64, vsri_n_s64, vsri_n_u64, vst1_s64, vst1_u64, vtst_s64, vtst_u64, vuqadd_s64): Wrap existing logic in GCC vector extensions (vpaddd_s64, vaddd_s64, vaddd_u64, vceqd_s64, vceqd_u64, vceqzd_s64 vceqzd_u64, vcged_s64, vcged_u64, vcgezd_s64, vcgtd_s64, vcgtd_u64, vcgtzd_s64, vcled_s64, vcled_u64, vclezd_s64, vcltd_s64, vcltd_u64, vcltzd_s64, vqdmlals_s32, vqdmlsls_s32, vqmovnd_s64, vqmovnd_u64 vqmovund_s64, vqrshld_s64, vqrshld_u64, vqrshrnd_n_s64, vqrshrnd_n_u64, vqrshrund_n_s64, 
vqshld_s64, vqshld_u64, vqshld_n_u64, vqshrnd_n_s64, vqshrnd_n_u64, vqshrund_n_s64, vrshld_u64, vrshrd_n_u64, vrsrad_n_u64, vshld_n_u64, vshld_s64, vshld_u64, vslid_n_u64, vsqaddd_u64, vsrad_n_u64, vsrid_n_u64, vsubd_s64, vsubd_u64, vtstd_s64, vtstd_u64): Fix type signature. (vabs_s64): Use GCC vector extensions; call __builtin_aarch64_absdi. (vget_high_s64, vget_high_u64): Reimplement with GCC vector extensions. (__GET_LOW, vget_low_u64): Wrap result using vcreate_u64. (vget_low_s64): Use __GET_LOW macro. (vget_lane_s64, vget_lane_u64, vdupq_lane_s64, vdupq_lane_u64): Use gcc vector extensions, add call to __builtin_aarch64_lane_boundsi. (vdup_lane_s64, vdup_lane_u64,): Add __builtin_aarch64_lane_bound_si. (vdupd_lane_s64, vdupd_lane_u64): Fix type signature, add __builtin_aarch64_lane_boundsi, use GCC vector extensions. (vcombine_s64): Use GCC vector extensions; remove cast. (vqaddd_s64, vqaddd_u64, vqdmulls_s32, vqshld_n_s64, vqshlud_n_s64, vqsubd_s64, vqsubd_u64, vrshld_s64, vrshrd_n_s64, vrsrad_n_s64, vshld_n_s64, vshrd_n_s64, vslid_n_s64, vsrad_n_s64, vsrid_n_s64): Fix type signature; remove cast. gcc/testsuite/ChangeLog: * g++.dg/abi/mangle-neon-aarch64.C (f22, f23): New tests of [u]int64x1_t. * gcc.target/aarch64/aapcs64/func-ret-64x1_1.c: Add {u,}int64x1 cases. * gcc.target/aarch64/aapcs64/test_64x1_1.c: Likewise. 
* gcc.target/aarch64/scalar_intrinsics.c (test_vaddd_u64, test_vaddd_s64, test_vceqd_s64, test_vceqzd_s64, test_vcged_s64, test_vcled_s64, test_vcgezd_s64, test_vcged_u64, test_vcgtd_s64, test_vcltd_s64, test_vcgtzd_s64, test_vcgtd_u64, test_vclezd_s64, test_vcltzd_s64, test_vqaddd_u64, test_vqaddd_s64, test_vqdmlals_s32, test_vqdmlsls_s32, test_vqdmulls_s32, test_vuqaddd_s64, test_vsqaddd_u64, test_vqmovund_s64, test_vqmovnd_s64, test_vqmovnd_u64, test_vsubd_u64, test_vsubd_s64, test_vqsubd_u64, test_vqsubd_s64, test_vshld_s64, test_vshld_u64, test_vrshld_s64, test_vrshld_u64, test_vshrd_n_s64, test_vshrd_n_u64, test_vsrad_n_s64, test_vsrad_n_u64, test_vrshrd_n_s64, test_vrshrd_n_u64, test_vrsrad_n_s64, test_vrsrad_n_u64, test_vqrshld_s64, test_vqrshld_u64, test_vqshlud_n_s64, test_vqshld_s64, test_vqshld_u64, test_vqshld_n_u64, test_vqshrund_n_s64, test_vqrshrund_n_s64, test_vqshrnd_n_s64, test_vqshrnd_n_u64, test_vqrshrnd_n_s64, test_vqrshrnd_n_u64, test_vshld_n_s64, test_vshdl_n_u64, test_vslid_n_s64, test_vslid_n_u64, test_vsrid_n_s64, test_vsrid_n_u64): Fix signature to match intrinsic. (test_vabs_s64): Remove. (test_vaddd_s64_2, test_vsubd_s64_2): Use force_simd. (test_vdupd_lane_s64): Rename to... (test_vdupd_laneq_s64): ...and remove a call to force_simd. (test_vdupd_lane_u64): R
[Bug testsuite/65506] [5 Regression] FAIL: gcc.dg/pr29215.c scan-tree-dump-not gimple "memcpy"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65506 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #8 from alalaw01 at gcc dot gnu.org --- This test was also failing for target arm-none-eabi, also fixed by Jakub's r221607.
[Bug libstdc++/33394] Add test case for Thread race segfault in std::string::append with -O and -s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33394 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Wed Mar 25 15:46:58 2015 New Revision: 221666 URL: https://gcc.gnu.org/viewcvs?rev=221666&root=gcc&view=rev Log: PR libstdc++/33394 * testsuite/21_strings/basic_string/pthread33394.cc: Use dg-additional-options. Modified: trunk/libstdc++-v3/ChangeLog trunk/libstdc++-v3/testsuite/21_strings/basic_string/pthread33394.cc
[Bug target/65689] New: [AArch64] S constraint fails for inline asm at -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65689 Bug ID: 65689 Summary: [AArch64] S constraint fails for inline asm at -O0 Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: minor Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Starting with r221532 (https://gcc.gnu.org/ml/gcc-patches/2015-03/msg01064.html), void test (void) { __asm__ ("@ %c0" : : "S" (&test + 4)); } fails to compile at -O0 on all aarch64 targets with: c-output-template-3.c: In function 'test': c-output-template-3.c:7:5: error: impossible constraint in 'asm' __asm__ ("@ %c0" : : "S" (&test + 4)); (This is gcc.target/aarch64/c-output-template-3.c, without the -O added in r221905, as that leads to successful compilation - however, the testcase should compile without -O too.)
[Bug target/65689] [AArch64] S constraint fails for inline asm at -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65689 --- Comment #1 from alalaw01 at gcc dot gnu.org ---

Problem stems from parse_input_constraint (in stmt.c):

    if (reg_class_for_constraint (cn) != NO_REGS
        || insn_extra_address_constraint (cn))
      *allows_reg = true;
    else if (insn_extra_memory_constraint (cn))
      *allows_mem = true;
    else
      {
        /* Otherwise we can't assume anything about the nature of
           the constraint except that it isn't purely registers.
           Treat it like "g" and hope for the best.  */
        *allows_reg = true;
        *allows_mem = true;
      }

which causes expand_asm_operands to use (reg/f:DI ...), which fails the definition of the S constraint. If instead parse_input_constraint set both allows_reg and allows_mem to false (as it does for e.g. an "i" constraint, via a special-case), expand_asm_operands would follow the register to its definition:

    (const:DI (plus:DI (symbol_ref:DI ("test") [flags 0x3] ) (const_int 4 [0x4])))

(as also happens with -O), which satisfies the S constraint. One solution could be to generalize the special case in parse_input_constraint.
[Bug target/65689] [5 Regression][AArch64] S constraint fails for inline asm at -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65689 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Priority|P1 |P2
[Bug target/65689] [5 Regression][AArch64] S constraint fails for inline asm at -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65689 --- Comment #6 from alalaw01 at gcc dot gnu.org ---

Whilst I think this probably would fix the problem - surely this will change the meaning of loads of constraints, on loads of platforms? I will of course defer to the release manager(s) (!), but IMHO this feels rather risky to do at this late stage, i.e. potentially "the cure is worse than the disease"...?

Secondly, do I understand correctly that the constraint-parsing mechanism will only come into play for plain ol' define_constraints, whereas define_register_constraint / define_memory_constraint would provide/override with their own values? Does this still leave us with consistent meaning for all three kinds of define...constraint?
[Bug target/65689] [5 Regression][AArch64] S constraint fails for inline asm at -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65689 --- Comment #8 from alalaw01 at gcc dot gnu.org --- Well, meaning/behaviour. But thanks for the patch - I've bootstrapped and check-gcc'd on AArch64 and arm hf (Cortex-A15 + Neon) with no regressions.
[Bug target/65770] New: [AArch64] vst2_lane broken on bigendian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65770 Bug ID: 65770 Summary: [AArch64] vst2_lane broken on bigendian Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target: aarch64_be

Testcase:

    #include <arm_neon.h>
    #include <stdlib.h>

    void
    test_vst2_lane_s32 (int32x2x2_t vals)
    {
      int32_t buf[2];
      vst2_lane_s32 (buf, vals, 0);
      for (int i = 0; i < 2; i++)
        if (buf[i] != vget_lane_s32 (vals.val[i], 0))
          abort ();
    }

    int
    main (int argc, char **argv)
    {
      int32_t load[4] = { 11, 12, 21, 22 };
      test_vst2_lane_s32 (vld2_s32 (load));
    }

Passes on aarch64-none-elf, but fails on aarch64_be-none-elf: the generated assembly contains

    st2 {v0.s - v1.s}[3], [x1]

which (1) has flipped endianness, and (2) has flipped endianness relative to a Q register (int32x4_t) not a D register (int32x2_t). A similar testcase for int32x4x2_t also flips endianness, although at least relative to the right vector length ;).
[Bug target/64134] (vector float){0, 0, b, a} Uses stores when it does not need to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64134 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Apr 20 10:29:26 2015 New Revision: 29 URL: https://gcc.gnu.org/viewcvs?rev=29&root=gcc&view=rev Log: [AArch64] PR/64134: Make aarch64_expand_vector_init use 'ins' more often gcc/: PR target/64134 * config/aarch64/aarch64.c (aarch64_expand_vector_init): Load constant and overwrite variable parts if <= 1/2 the elements are variable. gcc/testsuite/: PR target/64134 * gcc.target/aarch64/vec_init_1.c: New test. Added: trunk/gcc/testsuite/gcc.target/aarch64/vec_init_1.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.c trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/35226] Induction with multiplication are not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35226 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-04-30 CC||alalaw01 at gcc dot gnu.org Version|4.3.0 |6.0 Summary|Reduction and induction |Induction with |with multiplication are not |multiplication are not |vectorized |vectorized Ever confirmed|0 |1 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Multiplication reductions are supported, certainly in gcc 4.9, I think longer. However, the following induction does not vectorize on gcc 6 development branch (x86_64, -O3, with or without -mavx or -msse2): int a[24]; int main (int argc, char **argv) { int p = 1; for (int i = 0; i < 24; i++, p*=2) a[i] *= p; } -fdump-tree-vect-details suggests the multiplication is recognized as a reduction but not as an induction: test_induc.c:7:3: note: Analyze phi: p_13 = PHI test_induc.c:7:3: note: reduction used in loop. test_induc.c:7:3: note: Unknown def-use cycle pattern. test_induc.c:7:3: note: === vect_pattern_recog === test_induc.c:7:3: note: vect_is_simple_use: operand _5 test_induc.c:7:3: note: def_stmt: _5 = a[i_14]; test_induc.c:7:3: note: type of def: 3. test_induc.c:7:3: note: vect_is_simple_use: operand p_13 test_induc.c:7:3: note: def_stmt: p_13 = PHI test_induc.c:7:3: note: Unsupported pattern. ... test_induc.c:7:3: note: def_stmt: p_13 = PHI test_induc.c:7:3: note: Unsupported pattern. test_induc.c:7:3: note: not vectorized: unsupported use in stmt. test_induc.c:7:3: note: unexpected pattern. test_induc.c:4:1: note: vectorized 0 loops in function. ... : # p_13 = PHI ... p_9 = p_13 * 2;
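For illustration only, here is roughly what a vectorized form of that multiplicative induction could look like when written by hand with GCC vector extensions: the vector of p values starts at {1, 2, 4, 8} and is multiplied by 2^4 on each vector iteration. This is a sketch of the idea, not what the vectorizer emits; the names v4si, scale_by_powers_scalar, scale_by_powers_vector and scale_by_powers_agree are made up for this example, and it assumes the values stay small enough not to overflow int.

```c
#include <string.h>

#define N 24

/* Four-lane int vector (GCC/Clang vector extension); hypothetical name.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Scalar loop from the report: p is a multiplicative induction.  */
void
scale_by_powers_scalar (int *a)
{
  int p = 1;
  for (int i = 0; i < N; i++, p *= 2)
    a[i] *= p;
}

/* Hand-vectorized sketch: each lane holds one of four consecutive
   values of p, and the whole vector advances by a factor of 2^4.  */
void
scale_by_powers_vector (int *a)
{
  v4si p = { 1, 2, 4, 8 };
  const v4si step = { 16, 16, 16, 16 };
  for (int i = 0; i < N; i += 4)
    {
      v4si v;
      memcpy (&v, &a[i], sizeof v);   /* unaligned vector load */
      v *= p;
      memcpy (&a[i], &v, sizeof v);   /* unaligned vector store */
      p *= step;
    }
}

/* Returns 1 if both versions give the same result on a small input.  */
int
scale_by_powers_agree (void)
{
  int x[N], y[N];
  for (int i = 0; i < N; i++)
    x[i] = y[i] = i + 1;
  scale_by_powers_scalar (x);
  scale_by_powers_vector (y);
  return memcmp (x, y, sizeof x) == 0;
}
```

The point is only that the induction has a closed-form vector step, in the same way an additive induction does.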
[Bug middle-end/65946] New: Simple loop with if-statement not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946 Bug ID: 65946 Summary: Simple loop with if-statement not vectorized Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: x86_64 This testcase: #define N 32 int a[N], b[N]; int foo () { for (int i = 0; i < N ; i++) { int m = (a[i] & i) ? 5 : 4; b[i] = a[i] * m; } } does not vectorize at -O3 on x86_64 or other platforms. Following dom1, jump threading partially peels the loop to give: : goto ; : # i_11 = PHI _5 = a[i_11]; _6 = i_11 & _5; if (_6 != 0) goto ; else goto ; : : # m_14 = PHI <5(4), 4(3)> : # m_2 = PHI # _15 = PHI <_5(5), _10(8)> # i_16 = PHI _7 = m_2 * _15; b[i_16] = _7; i_9 = i_16 + 1; if (i_9 != 32) goto ; else goto ; : return; : # i_1 = PHI <0(2)> _10 = a[i_1]; _3 = i_1 & _10; goto ; which form cannot be if-converted (tree-if-conv.c): /* If one of the loop header's edge is an exit edge then do not apply if-conversion. */ FOR_EACH_EDGE (e, ei, loop->header->succs) if (loop_exit_edge_p (loop, e)) return false; and even if it were, the PHI nodes at loop entry cannot be handled by the vectorizer.
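As a reading aid, the shape of the loop after jump threading can be written back as C, assuming I have read the PHI structure of the dump correctly: the first iteration knows that (a[0] & 0) is zero, so m starts at 4, and the exit test ends up in the middle of the loop. The function names below (foo_original, foo_threaded, foo_shapes_agree) are hypothetical, and the arrays are passed as parameters rather than being globals as in the testcase.

```c
#include <string.h>

#define N 32

/* Loop from the testcase, with the arrays passed in.  */
void
foo_original (const int *a, int *b)
{
  for (int i = 0; i < N; i++)
    {
      int m = (a[i] & i) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

/* Approximate C rendering of the threaded form: m, the loaded value
   and i are all carried into the loop header, which now contains the
   store, and the exit test sits before the next load.  */
void
foo_threaded (const int *a, int *b)
{
  int i = 0;
  int v = a[0];
  int m = 4;                    /* (a[0] & 0) == 0, so m is known.  */
  for (;;)
    {
      b[i] = v * m;
      i++;
      if (i == N)
        break;
      v = a[i];
      m = (v & i) ? 5 : 4;
    }
}

/* Returns 1 if both shapes compute the same b[].  */
int
foo_shapes_agree (void)
{
  int a[N], b1[N], b2[N];
  for (int i = 0; i < N; i++)
    a[i] = 3 * i + 1;
  foo_original (a, b1);
  foo_threaded (a, b2);
  return memcmp (b1, b2, sizeof b1) == 0;
}
```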
[Bug middle-end/65946] Simple loop with if-statement not vectorized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65946 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2015-04-30 Assignee|unassigned at gcc dot gnu.org |alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Discussion here: https://gcc.gnu.org/ml/gcc/2015-04/msg00351.html Suggestion is to use loop-header-copying to rotate the loop to a form that both if-conversion and the vectorizer can handle.
[Bug middle-end/65947] New: Vectorizer misses conditional assignment of constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65947 Bug ID: 65947 Summary: Vectorizer misses conditional assignment of constant Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- This testcase: int a[32]; int main(int argc, char **argv) { int res = 3; for (int i = 0; i < 32; i++) if (a[i]) res = 7; return res; } does not vectorize at -O3 on x86_64 or aarch64. tree-if-conversion succeeds, giving a loop of form: : # res_10 = PHI # i_11 = PHI # ivtmp_9 = PHI _5 = a[i_11]; res_1 = _5 != 0 ? 7 : res_10; i_6 = i_11 + 1; ivtmp_2 = ivtmp_9 - 1; if (ivtmp_2 != 0) goto ; else goto ; : goto ; but -fdump-tree-vect-details shows: test.c:9:3: note: Analyze phi: res_10 = PHI test.c:9:3: note: reduction: not commutative/associative: res_1 = _5 != 0 ? 7 : res_10; test.c:9:3: note: Unknown def-use cycle pattern. ... test.c:9:3: note: vect_is_simple_use: operand res_10 test.c:9:3: note: def_stmt: res_10 = PHI test.c:9:3: note: Unsupported pattern. test.c:9:3: note: not vectorized: unsupported use in stmt. test.c:9:3: note: unexpected pattern.
[Bug middle-end/65947] Vectorizer misses conditional assignment of constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65947 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2015-04-30 Assignee|unassigned at gcc dot gnu.org |alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Of course, the conditional assignment _is_ commutative and associative (wrt reordering iterations).
[Bug target/65951] New: [AArch64] Will not vectorize multiplication by long constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951 Bug ID: 65951 Summary: [AArch64] Will not vectorize multiplication by long constant Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64

This loop:

    void
    foo (long *arr)
    {
      for (int i = 0; i < 256; i++)
        arr[i] *= 19594L;
    }

will not vectorize on AArch64, but does on x86. On AArch64, -fdump-tree-vect-details reveals:

    test.c:4:3: note: ==> examining statement: _9 = _8 * 19594;
    test.c:4:3: note: vect_is_simple_use: operand _8
    test.c:4:3: note: def_stmt: _8 = *_7;
    test.c:4:3: note: type of def: 3.
    test.c:4:3: note: vect_is_simple_use: operand 19594
    test.c:4:3: note: op not supported by target.
    test.c:4:3: note: not vectorized: relevant stmt not supported: _9 = _8 * 19594;

On x86, vectorization fails with vectorization_factor = 4 (V4DI), but succeeds at V2DI. We could vectorize this on AArch64 even if we have to perform a multiple-instruction load of that constant (invariant!) before the loop...right?
[Bug target/65952] New: [AArch64] Will not vectorize copying pointers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 Bug ID: 65952 Summary: [AArch64] Will not vectorize copying pointers Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64 typedef struct { int a, b, c, d; } my_struct; my_struct *array; my_struct *ptrs[4]; void loop () { for (int i = 0; i < 4; i++) ptrs[i] = &array[i]; } Vectorizes on x86, but not on AArch64. From -fdump-tree-vect-details: test.c:13:3: note: vectorization factor = 4 ... test.c:13:3: note: not vectorized: relevant stmt not supported: _6 = _5 * 16; test.c:13:3: note: bad operation or unsupported loop bound. test.c:13:3: note: * Re-trying analysis with vector size 8 ... test.c:13:3: note: not vectorized: no vectype for stmt: ptrs[i_12] = _7; scalar_type: struct my_struct * test.c:13:3: note: bad data references. test.c:11:1: note: vectorized 0 loops in function.
[Bug target/65951] [AArch64] Will not vectorize 64bit integer multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Yes you are right, we have no V2DI multiply. We do have V2DI shifts + add, however, which would work well for some constants, e.g. the multiply by 16 in PR/65952; perhaps the vectorizer does not consider such possibilities (whereas we do for scalar code).
[Bug target/65952] [AArch64] Will not vectorize storing induction of pointer addresses for LP64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65952 --- Comment #4 from alalaw01 at gcc dot gnu.org ---

Hmmm. Yes. Well, x * 16 = x << 4, of course. Or, in theory something like VRP could let us see that

    # i_12 = PHI
    # ivtmp_18 = PHI
    _5 = (long unsigned int) i_12;
    _6 = _5 * 16;
    _7 = pretmp_11 + _6;
    ptrs[i_12] = _7;
    i_9 = i_12 + 1;

could be rewritten to something like

    # i_12 = PHI
    # ivtmp_18 = PHI
    _5 = i_12 * 16;
    _6 = (long unsigned int) _5;
    _7 = pretmp_11 + _6;
    ptrs[i_12] = _7;
    i_9 = i_12 + 1;

which would then be vectorizable.
[Bug middle-end/65962] New: Missed vectorization of strided stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962 Bug ID: 65962 Summary: Missed vectorization of strided stores Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- This does not vectorize at -O3 on x86_64/-mavx or aarch64: int loop (int *data) { int tot = 0; for (int i = 0; i < 256; i++) data[i * 2] += 7; return tot; } -fdump-tree-vect-details reveals: loadstore.c:6:3: note: === vect_analyze_data_ref_accesses === loadstore.c:6:3: note: Detected single element interleaving *_8 step 8 loadstore.c:6:3: note: Data access with gaps requires scalar epilogue loop loadstore.c:6:3: note: not consecutive access *_8 = _10; loadstore.c:6:3: note: not vectorized: complicated access pattern. loadstore.c:6:3: note: bad data access. However, a similar testcase that only reads from those locations, vectorizes ok: int loop_12 (int *data) { int tot = 0; for (int i = 0; i < 256; i++) tot += data[i * 2]; return tot; } blocksort.c:6:3: note: === vect_analyze_data_ref_accesses === blocksort.c:6:3: note: Detected single element interleaving *_7 step 8 blocksort.c:6:3: note: Data access with gaps requires scalar epilogue loop
[Bug middle-end/65962] Missed vectorization of strided stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962 --- Comment #1 from alalaw01 at gcc dot gnu.org --- I believe this is a known issue, but have not identified an existing PR.
[Bug middle-end/65963] New: Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 Bug ID: 65963 Summary: Missed vectorization of loads strided with << when equivalent * succeeds Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: ---

This testcase does not vectorize at -O3 on x86_64/-mavx or AArch64:

    void
    loop (int *in, int *out)
    {
      for (int i = 0; i < 256; i++)
        {
          out[i] = in[i << 1] + 7;
        }
    }

-fdump-tree-vect-details reveals:

    Creating dr for *_12
    analyze_innermost: failed: evolution of base is not affine.
        base_address:
        offset from base address:
        constant offset from base address:
        step:
        aligned to:
        base_object: *_12

However, this testcase succeeds:

    void
    loop (int *in, int *out)
    {
      for (int i = 0; i < 256; i++)
        {
          out[i] = in[i * 2] + 7;
        }
    }

with the relevant extract of -fdump-tree-vect-details showing:

    Creating dr for *_12
    analyze_innermost: success.
        base_address: in_11(D)
        offset from base address: 0
        constant offset from base address: 0
        step: 8
        aligned to: 256
        base_object: *in_11(D)
        Access function 0: {0B, +, 8}_1

The only difference is the multiplication:

    $ diff splice{,2}.c.131t.ifcvt
    27c27
    < _8 = i_19 * 2;
    ---
    > _8 = i_19 << 1;
    $
[Bug middle-end/65965] New: Straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65965 Bug ID: 65965 Summary: Straight-line memcpy/memset not vectorized when equivalent loop is Product: gcc Version: 5.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: ---

Testcase:

    void
    test (int *__restrict__ a, int *__restrict__ b)
    {
      a[0] = b[0];
      a[1] = b[1];
      a[2] = b[2];
      a[3] = b[3];
      a[5] = 0;
      a[6] = 0;
      a[7] = 0;
      a[8] = 0;
    }

produces (at -O3) on AArch64:

    test:
            ldp     w4, w3, [x1]
            ldp     w2, w1, [x1, 8]
            stp     w4, w3, [x0]
            stp     w2, w1, [x0, 8]
            stp     wzr, wzr, [x0, 20]
            stp     wzr, wzr, [x0, 28]
            ret

or on x86_64/-mavx:

    test:
    .LFB0:
            movl    (%rsi), %eax
            movl    $0, 20(%rdi)
            movl    $0, 24(%rdi)
            movl    $0, 28(%rdi)
            movl    $0, 32(%rdi)
            movl    %eax, (%rdi)
            movl    4(%rsi), %eax
            movl    %eax, 4(%rdi)
            movl    8(%rsi), %eax
            movl    %eax, 8(%rdi)
            movl    12(%rsi), %eax
            movl    %eax, 12(%rdi)
            ret

(there is no -fdump-tree-vect)

In contrast, with the testcase

    void
    test (int *__restrict__ a, int *__restrict__ b)
    {
      for (int i = 0; i < 4; i++)
        a[i] = b[i];
      for (int i = 0; i < 4; i++)
        a[i+4] = 0;
    }

the memcpy is recognized by ldist, and the 'memset' by slp1 (neither of which triggers on the first case), producing (superior) AArch64:

    test:
            movi    v0.4s, 0
            ldp     x2, x3, [x1]
            stp     x2, x3, [x0]
            str     q0, [x0, 16]
            ret

or x86_64:

    test:
    .LFB0:
            movq    (%rsi), %rax
            movq    8(%rsi), %rdx
            vpxor   %xmm0, %xmm0, %xmm0
            movq    %rax, (%rdi)
            movq    %rdx, 8(%rdi)
            vmovups %xmm0, 16(%rdi)
            ret
[Bug target/65951] [AArch64] Will not vectorize 64bit integer multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65951 --- Comment #5 from alalaw01 at gcc dot gnu.org --- I believe the definitive algorithm for converting multiply-by-constant into adds+shifts(+etc.) lives in expmed.c. I don't at present have a plan for how to reuse that, but if we could do so _in_some_form_ then that would be the ideal??
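To make the shifts+adds idea concrete: 19594 is 2^14 + 2^11 + 2^10 + 2^7 + 2^3 + 2^1, so the multiply can be written as six shifts and five adds. The sketch below is the naive one-add-per-set-bit decomposition, which is an assumption for illustration and deliberately not the expmed.c algorithm (that one can find cheaper sequences). The function names mul_by_shift_add and mul_19594 are made up here, and unsigned arithmetic is used so that the shifts are well defined.

```c
#include <stdint.h>

/* Multiply x by c using only shifts and adds: one add per set bit
   of c.  Results wrap modulo 2^64, as an unsigned multiply does.  */
uint64_t
mul_by_shift_add (uint64_t x, uint64_t c)
{
  uint64_t result = 0;
  for (unsigned bit = 0; c != 0; bit++, c >>= 1)
    if (c & 1)
      result += x << bit;
  return result;
}

/* The constant from this PR, expanded by hand:
   19594 = 2^14 + 2^11 + 2^10 + 2^7 + 2^3 + 2^1.  */
uint64_t
mul_19594 (uint64_t x)
{
  return (x << 14) + (x << 11) + (x << 10)
         + (x << 7) + (x << 3) + (x << 1);
}
```

Each of these operations has a V2DI counterpart (shift and add), which is why this form could be vectorized where the plain multiply cannot.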
[Bug middle-end/65947] Vectorizer misses conditional assignment of constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65947 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Yeah, you're right, it's not commutative, but then, it doesn't need to be. If f(x,y) is "(a[x] ? 7 : y)", then f(0, f(1, ...)) = f(1, f(0, ...)) (associative but not commutative), which is all we need to reorder the iterations of the loop? So if at the end of the loop we have a vector v_tmp_result = { f(8, f(4, f(0, ))), f(9, f(5, f(1, ))), f(10, f(6, f(2, ))), f(11, f(7, f(3, ))) } obtained by standard technique for reductions, we then need to reduce the vector to a scalar, which could be (a) if any of the vector elements are equal to the constant 7, then return the constant 7, else the initial value: cond_expr (vec_reduc_or (vec_equals (v_tmp_result, 7)), 7, ) indeed you might just vectorize to get the predicates v_tmp2 = { a[8] | a[4] | a[0], a[9] | a[5] | a[1], a[10] | a[6] | a[2], a[11] | a[7] | a[3] } and then reduce to scalar with cond_expr (vec_reduc_or (v_tmp2), 7, 3) (b) alternatively one could exploit the initial value (3) also being a constant and choose an appropriate operator from {max, min, or, and}, e.g. for 3 and 7 either reduc_max_expr(3,7) or reduc_or_expr(3,7) would work.
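A scalar emulation of the "vectorize the predicates, then reduce" variant of (a) may make this easier to follow. It is only a sketch in plain C with four emulated lanes, using the constants 3 and 7 from the testcase; the names cond_assign_scalar, cond_assign_lanes and cond_assign_check are hypothetical.

```c
#define N 32
#define LANES 4

/* Scalar loop from the testcase.  */
int
cond_assign_scalar (const int *a)
{
  int res = 3;
  for (int i = 0; i < N; i++)
    if (a[i])
      res = 7;
  return res;
}

/* Emulated vector version: each lane ORs together the predicates of
   the elements it sees, then the lanes are OR-reduced to one scalar
   that selects between the two constants.  */
int
cond_assign_lanes (const int *a)
{
  int any[LANES] = { 0, 0, 0, 0 };
  for (int i = 0; i < N; i += LANES)
    for (int l = 0; l < LANES; l++)
      any[l] |= (a[i + l] != 0);

  int reduced = 0;                /* stands in for vec_reduc_or */
  for (int l = 0; l < LANES; l++)
    reduced |= any[l];
  return reduced ? 7 : 3;         /* stands in for the cond_expr */
}

/* Sets a[set_index] nonzero (or nothing if set_index is out of range)
   and returns the common result, or -1 if the two versions differ.  */
int
cond_assign_check (int set_index)
{
  int a[N] = { 0 };
  if (set_index >= 0 && set_index < N)
    a[set_index] = 42;
  int s = cond_assign_scalar (a);
  int v = cond_assign_lanes (a);
  return s == v ? s : -1;
}
```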
[Bug tree-optimization/46029] -ftree-loop-if-convert-stores causes FAIL: libstdc++-v3/testsuite/ext/pb_ds/example/tree_intervals.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46029 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #4 from alalaw01 at gcc dot gnu.org --- I'm still seeing this problem with -ftree-loop-if-convert-stores, introducing faults by converting conditional to unconditional loads. It doesn't look as if Sebastian Pop's patches went in (after being approved, https://gcc.gnu.org/ml/gcc-patches/2010-11/msg01670.html). Can anyone shed any light on this?
[Bug tree-optimization/57558] Issue with number of iterations calculation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57558 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-05-06 CC||alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Seeing this too. Is another approach to fall back to an alternative (scalar?) path (perhaps just the epilogue?) if we can tell at the beginning of the loop that the iteration count will be infinite?
[Bug target/67439] ICE: unrecognizable insn compiling arm-fp16 testcases with -march=armv7-a and -mrestrict-it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67439 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2015-09-03 CC||alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from alalaw01 at gcc dot gnu.org --- I can reproduce the ICE with -mthumb, both "-mfloat-abi=hard -mfpu=neon" and "-mfloat-abi=soft", but only with -mrestrict-it in both cases. "-mfloat-abi=hard -mfpu=neon-fp16" is OK with and without -mrestrict-it. I note the movhf patterns in vfp.md are only usable with neon-fp16; in other cases, we appear to be using arm32_movhf in arm.md.
[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870 --- Comment #10 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Tue Sep 8 19:43:39 2015 New Revision: 227557 URL: https://gcc.gnu.org/viewcvs?rev=227557&root=gcc&view=rev Log: [ARM/AArch64 Testsuite] Add float16 lane_f16_indices tests PR target/63870 * gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c: New. * gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c: New.
Added: trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4_lane_f16_indices_1.c trunk/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_f16_indices_1.c Modified: trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/67283] GCC regression over inlining of returned structures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283 --- Comment #13 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Fri Sep 18 10:55:11 2015 New Revision: 227901 URL: https://gcc.gnu.org/viewcvs?rev=227901&root=gcc&view=rev Log: completely_scalarize arrays as well as records. gcc/: PR tree-optimization/67283 * tree-sra.c (type_consists_of_records_p): Rename to... (scalarizable_type_p): ...this, add case for ARRAY_TYPE. (completely_scalarize_record): Rename to... (completely_scalarize): ...this, add ARRAY_TYPE case, move some code to: (scalarize_elem): New. (analyze_all_variable_accesses): Follow renamings. gcc/testsuite/: * gcc.dg/tree-ssa/sra-15.c: New. * gcc.dg/tree-ssa/sra-16.c: New. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-15.c trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-16.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
[Bug middle-end/65965] Straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65965 --- Comment #4 from alalaw01 at gcc dot gnu.org ---

(In reply to Richard Biener from comment #3)
> Fixed for GCC 6.

Indeed. I note that the same testcase does _not_ SLP/vectorize if I use consecutive indices:

    void
    test (int*__restrict a, int*__restrict b)
    {
      a[0] = b[0];
      a[1] = b[1];
      a[2] = b[2];
      a[3] = b[3];
      a[4] = 0;
      a[5] = 0;
      a[6] = 0;
      a[7] = 0;
    }

    loop26a.c:6:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
    loop26a.c:6:13: note: original stmt *a_4(D) = _3;
    loop26a.c:6:13: note: === vect_slp_analyze_data_ref_dependences ===
    loop26a.c:6:13: note: === vect_slp_analyze_operations ===
    loop26a.c:6:13: note: not vectorized: bad operation in basic block.

Worth another bug?
[Bug tree-optimization/67681] New: Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 Bug ID: 67681 Summary: Missed vectorization: induction variable used after loop Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- The inner loop here: void addlog2 (int *data) { int i = 1; for (int j=0; j<=30; j++) { int max = 1 << j; if (FOO && i>max) break; for (; i <= max; i++) data[i] += j; } } does not vectorize if the if(FOO...) is present: $ /work/alalaw01/build-aarch64-none-elf/install/bin/aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=1 | grep vectorized loop9b.c:1:6: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: not vectorized: value used after loop. loop9b.c:1:6: note: vectorized 0 loops in function. $ aarch64-none-elf-gcc -S -O2 -ftree-vectorize -fdump-tree-vect-details=stdout loop9b.c -DFOO=0 | grep vectorized loop9b.c:4:3: note: not vectorized: inner-loop count not invariant. loop9b.c:8:5: note: === vect_mark_stmts_to_be_vectorized === loop9b.c:8:5: note: loop vectorized loop9b.c:1:6: note: vectorized 1 loops in function. Same with -O3. Of course clever analysis could figure out that i>max is never true, but even without that, we should be able to get 'i' back afterwards.
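The last point, getting 'i' back after the inner loop, can be sketched in plain C. This is only an illustration of the arithmetic, not a description of what the vectorizer does: both function names are hypothetical, and the closed form assumes max < INT_MAX so that max + 1 does not overflow.

```c
/* Reference: run the scalar loop and return the value of i it leaves behind.
   Hypothetical helper, for illustration only. */
int final_i_by_loop(int i, int max)
{
    for (; i <= max; i++) {
        /* loop body omitted; only the induction variable matters here */
    }
    return i;
}

/* Closed form for the same value, assuming max < INT_MAX:
   if the loop runs at all it stops at max + 1, otherwise i is unchanged. */
int final_i_closed_form(int i, int max)
{
    return i <= max ? max + 1 : i;
}
```

With a value like this available, a use of 'i' after the loop would not need the scalar induction variable to be kept live through the vectorized body.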
[Bug tree-optimization/67682] New: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682 Bug ID: 67682 Summary: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: aarch64 This code: void test (int*__restrict a, int*__restrict b) { a[0] = b[0]; a[1] = b[1]; a[2] = b[2]; a[3] = b[3]; a[4] = 0; a[5] = 0; a[6] = 0; a[7] = 0; } is not vectorized; -fdump-tree-slp-details reveals test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0; test.c:4:13: note: original stmt *a_4(D) = _3; test.c:4:13: note: === vect_slp_analyze_data_ref_dependences === test.c:4:13: note: === vect_slp_analyze_operations === test.c:4:13: note: not vectorized: bad operation in basic block. test.c:4:13: note: * Re-trying analysis with vector size 8 ... test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0; test.c:4:13: note: original stmt *a_4(D) = _3; test.c:4:13: note: === vect_slp_analyze_data_ref_dependences === test.c:4:13: note: === vect_slp_analyze_operations === test.c:4:13: note: not vectorized: bad operation in basic block. (the failure with vector size 8 is expected, but vector size 4 should succeed) Output is: test: ldp w4, w3, [x1] ldp w2, w1, [x1, 8] stp w4, w3, [x0] stp w2, w1, [x0, 8] stp wzr, wzr, [x0, 16] stp wzr, wzr, [x0, 24] ret Curiously, similar code that writes elements a[0..3] and a[5..8] (missing out a[4]) is SLP'd, producing superior: test: ldr q0, [x1] movi v1.4s, 0 str q1, [x0, 20] str q0, [x0] ret And similarly for (equivalent to the first): void test (int*__restrict a, int*__restrict b) { for (int i = 0; i < 4; i++) a[i] = b[i]; for (int i = 4; i < 8; i++) a[i] = 0; } producing: test: movi v0.4s, 0 ldp x2, x3, [x1] stp x2, x3, [x0] str q0, [x0, 16] ret
[Bug tree-optimization/67683] New: Missed vectorization: shifts of an induction variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683 Bug ID: 67683 Summary: Missed vectorization: shifts of an induction variable Product: gcc Version: 6.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Blocks: 53947 Target Milestone: --- This testcase: void test (unsigned char *data, int max) { unsigned short val = 0xcdef; for(int i = 0; i < max; i++) { data[i] = (unsigned char)(val & 0xff); val >>= 1; } } does not vectorize on AArch64 or x86_64 at -O3. (I haven't yet looked at whether it's a mid-end deficiency or both back-ends are missing patterns.) Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/67681] Missed vectorization: induction variable used after loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67681 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Being stupid here, but why does the outer loop having multiple exits matter - it's the inner loop that should be vectorized? FOO was a macro used to selectively make the test i>max disappear (enabling vectorization) - the two command lines had -DFOO=0 (vectorizes) and -DFOO=1 (doesn't).
[Bug tree-optimization/57558] Loop not vectorized if iteration count could be infinite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57558 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Here's another example, extracted from another benchmark - it vectorizes if INDEX is defined to 'long' but not if INDEX is 'short': #include <stdlib.h> unsigned char *t_run_test(unsigned char *in, int N) { unsigned char *out = malloc (N); for (unsigned INDEX i = 1; i < (N - 1); i++) out[i] = ((3 * in[i]) - in[i - 1] - in[i + 1]); return out; } However, -Wunsafe-loop-optimizations doesn't give us anything useful here: (successful case, warning printed) $ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=long -S -Wunsafe-loop-optimizations -fdump-tree-vect-details=stdout | grep vectorized bmark2.c:7:3: note: === vect_mark_stmts_to_be_vectorized === bmark2.c:7:3: note: loop vectorized bmark2.c:3:16: note: vectorized 1 loops in function. bmark2.c: In function 't_run_test': bmark2.c:3:16: warning: cannot optimize loop, the loop counter may overflow [-Wunsafe-loop-optimizations] unsigned char *t_run_test(unsigned char *in, int N) (unsuccessful case, no warning) $ aarch64-none-elf-gcc -O3 bmark2.c -DINDEX=short -S -Wunsafe-loop-optimizations -fdump-tree-vect-details=stdout | grep vectorized bmark2.c:7:3: note: not vectorized: number of iterations cannot be computed. bmark2.c:3:16: note: vectorized 0 loops in function.
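Why the 'short' index makes the iteration count hard to compute can be shown with a small C sketch. It is an illustration under stated assumptions, not GCC's analysis: the helper names are hypothetical, 'unsigned short' is assumed to be narrower than 'int' (so i is promoted to int for the comparison), and n is assumed to be greater than INT_MIN so that n - 1 does not overflow.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: does
       for (unsigned short i = 1; i < n - 1; i++)
   terminate?  i can never exceed USHRT_MAX; once it gets there it wraps
   to 0, so the loop only exits if the bound n - 1 is within that range. */
bool short_index_loop_terminates(int n)
{
    return (long long)n - 1 <= (long long)USHRT_MAX;
}

/* Hypothetical helper: count iterations of the same loop, giving up
   (returning -1) once 'cap' iterations have been executed. */
long count_iterations(int n, long cap)
{
    long count = 0;
    for (unsigned short i = 1; i < n - 1; i++) {
        if (++count >= cap)
            return -1;   /* assume the loop is not going to finish */
    }
    return count;
}
```

For a small n the loop runs n - 2 times, but for any n - 1 above USHRT_MAX the counter wraps before reaching the bound and the loop never exits, so there is no single expression for the trip count that is valid for every N.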
[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683 alalaw01 at gcc dot gnu.org changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35226 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Is there a way to do this kind of thing other than extending polynomial_chrecs to understand operations other than addition? Whilst beneficial, that looks to be quite a large task.
[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112 --- Comment #2 from alalaw01 at gcc dot gnu.org --- So (a << CONSTANT) is not equivalent to a * (1<
[Bug middle-end/68112] [6 Regression] FAIL: gcc.target/i386/avx512ifma-vpmaddhuq-2.c (test for excess errors)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68112 --- Comment #4 from alalaw01 at gcc dot gnu.org --- Sure, but GCC exploits the undefinedness of signed multiply overflow, so rewriting a shift as a multiply is not equivalent in the general case :(. One way forward might be to make definedness of overflow a bit finer-grained (either on types, i.e. TYPE_OVERFLOW_DEFINED, or maybe as a property of chrecs?)
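The distinction can be illustrated in plain C. This is a minimal sketch rather than anything from GCC's sources: all function names are hypothetical, and 'long long' is assumed to be wider than 'int'. In unsigned arithmetic a shift by c and a multiply by 2^c agree for every in-range c because both wrap modulo 2^N; the signed multiply, by contrast, can overflow, which is undefined behaviour in C.

```c
#include <limits.h>

/* Hypothetical helpers, for illustration only. */
unsigned shl_unsigned(unsigned a, unsigned c)      { return a << c; }
unsigned mul_pow2_unsigned(unsigned a, unsigned c) { return a * (1u << c); }

/* Returns 1 if shift and multiply agree for every valid shift count. */
int agree_for_all_shifts(unsigned a)
{
    const unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    for (unsigned c = 0; c < width; c++)
        if (shl_unsigned(a, c) != mul_pow2_unsigned(a, c))
            return 0;
    return 1;
}

/* Returns 1 if the signed product a * 2^c would not fit in an int, i.e.
   the multiply would be undefined.  Valid for c <= 30; the check is done
   in long long, which is assumed to be wider than int. */
int signed_mul_pow2_overflows(int a, unsigned c)
{
    long long product = (long long)a * (1LL << c);
    return product > INT_MAX || product < INT_MIN;
}
```

So a rewrite of the shift as a multiply is safe when the multiply is carried out in an unsigned type, whereas a signed multiply would hand the optimizers an overflow assumption that the original shift never made.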
[Bug tree-optimization/68165] New: Not constant-folding setting vector element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165 Bug ID: 68165 Summary: Not constant-folding setting vector element Product: gcc Version: 6.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- I believe these two C functions are equivalent: typedef float __attribute__((__vector_size__ (2 * sizeof(float)))) float32x2_t; float32x2_t test_cprop () { float32x2_t vec = {0.0, 0.0}; vec[0] = 3.14f; vec[1] = 2.71f; return vec * ((float32x2_t) { 1.5f, 4.5f }); } float32x2_t test_cprop2 () { float32x2_t vec = {3.14f, 2.71f}; return vec * ((float32x2_t) { 1.5f, 4.5f }); } at -O3 -fdump-tree-optimized, on AArch64: = ;; Function test_cprop (test_cprop, funcdef_no=0, decl_uid=2603, cgraph_uid=0, symbol_order=0) test_cprop () { float32x2_t vec; vector(2) float vec.0_5; float32x2_t _6; : vec = { 0.0, 0.0 }; BIT_FIELD_REF = 3.141049041748046875e+0; BIT_FIELD_REF = 2.7103814697265625e+0; vec.0_5 = vec; _6 = vec.0_5 * { 1.5e+0, 4.5e+0 }; vec ={v} {CLOBBER}; return _6; } ;; Function test_cprop2 (test_cprop2, funcdef_no=1, decl_uid=2607, cgraph_uid=1, symbol_order=1) test_cprop2 () { : return { 4.7103814697265625e+0, 1.219499969482421875e+1 }; } = x86 is identical for test_cprop2, worse in test_cprop: = test_cprop () { float32x2_t vec; vector(2) float vec.0_5; float32x2_t _6; float _8; float _9; float _10; float _11; : vec = { 0.0, 0.0 }; BIT_FIELD_REF = 3.141049041748046875e+0; BIT_FIELD_REF = 2.7103814697265625e+0; vec.0_5 = vec; _8 = BIT_FIELD_REF ; _9 = _8 * 1.5e+0; _10 = BIT_FIELD_REF ; _11 = _10 * 4.5e+0; _6 = {_9, _11}; vec ={v} {CLOBBER}; return _6; } = i.e. we are not understanding the result of assigning to the BIT_FIELD_REF on the whole vector, although we can resolve individual elements: float32x2_t test_cprop3 () { float32x2_t vec = {0.0, 0.0}; vec[0] = 3.14f; vec[1] = 2.71f; return (float32x2_t) {vec[0], vec[1]} * ((float32x2_t) { 1.5f, 4.5f }); } produces = test_cprop3 () { : return { 4.7103814697265625e+0, 1.219499969482421875e+1 }; }
[Bug tree-optimization/68165] Not constant-folding setting vector element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68165 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from alalaw01 at gcc dot gnu.org --- Seems like a duplicate of 56118 to me. *** This bug has been marked as a duplicate of bug 56118 ***
[Bug tree-optimization/56118] Piecewise vector / complex initialization from constants not combined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56118 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #5 from alalaw01 at gcc dot gnu.org --- *** Bug 68165 has been marked as a duplicate of this bug. ***
[Bug rtl-optimization/68182] New: ICE in reorder_basic_blocks_simple building libitm/beginend.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182 Bug ID: 68182 Summary: ICE in reorder_basic_blocks_simple building libitm/beginend.cc Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Host: x86_64 Target: x86_64 Preprocessed source attached; command-line $ /work/alalaw01/build/./gcc/xg++ -B/work/alalaw01/build/./gcc/ -mrtm -O1 -g -m32 -c temp.ii /work/alalaw01/src/gcc/libitm/beginend.cc: In static member function ‘static uint32_t GTM::gtm_thread::begin_transaction(uint32_t, const gtm_jmpbuf*)’: /work/alalaw01/src/gcc/libitm/beginend.cc:400:1: internal compiler error: in operator[], at vec.h:714 } ^ 0x1310783 vec::operator[](unsigned int) /work/alalaw01/src/gcc/gcc/vec.h:714 0x1310783 reorder_basic_blocks_simple /work/alalaw01/src/gcc/gcc/bb-reorder.c:2322 0x1310783 reorder_basic_blocks /work/alalaw01/src/gcc/gcc/bb-reorder.c:2450 0x1310783 execute /work/alalaw01/src/gcc/gcc/bb-reorder.c:2551
[Bug rtl-optimization/68182] ICE in reorder_basic_blocks_simple building libitm/beginend.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68182 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Created attachment 36636 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36636&action=edit Preprocessed source (compressed)
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Thu Nov 5 18:39:38 2015 New Revision: 229825 URL: https://gcc.gnu.org/viewcvs?rev=229825&root=gcc&view=rev Log: [PATCH] tree-scalar-evolution.c: Handle LSHIFT by constant gcc/: PR tree-optimization/65963 * tree-scalar-evolution.c (interpret_rhs_expr): Try to handle LSHIFT_EXPRs as equivalent unsigned MULT_EXPRs. gcc/testsuite/: * gcc.dg/pr68112.c: New. * gcc.dg/vect/vect-strided-shift-1.c: New. Added: trunk/gcc/testsuite/gcc.dg/pr68112.c trunk/gcc/testsuite/gcc.dg/vect/vect-strided-shift-1.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-scalar-evolution.c
[Bug tree-optimization/65963] Missed vectorization of loads strided with << when equivalent * succeeds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65963 --- Comment #4 from alalaw01 at gcc dot gnu.org --- I confirm the testcase fails execution on armeb-none-eabi (also at -O0), but it does so both with and without the patch to tree-scalar-evolution.c, which did not change codegen (at -O2 -ftree-vectorize; the loop was not vectorized). So this looks to be exposing a different, pre-existing, bug.
[Bug c/68385] New: ICE building libstdc++ on arm-none-eabi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68385 Bug ID: 68385 Summary: ICE building libstdc++ on arm-none-eabi Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Target Milestone: --- Target: arm-none-eabi Created attachment 36738 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36738&action=edit Reduced testcase Starting with r230365, building gcc for arm-none-eabi falls over in libstdc++ with: /work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc/xgcc -shared-libgcc -B/work/alalaw01/build-arm-none-eabi/obj/gcc2/./gcc -nostdinc++ -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/src/.libs -L/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/libsupc++/.libs -B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/bin/ -B/work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/lib/ -isystem /work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/include -isystem /work/alalaw01/build-arm-none-eabi/install/arm-none-eabi/sys-include -I/work/alalaw01/src/gcc/libstdc++-v3/../libgcc -I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include/arm-none-eabi -I/work/alalaw01/build-arm-none-eabi/obj/gcc2/arm-none-eabi/libstdc++-v3/include -I/work/alalaw01/src/gcc/libstdc++-v3/libsupc++ -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=eh_personality.lo -O2 -g -c /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc -o eh_personality.o /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc: In function '_Unwind_Reason_Code __cxxabiv1::__gxx_personality_v0(_Unwind_State, _Unwind_Control_Block*, _Unwind_Context*)': /work/alalaw01/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc:394:26: internal 
compiler error: tree check: expected integer_cst, have nop_expr in decompose, at tree.h:5123 UNWIND_STACK_REG)) ^ 0xf8d589 tree_check_failed(tree_node const*, char const*, int, char const*, ...) /work/alalaw01/src/gcc/gcc/tree.c:9587 0x10df3fd tree_check /work/alalaw01/src/gcc/gcc/tree.h:3212 0x10df3fd wi::int_traits::decompose(long*, unsigned int, tree_node const*) /work/alalaw01/src/gcc/gcc/tree.h:5123 0x10df3fd wide_int_ref_storage /work/alalaw01/src/gcc/gcc/wide-int.h:936 0x10df3fd generic_wide_int /work/alalaw01/src/gcc/gcc/wide-int.h:714 0x10df3fd generic_simplify_172 /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:6142 0x1113507 generic_simplify_EQ_EXPR /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22841 0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312 0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:9138 0xa227b2 fold_build2_stat_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:12333 0x10e00cd generic_simplify_46 /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:2014 0x1112b27 generic_simplify_EQ_EXPR /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:22441 0x111d719 generic_simplify(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/build-arm-none-eabi/obj/gcc2/gcc/generic-match.c:25312 0xa182c8 fold_binary_loc(unsigned int, tree_code, tree_node*, tree_node*, tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:9138 0xa3ec75 fold(tree_node*) /work/alalaw01/src/gcc/gcc/fold-const.c:11973 0x5bdff3 build_new_op_1 /work/alalaw01/src/gcc/gcc/cp/call.c:5730 0x5be299 build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int) /work/alalaw01/src/gcc/gcc/cp/call.c:5803 0x70f42f build_x_binary_op(unsigned 
int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int) /work/alalaw01/src/gcc/gcc/cp/typeck.c:3828 0x6e3b39 cp_parser_binary_expression /work/alalaw01/src/gcc/gcc/cp/parser.c:8621 0x6e3cdc cp_parser_assignment_expression /work/alalaw01/src/gcc/gcc/cp/parser.c:8742 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <http://gcc.gnu.org/bugs.html> for instructions. Reduced testcase attached: $ arm-none-eabi-gcc -c reduced.cc reduced.cc: In function 'bool __gxx_personality_v0(_Unwind_State, _Unwind_Control_Block*, _Unwind_Context*)': re
[Bug tree-optimization/68549] [6 Regression] ICE: in verify_loop_structure, at cfgloop.c:1669
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68549 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #8 from alalaw01 at gcc dot gnu.org --- Here's another testcase, reduced from value.c in gdb - ICEs at -O2 on (at least) x86_64 and AArch64: typedef long unsigned int size_t; extern void *xmalloc (size_t) __attribute__ ((__malloc__)) __attribute__ ((__returns_nonnull__)); struct __jmp_buf_tag { }; extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask) __attribute__ ((__nothrow__)); typedef struct __jmp_buf_tag sigjmp_buf[1]; extern sigjmp_buf *exceptions_state_mc_init (void); extern int exceptions_state_mc_action_iter (void); extern void printf_unfiltered (const char *, ...) ; extern struct gdbarch *get_current_arch (void); struct internalvar { struct internalvar *next; }; static struct internalvar *internalvars; struct internalvar * create_internalvar (const char *name) { struct internalvar *var = ((struct internalvar *) xmalloc (sizeof (struct internalvar))); internalvars = var; } void show_convenience () { struct gdbarch *gdbarch = get_current_arch (); int varseen = 0; for (struct internalvar *var = internalvars; var; var = var->next) { if (!varseen) varseen = 1; sigjmp_buf *buf = exceptions_state_mc_init (); __sigsetjmp ( (*buf), 1); while (exceptions_state_mc_action_iter ()) while (exceptions_state_mc_action_iter ()) ; } if (!varseen) printf_unfiltered ( "" ); }
[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870 --- Comment #7 from alalaw01 at gcc dot gnu.org --- I'm doing some of the ARM work atm, but not sure how far I'll get before stage 4 starts.
[Bug target/64893] [5 Regression] ICE while doing a bootstrap with the latest compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64893 alalaw01 at gcc dot gnu.org changed: What|Removed |Added CC||alalaw01 at gcc dot gnu.org --- Comment #7 from alalaw01 at gcc dot gnu.org --- This feels like we are working around a deficiency in the C++ frontend, which is a shame, but if we have to, then seems to me like an OK way to do so.
[Bug target/64997] New: [AArch64] Illegal EON on SIMD registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64997 Bug ID: 64997 Summary: [AArch64] Illegal EON on SIMD registers Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Testcase: #include <arm_neon.h> #define force_simd(V1) asm volatile ("mov %d0, %1.d[0]" \ : "=w"(V1) \ : "w"(V1) \ : /* No clobbers */) int foo(int64x1_t val4, int64x1_t val6, int64x1_t val7) { int64x1_t val5 = vbic_s64 (val4, veor_s64 (val6, vsri_n_s64 (val6, val7, 13))); force_simd (val5); return vget_lane_s64 (val5, 0) == 0 ? 1 : 0; } generates an illegal assembly instruction (eon v1, v3, v1 -- EON works only on General-Purpose Registers) at -O1 and higher.
[Bug target/64997] [AArch64] Illegal EON on SIMD registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64997 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2015-02-10 Assignee|unassigned at gcc dot gnu.org |alalaw01 at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from alalaw01 at gcc dot gnu.org --- Results from split condition of xor_one_cmpl pattern using 'which_alternative' variable, which is not defined in split phase.
[Bug target/64997] [AArch64] Illegal EON on SIMD registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64997 --- Comment #2 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Wed Feb 25 14:20:13 2015 New Revision: 220969 URL: https://gcc.gnu.org/viewcvs?rev=220969&root=gcc&view=rev Log: [AArch64] Fix illegal assembly 'eon v1, v2, v3' PR target/64997 * config/aarch64/aarch64.md (*xor_one_cmpl3): Use FP_REGNUM_P as split condition; force split via '#' in output pattern. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.md
[Bug target/64997] [AArch64] Illegal EON on SIMD registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64997 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #3 from alalaw01 at gcc dot gnu.org --- Fixed in r220969.
[Bug tree-optimization/61114] Scalar evolution hides a big-endian const-folding bug.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114 --- Comment #9 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Oct 27 14:04:43 2014 New Revision: 216736 URL: https://gcc.gnu.org/viewcvs?rev=216736&root=gcc&view=rev Log: [Vectorizer] Make REDUC_xxx_EXPR tree codes produce a scalar result PR tree-optimization/61114 * expr.c (expand_expr_real_2): For REDUC_{MIN,MAX,PLUS}_EXPR, add extract_bit_field around optab result. * fold-const.c (fold_unary_loc): For REDUC_{MIN,MAX,PLUS}_EXPR, produce scalar not vector. * tree-cfg.c (verify_gimple_assign_unary): Check result vs operand type for REDUC_{MIN,MAX,PLUS}_EXPR. * tree-vect-loop.c (vect_analyze_loop): Update comment. (vect_create_epilog_for_reduction): For direct vector reduction, use result of tree code directly without extract_bit_field. * tree.def (REDUC_MAX_EXPR, REDUC_MIN_EXPR, REDUC_PLUS_EXPR): Update comment. Modified: trunk/gcc/ChangeLog trunk/gcc/expr.c trunk/gcc/fold-const.c trunk/gcc/tree-cfg.c trunk/gcc/tree-vect-loop.c trunk/gcc/tree.def
[Bug tree-optimization/61114] Scalar evolution hides a big-endian const-folding bug.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114 --- Comment #10 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Mon Oct 27 14:20:52 2014 New Revision: 216737 URL: https://gcc.gnu.org/viewcvs?rev=216737&root=gcc&view=rev Log: Add new optabs for reducing vectors to scalars PR tree-optimization/61114 * doc/md.texi (Standard Names): Add reduc_(plus,[us](min|max))|scal optabs, and note in reduc_[us](plus|min|max) to prefer the former. * expr.c (expand_expr_real_2): Use reduc_..._scal if available, fall back to old reduc_... + BIT_FIELD_REF only if not. * optabs.c (optab_for_tree_code): for REDUC_(MAX,MIN,PLUS)_EXPR, return the reduce-to-scalar (reduc_..._scal) optab. (scalar_reduc_to_vector): New. * optabs.def (reduc_smax_scal_optab, reduc_smin_scal_optab, reduc_plus_scal_optab, reduc_umax_scal_optab, reduc_umin_scal_optab): New. * optabs.h (scalar_reduc_to_vector): Declare. * tree-vect-loop.c (vectorizable_reduction): Look for optabs reducing to either scalar or vector. Modified: trunk/gcc/ChangeLog trunk/gcc/doc/md.texi trunk/gcc/expr.c trunk/gcc/optabs.c trunk/gcc/optabs.def trunk/gcc/optabs.h trunk/gcc/tree-vect-loop.c
[Bug target/59843] ICE with return of generic vector on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59843 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #13 from alalaw01 at gcc dot gnu.org --- Fixed on trunk in r211502 and backported to 4.9.
[Bug target/63950] New: [AArch64] ICE at -O0 on vld1_lane intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63950 Bug ID: 63950 Summary: [AArch64] ICE at -O0 on vld1_lane intrinsics Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: alalaw01 at gcc dot gnu.org Testcase: #include <arm_neon.h> int8x8_t f_vld1_lane (int8_t * p, int8x8_t v) { int8x8_t res; res = vld1_lane_s8 (p, v, 1); return res; } $ aarch64-none-elf-gcc -S test.c In file included from neon_const_range_tests/vld1.c:2:0: /work/alalaw01/sbuild/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h: In function 'f_vld1_lane': /work/alalaw01/sbuild/install/lib/gcc/aarch64-none-elf/5.0.0/include/arm_neon.h:658:10: internal compiler error: in aarch64_simd_lane_bounds, at config/aarch64/aarch64.c:8394 return __aarch64_vset_lane_any (__vec, __index, __elem, 8); ^ 0xf9419d aarch64_simd_lane_bounds(rtx_def*, long, long) /work/alalaw01/svn/gcc/gcc/config/aarch64/aarch64.c:8394 0xfffcf4 gen_aarch64_im_lane_boundsi(rtx_def*, rtx_def*) /work/alalaw01/svn/gcc/gcc/config/aarch64/aarch64-simd.md:4524 0x7cc10e insn_gen_fn::operator()(rtx_def*, rtx_def*) const /work/alalaw01/svn/gcc/gcc/recog.h:303 0xf9a366 aarch64_simd_expand_args /work/alalaw01/svn/gcc/gcc/config/aarch64/aarch64-builtins.c:970 0xf9a703 aarch64_simd_expand_builtin(int, tree_node*, rtx_def*) /work/alalaw01/svn/gcc/gcc/config/aarch64/aarch64-builtins.c:1051 0xf9ac21 aarch64_expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int) /work/alalaw01/svn/gcc/gcc/config/aarch64/aarch64-builtins.c:1133 Seems to be caused by lack of constant propagation at -O0; it compiles fine at -O1 and higher.
[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870 --- Comment #3 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Tue Dec 9 20:23:36 2014 New Revision: 218536 URL: https://gcc.gnu.org/viewcvs?rev=218536&root=gcc&view=rev Log: [AArch64]Remove be_checked_get_lane, check bounds with __builtin_aarch64_im_lane_boundsi. gcc/: PR target/63870 * config/aarch64/aarch64-simd-builtins.def (be_checked_get_lane): Delete. * config/aarch64/aarch64-simd.md (aarch64_be_checked_get_lane): Delete. * config/aarch64/arm_neon.h (aarch64_vget_lane_any): Use GCC vector extensions, __aarch64_lane, __builtin_aarch64_im_lane_boundsi. (__aarch64_vget_lane_f32, __aarch64_vget_lane_f64, __aarch64_vget_lane_p8, __aarch64_vget_lane_p16, __aarch64_vget_lane_s8, __aarch64_vget_lane_s16, __aarch64_vget_lane_s32, __aarch64_vget_lane_s64, __aarch64_vget_lane_u8, __aarch64_vget_lane_u16, __aarch64_vget_lane_u32, __aarch64_vget_lane_u64, __aarch64_vgetq_lane_f32, __aarch64_vgetq_lane_f64, __aarch64_vgetq_lane_p8, __aarch64_vgetq_lane_p16, __aarch64_vgetq_lane_s8, __aarch64_vgetq_lane_s16, __aarch64_vgetq_lane_s32, __aarch64_vgetq_lane_s64, __aarch64_vgetq_lane_u8, __aarch64_vgetq_lane_u16, __aarch64_vgetq_lane_u32, __aarch64_vgetq_lane_u64): Delete. (__aarch64_vdup_lane_any): Use __aarch64_vget_lane_any, remove 'q2' argument. (__aarch64_vdup_lane_f32, __aarch64_vdup_lane_f64, __aarch64_vdup_lane_p8, __aarch64_vdup_lane_p16, __aarch64_vdup_lane_s8, __aarch64_vdup_lane_s16, __aarch64_vdup_lane_s32, __aarch64_vdup_lane_s64, __aarch64_vdup_lane_u8, __aarch64_vdup_lane_u16, __aarch64_vdup_lane_u32, __aarch64_vdup_lane_u64, __aarch64_vdup_laneq_f32, __aarch64_vdup_laneq_f64, __aarch64_vdup_laneq_p8, __aarch64_vdup_laneq_p16, __aarch64_vdup_laneq_s8, __aarch64_vdup_laneq_s16, __aarch64_vdup_laneq_s32, __aarch64_vdup_laneq_s64, __aarch64_vdup_laneq_u8, __aarch64_vdup_laneq_u16, __aarch64_vdup_laneq_u32, __aarch64_vdup_laneq_u64): Remove argument to __aarch64_vdup_lane_any. 
(vget_lane_f32, vget_lane_f64, vget_lane_p8, vget_lane_p16, vget_lane_s8, vget_lane_s16, vget_lane_s32, vget_lane_s64, vget_lane_u8, vget_lane_u16, vget_lane_u32, vget_lane_u64, vgetq_lane_f32, vgetq_lane_f64, vgetq_lane_p8, vgetq_lane_p16, vgetq_lane_s8, vgetq_lane_s16, vgetq_lane_s32, vgetq_lane_s64, vgetq_lane_u8, vgetq_lane_u16, vgetq_lane_u32, vgetq_lane_u64, vdupb_lane_p8, vdupb_lane_s8, vdupb_lane_u8, vduph_lane_p16, vduph_lane_s16, vduph_lane_u16, vdups_lane_f32, vdups_lane_s32, vdups_lane_u32, vdupb_laneq_p8, vdupb_laneq_s8, vdupb_laneq_u8, vduph_laneq_p16, vduph_laneq_s16, vduph_laneq_u16, vdups_laneq_f32, vdups_laneq_s32, vdups_laneq_u32, vdupd_laneq_f64, vdupd_laneq_s64, vdupd_laneq_u64, vfmas_lane_f32, vfma_laneq_f64, vfmad_laneq_f64, vfmas_laneq_f32, vfmss_lane_f32, vfms_laneq_f64, vfmsd_laneq_f64, vfmss_laneq_f32, vmla_lane_f32, vmla_lane_s16, vmla_lane_s32, vmla_lane_u16, vmla_lane_u32, vmla_laneq_f32, vmla_laneq_s16, vmla_laneq_s32, vmla_laneq_u16, vmla_laneq_u32, vmlaq_lane_f32, vmlaq_lane_s16, vmlaq_lane_s32, vmlaq_lane_u16, vmlaq_lane_u32, vmlaq_laneq_f32, vmlaq_laneq_s16, vmlaq_laneq_s32, vmlaq_laneq_u16, vmlaq_laneq_u32, vmls_lane_f32, vmls_lane_s16, vmls_lane_s32, vmls_lane_u16, vmls_lane_u32, vmls_laneq_f32, vmls_laneq_s16, vmls_laneq_s32, vmls_laneq_u16, vmls_laneq_u32, vmlsq_lane_f32, vmlsq_lane_s16, vmlsq_lane_s32, vmlsq_lane_u16, vmlsq_lane_u32, vmlsq_laneq_f32, vmlsq_laneq_s16, vmlsq_laneq_s32, vmlsq_laneq_u16, vmlsq_laneq_u32, vmul_lane_f32, vmul_lane_s16, vmul_lane_s32, vmul_lane_u16, vmul_lane_u32, vmuld_lane_f64, vmuld_laneq_f64, vmuls_lane_f32, vmuls_laneq_f32, vmul_laneq_f32, vmul_laneq_f64, vmul_laneq_s16, vmul_laneq_s32, vmul_laneq_u16, vmul_laneq_u32, vmulq_lane_f32, vmulq_lane_s16, vmulq_lane_s32, vmulq_lane_u16, vmulq_lane_u32, vmulq_laneq_f32, vmulq_laneq_f64, vmulq_laneq_s16, vmulq_laneq_s32, vmulq_laneq_u16, vmulq_laneq_u32) : Use __aarch64_vget_lane_any. 
gcc/testsuite/: * gcc.target/aarch64/simd/vget_lane_f32_indices_1.c: New test. * gcc.target/aarch64/simd/vget_lane_f64_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_p16_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_p8_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_s16_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_s32_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_s64_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_s8_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_u16_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_u32_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_u64_indices_1.c: Likewise. * gcc.target/aarch64/simd/vget_lane_u8_ind
[Bug target/63950] [AArch64] ICE at -O0 on vld1_lane intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63950 alalaw01 at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #1 from alalaw01 at gcc dot gnu.org --- Author: alalaw01 Date: Tue Dec 9 19:37:18 2014 New Revision: 218531 URL: https://gcc.gnu.org/viewcvs?rev=218531&root=gcc&view=rev Log: [AArch64]Fix ICE at -O0 on vld1_lane intrinsics gcc/: * config/aarch64/arm_neon.h (__AARCH64_NUM_LANES, __aarch64_lane *2): New. (aarch64_vset_lane_any): Redefine using previous, same for BE + LE. (vset_lane_f32, vset_lane_f64, vset_lane_p8, vset_lane_p16, vset_lane_s8, vset_lane_s16, vset_lane_s32, vset_lane_s64, vset_lane_u8, vset_lane_u16, vset_lane_u32, vset_lane_u64): Remove number of lanes. (vld1_lane_f32, vld1_lane_f64, vld1_lane_p8, vld1_lane_p16, vld1_lane_s8, vld1_lane_s16, vld1_lane_s32, vld1_lane_s64, vld1_lane_u8, vld1_lane_u16, vld1_lane_u32, vld1_lane_u64): Call __aarch64_vset_lane_any rather than vset_lane_xxx. gcc/testsuite/: * gcc.target/aarch64/vld1_lane-o0.c: New test.
[Bug target/63870] [Aarch64] [ARM] Errors in use of NEON intrinsics are reported incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63870

alalaw01 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alalaw01 at gcc dot gnu.org

--- Comment #4 from alalaw01 at gcc dot gnu.org ---
(Apologies for out-of-orderness, I missed PRs from logs so adding by hand)

Author: alalaw01
Date: Tue Dec 9 19:52:22 2014
Revision: 218532
https://gcc.gnu.org/viewcvs?rev=218532&root=gcc&view=rev
Log:
[AArch64] Fix ICE on non-constant indices to __builtin_aarch64_im_lane_boundsi

gcc/:
        * config/aarch64/aarch64-builtins.c (aarch64_types_binopv_qualifiers,
        TYPES_BINOPV): Delete.
        (enum aarch64_builtins): Add AARCH64_BUILTIN_SIMD_LANE_CHECK and
        AARCH64_SIMD_PATTERN_START.
        (aarch64_init_simd_builtins): Register
        __builtin_aarch64_im_lane_boundsi; use AARCH64_SIMD_PATTERN_START.
        (aarch64_simd_expand_builtin): Handle AARCH64_BUILTIN_LANE_CHECK; use
        AARCH64_SIMD_PATTERN_START.
        * config/aarch64/aarch64-simd.md (aarch64_im_lane_boundsi): Delete.
        * config/aarch64/aarch64-simd-builtins.def (im_lane_bound): Delete.
        * config/aarch64/arm_neon.h (__AARCH64_LANE_CHECK): New.
        (__aarch64_vget_lane_f64, __aarch64_vget_lane_s64,
        __aarch64_vget_lane_u64, __aarch64_vset_lane_any, vdupd_lane_f64,
        vdupd_lane_s64, vdupd_lane_u64, vext_f32, vext_f64, vext_p8, vext_p16,
        vext_s8, vext_s16, vext_s32, vext_s64, vext_u8, vext_u16, vext_u32,
        vext_u64, vextq_f32, vextq_f64, vextq_p8, vextq_p16, vextq_s8,
        vextq_s16, vextq_s32, vextq_s64, vextq_u8, vextq_u16, vextq_u32,
        vextq_u64, vmulq_lane_f64): Use __AARCH64_LANE_CHECK.

gcc/testsuite/:
        * gcc.target/aarch64/simd/vset_lane_s16_const_1.c: New test.
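
For readers unfamiliar with the idea behind a lane check, the following is a minimal portable sketch, not the arm_neon.h code. It assumes the lane count can be derived from the vector object itself (total size divided by element size), so callers no longer need to pass it. All names here (SKETCH_NUM_LANES, SKETCH_LANE_IN_RANGE, sketch_int16x4, sketch_set_lane_s16) are hypothetical, and the check is done at run time with assert, whereas the log above registers a builtin so that the real check happens in the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
   definitions used in arm_neon.h.  */

/* Number of lanes, derived from the object rather than passed in.  */
#define SKETCH_NUM_LANES(v) (sizeof (v) / sizeof ((v)[0]))

/* Nonzero when idx names a valid lane of v.  */
#define SKETCH_LANE_IN_RANGE(v, idx) \
  ((idx) >= 0 && (size_t) (idx) < SKETCH_NUM_LANES (v))

/* Plain-array model of a 64-bit vector holding four 16-bit lanes.  */
typedef struct { int16_t lane[4]; } sketch_int16x4;

/* Return a copy of vec with one lane replaced.  The index is validated
   against the derived lane count before the store.  */
static sketch_int16x4
sketch_set_lane_s16 (int16_t value, sketch_int16x4 vec, int idx)
{
  assert (SKETCH_LANE_IN_RANGE (vec.lane, idx));
  vec.lane[idx] = value;
  return vec;
}
```

Deriving the count from the operand keeps every caller consistent with the vector type it actually uses, which is the property the "Remove number of lanes" entries in the earlier log appear to rely on.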
[Bug target/59843] ICE with return of generic vector on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59843

--- Comment #6 from alalaw01 at gcc dot gnu.org ---
Author: alalaw01
Date: Tue Jul 8 10:32:57 2014
New Revision: 212355

URL: https://gcc.gnu.org/viewcvs?rev=212355&root=gcc&view=rev
Log:
Backport r211502: PR target/59843 Fix arm_neon.h ZIP/UZP/TRN for bigendian

2014-06-10 Alan Lawrence

gcc/:
        * config/aarch64/aarch64-modes.def: Add V1DFmode.
        * config/aarch64/aarch64.c (aarch64_vector_mode_supported_p):
        Support V1DFmode.

gcc/testsuite/:
        * gcc.dg/vect/vect-singleton_1.c: New file.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/vect-singleton_1.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/config/aarch64/aarch64-modes.def
    branches/gcc-4_9-branch/gcc/config/aarch64/aarch64.c
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
[Bug tree-optimization/68681] New: testcase gcc.dg/vect/pr45752.c fails on AArch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68681

            Bug ID: 68681
           Summary: testcase gcc.dg/vect/pr45752.c fails on AArch64
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Created attachment 36900
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36900&action=edit
tree-vect-details dump

Since r231015 (https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03371.html), on AArch64 we have

FAIL: gcc.dg/vect/pr45752.c scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0
FAIL: gcc.dg/vect/pr45752.c -flto -ffat-lto-objects scan-tree-dump-times vect "gaps requires scalar epilogue loop" 0

I attach -fdump-tree-vect-details from the non-lto case (line 5379: gcc/testsuite/gcc.dg/vect/pr45752.c:45:3: note: Data access with gaps requires scalar epilogue loop)
[Bug tree-optimization/68707] New: testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

            Bug ID: 68707
           Summary: testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alalaw01 at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64, arm

Created attachment 36928
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36928&action=edit
tree-vect-details dump (before patch, with LOAD_LANES)

Prior to r230993, O3-pr36098.c (at -O3) was vectorized using a LOAD_LANES / STORE_LANES, resulting in:

.L5:
        ld4     {v4.4s - v7.4s}, [x7], 64
        add     w4, w4, 1
        cmp     w3, w4
        orr     v1.16b, v4.16b, v4.16b
        orr     v2.16b, v5.16b, v5.16b
        orr     v3.16b, v6.16b, v6.16b
        st3     {v1.4s - v3.4s}, [x6], 48
        bhi     .L5

each iteration of the outer loop processes a struct of 4 ints, of which the first 3 are copied to a destination. The ld4 nicely gets us four structs with all the elements we want in three registers row-wise (and the elements we don't want in a fourth):

struct1 struct2 struct3 struct4
v4.s[0] v4.s[1] v4.s[2] v4.s[3]
v5.s[0] v5.s[1] v5.s[2] v5.s[3]
v6.s[0] v6.s[1] v6.s[2] v6.s[3]
v7.s[0] v7.s[1] v7.s[2] v7.s[3]

and st3 stores the desired rows (only) to the right locations.
Following r230993, instead the loop gets unrolled four times, four vectors are loaded sequentially, and then permuted by SLP:

.L5:
        ldr     q0, [x5, 16]
        add     x4, x4, 48
        ldr     q1, [x5, 32]
        add     w6, w6, 1
        ldr     q4, [x5, 48]
        cmp     w3, w6
        ldr     q2, [x5], 64
        orr     v3.16b, v0.16b, v0.16b
        orr     v5.16b, v4.16b, v4.16b
        orr     v4.16b, v1.16b, v1.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        str     q0, [x4, -32]
        str     q2, [x4, -48]
        str     q4, [x4, -16]
        bhi     .L5

that is, we load

struct1 struct2 struct3 struct4
v2.s[0] v0.s[0] v1.s[0] v4.s[0]
v2.s[1] v0.s[1] v1.s[1] v4.s[1]
v2.s[2] v0.s[2] v1.s[2] v4.s[2]
v2.s[3] v0.s[3] v1.s[3] v4.s[3]

and then permute

struct1 struct2 struct3 struct4
v2.s[0] v2.s[3] v0.s[2] v4.s[1]
v2.s[1] v0.s[0] v0.s[3] v4.s[2]
v2.s[2] v0.s[1] v4.s[0] v4.s[3]

so we then have the data 'columnwise' and store each sequentially.
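
As a rough illustration of what both code sequences compute, here is a portable C sketch. It is not the O3-pr36098.c source: the struct layout and the names struct quad, copy3_scalar and copy3_lanes are assumptions made for this example. copy3_scalar is the plain loop (copy the first three of four ints per struct); copy3_lanes models the ld4/st3 form by gathering four structs into four "rows" (one per field, standing in for v4 to v7) and then writing back only the first three rows interleaved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type: four ints, of which the first three are
   copied.  Not taken from the actual testcase.  */
struct quad { int a, b, c, d; };

/* Scalar reference: copy fields a, b, c of each struct to dst.  */
static void
copy3_scalar (int *dst, const struct quad *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      dst[3 * i + 0] = src[i].a;
      dst[3 * i + 1] = src[i].b;
      dst[3 * i + 2] = src[i].c;
    }
}

/* Model of the ld4/st3 form: four structs per step.  n must be a
   multiple of four in this sketch (no epilogue loop).  */
static void
copy3_lanes (int *dst, const struct quad *src, size_t n)
{
  assert (n % 4 == 0);
  for (size_t i = 0; i < n; i += 4)
    {
      /* row[f][s] holds field f of struct s, like lane s of v4..v7.  */
      int row[4][4];
      for (size_t s = 0; s < 4; s++)
        {
          row[0][s] = src[i + s].a;
          row[1][s] = src[i + s].b;
          row[2][s] = src[i + s].c;
          row[3][s] = src[i + s].d;  /* loaded but never stored */
        }
      /* st3-like step: interleave rows 0..2 back into memory.  */
      for (size_t s = 0; s < 4; s++)
        for (size_t f = 0; f < 3; f++)
          dst[3 * (i + s) + f] = row[f][s];
    }
}
```

Both functions should fill dst identically for any n that is a multiple of four; the difference the report is about lies only in how the vectorizer arranges the data between the loads and the stores.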
[Bug tree-optimization/68707] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

--- Comment #1 from alalaw01 at gcc dot gnu.org ---
Created attachment 36929
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36929&action=edit
tree-vect-details dump (after patch, with SLP)