[Bug middle-end/69526] New: ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

            Bug ID: 69526
           Summary: ivopts candidate strangeness
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at linux dot vnet.ibm.com
  Target Milestone: ---
            Target: s390, x86

While inspecting loop generation code on s390, I saw ivopts choose an IV candidate that does something peculiar. On x86 the same candidate is never chosen because of different cost estimations, yet the behavior can be provoked on x86 by "forcing" GCC to use the same candidate as on s390. I did this to confirm my suspicion and will show the x86 version here.

This source

void v(unsigned long *in, unsigned long *out, unsigned int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      out[i] = in[i];
    }
}

results in the following assembly when ivopts candidate 7 is used:

v:
        testl   %edx, %edx
        je      .L1
        leal    -1(%rdx), %eax
        leaq    8(,%rax,8), %rcx
        xorl    %eax, %eax
.L3:
        movq    (%rdi,%rax), %rdx
        movq    %rdx, (%rsi,%rax)
        addq    $8, %rax
        cmpq    %rcx, %rax
        jne     .L3
.L1:
        rep ret

Should the following be happening?

        leal    -1(%rdx), %eax
        leaq    8(,%rax,8), %rcx

i.e.

        %eax = n - 1
        %rcx = 8 * (%eax + 1) = 8 * n

The pattern can already be observed in ivopts' GIMPLE:

  _15 = n_5(D) + 4294967295;
  _2 = (sizetype) _15;
  _1 = _2 + 1;
  _24 = _1 * 8;

Why do we need the - 1 and the subsequent + 1 when %eax is zeroed afterwards anyway? Granted, this exact situation will never be observed on x86 since another ivopts candidate is chosen, but on s390 it amounts to three instructions.

If I see it correctly, the n - 1 comes from estimating the number of loop iterations, while the + 1 is then correctly added by cand_value_at() because the loop counter is incremented before the exit test. Perhaps this is intended behavior and there is nothing wrong with it?

Regards
 Robin
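For reference, the bound computation above boils down to the following sketch (illustration only, not GCC output; the function names are made up). Both variants compute the same bound whenever n != 0, and the n == 0 case is already handled by the testl/je before the loop:

/* What the candidate emits, schematically.  */
unsigned long bound_as_emitted (unsigned int n)
{
  unsigned int niter = n - 1;              /* leal  -1(%rdx), %eax   */
  return ((unsigned long) niter + 1) * 8;  /* leaq  8(,%rax,8), %rcx */
}

/* What one would expect instead.  */
unsigned long bound_expected (unsigned int n)
{
  return (unsigned long) n * 8;            /* fine because n != 0 is checked first */
}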
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #2 from rdapp at linux dot vnet.ibm.com ---
OK, that's sensible, but why is the - 1 necessary in the first place? n_5 - 1 can only underflow if n_5 == 0, which is ruled out by the testl %edx, %edx before the loop. That test is in a previous basic block, unfortunately; is it simply not taken into account by ivopts (because, strictly speaking, it has no influence on the induction variable)?
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
I still don't quite get why the "n - 1" is needed. Do we need it to allow an exit condition like

  if (i != n - 1)

or

  if (i <= n - 1)?

Am I missing something really obvious here?
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #7 from rdapp at linux dot vnet.ibm.com ---
(In reply to amker from comment #6)
> It comes from loop niter analysis, as in may_eliminate_iv, we have:
>
> (gdb) call debug_generic_expr(desc->niter)
> n_5(D) + 4294967295

And this is correct, i.e. the number of iterations is n - 1? I'd naively expect

  desc->niter = n_5(D)

(Again, it might be necessary for some reason that escapes me at the moment.)
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #10 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38144
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38144&action=edit
Tentative patch for VRP and loop-doloop

Meanwhile I found the time to implement a pattern for VRP that seems to do the job at the tree level. Although the patch is larger now (and not many cases are covered yet), I agree that it fits better into VRP than into ivopts and might even catch more cases than before.

The fix at the RTL level is still necessary, since doloop computes the number of iterations independently of the preceding code and also adds a + 1. I cleaned up that fix a little by introducing niterp1_expr (niter plus 1), but there is still room for improvement.

No regressions so far; ideas for improvement are welcome.
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #13 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38535&action=edit
VRP/match.pd patch

Found some time again and hacked together a fix for match.pd and VRP. The patch exports a VRP function no_range_overflow that is called from match.pd via tree.c. I realize that VRP has not exported any functions so far, so if there is a better or more idiomatic way to do this, please tell me. Further improvements or "code smell" hints are very welcome.

Currently this should fix cases like

long foo(int i)
{
  return (long) (i + 1) - 2;
}

Commutativity of a PLUS_EXPR is not yet exploited, and I'm relying on TYPE_OVERFLOW_UNDEFINED for now. The overflow checking could also be done more elegantly, I suppose.

This diff is extracted from the full patch, which also includes the RTL-level fix; I haven't included that part here since it has not changed. I hope the splitting didn't introduce new bugs, but there are no regressions on s390x with the full patch.
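For illustration, this is the kind of fold the pattern is meant to enable (a sketch, not taken from the patch; it relies on signed overflow being undefined, so i + 1 cannot wrap):

/* Before: add in int, widen, then subtract in long.  */
long foo_before (int i)
{
  return (long) (i + 1) - 2;
}

/* After: the conversion is moved past the addition and the two constants
   are combined in the wider type, ((long) i + 1) - 2 == (long) i - 1.  */
long foo_after (int i)
{
  return (long) i - 1;
}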
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #15 from rdapp at linux dot vnet.ibm.com ---
Thanks for the suggestions. The omission of the inner op was actually more or less on purpose, as I intended to capture the (convert @inner) in order to access @inner's value range as a whole rather than recomputing what VRP already figured out. Is there a simple way to access @inner when capturing

  (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
                     ^---------------------------------^
                                  @inner

or, even better, both @inner as well as @0 and @1 at the same time (apart from looking through use stmts)?

In my case, VRP determines @0's range as ~[0,0] and @inner's as [0,INT_MAX-1]. VRP probably canonicalized the anti-range to a normal range and performed other simplifications in order to arrive at [0,INT_MAX-1]. If I cannot get @inner's VRP info with the capture above, is there another way to obtain it?

The TREE_OVERFLOW/const_binop code is copied from the (A +- CST) +- CST -> A + CST pattern, and the result of const_binop is used in the final simplification. The overflow check could of course be done via wi::add/sub, but wouldn't I just be replicating what int_const_binop_1 does on top of it, when I need the result anyway?
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #17 from rdapp at linux dot vnet.ibm.com ---
(In reply to Marc Glisse from comment #16)
> (In reply to rdapp from comment #15)
> > Is there a simple method to access @inner when capturing
> > (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
> > or, even better, both @inner as well as @0 and @1 at the same time?
> > (Apart from looking through use stmts)
>
> (outer_op (convert (inner_op@3 SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
>
> Does @3 do what you want here? (I didn't follow the discussion)

Looks good, thanks :)
[Bug tree-optimization/80925] [8 Regression] vect peeling failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #3 from rdapp at linux dot vnet.ibm.com ---
Strange, my tests didn't show new failures on Power7. I'll have a look; perhaps the build settings were wrong.
[Bug tree-optimization/80925] [8 Regression] vect peeling failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
I quickly built trunk without bootstrap on power7 BE (--enable-languages="c,c++,fortran" --disable-multilib --disable-bootstrap) and still get no new FAILs. Do I need other build parameters? Meanwhile I got access to a power8 LE machine and will check there.

r248678 definitely introduced FAILs, but r248680 should actually get rid of them. I can't confirm this for every one of them, but [6/6] was made specifically to address FAILs on Power. Are all of them still present after r248680?
[Bug tree-optimization/80925] [8 Regression] vect peeling failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #7 from rdapp at linux dot vnet.ibm.com ---
I could reproduce the FAILs on a power8 machine now. Looking at the vect-28.c FAIL: the loop to be vectorized is

  for (i = 0; i < N; i++)
    {
      ia[i+off] = 5;
    }

It still gets vectorized but is no longer peeled, because the costs for not peeling equal the costs for peeling (for unknown alignment). The cost of an unaligned store is the same (1) as that of a regular store, so this is to be expected. At first sight the situation is similar for vect-87.c, vect-88.c and probably most of the FAILs matching '"Vectorizing an unaligned access" 0'.

How should we deal with this? If the cost function is correct as it is and unaligned stores are not slower at all, I don't think we should be peeling. What is expected for real workloads with unaligned loads/stores?
[Bug tree-optimization/80925] [8 Regression] vect peeling failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #9 from rdapp at linux dot vnet.ibm.com ---
I built --with-cpu=power7 and still see TARGET_EFFICIENT_UNALIGNED_VSX == true in the backend, which causes unaligned stores to have a cost of 1. On my power7 system, TARGET_EFFICIENT_UNALIGNED_VSX is never true.

I stepped through rs6000_override_internal: TARGET_EFFICIENT_UNALIGNED_VSX is properly unset via the options implied by --with-cpu=power7 (or power6). Afterwards, however, it is overwritten again by TARGET_P8_VECTOR, which in turn is set by the vect test suite depending on the CPU it runs on:

  [...]
  } elseif [check_p8vector_hw_available] {
      lappend DEFAULT_VECTCFLAGS "-mpower8-vector"

Therefore, whenever the vect tests are run on a power8 CPU, TARGET_EFFICIENT_UNALIGNED_VSX == 1, no matter the --with-cpu. This would also explain why I didn't see the FAILs on my machine: there, all vect tests are only run with -maltivec, which does not override TARGET_EFFICIENT_UNALIGNED_VSX.

So, the way the vect test suite is currently set up, this is kind of expected.
[Bug tree-optimization/80925] [8 Regression] vect peeling failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #16 from rdapp at linux dot vnet.ibm.com ---
(In reply to seurer from comment #14)
> spawn -ignore SIGHUP /home/seurer/gcc/build/gcc-test/gcc/xgcc
> -B/home/seurer/gcc/build/gcc-test/gcc/
> /home/seurer/gcc/gcc-test/gcc/testsuite/gcc.dg/vect/vect-33-big-array.c
> -fno-diagnostics-show-caret -fdiagnostics-color=never -maltivec
> -mpower8-vector -ftree-vectorize -fno-vect-cost-model -fno-common -O2
> -fdump-tree-vect-details -S -o vect-33-big-array.s
> PASS: gcc.dg/vect/vect-33-big-array.c (test for excess errors)
> PASS: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "vectorized
> 1 loops" 1
> FAIL: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "Vectorizing
> an unaligned access" 0
> FAIL: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "Alignment
> of access forced using peeling" 1
> Executing on host: /home/seurer/gcc/build/gcc-test/gcc/xgcc
> -B/home/seurer/gcc/build/gcc-test/gcc/
> /home/seurer/gcc/gcc-test/gcc/testsuite/gcc.dg/vect/vect-33-big-array.c
> -fno-diagnostics-show-caret -fdiagnostics-color=never -flto
> -ffat-lto-objects -maltivec -mpower8-vector -ftree-vectorize
> -fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -S -o
> vect-33-big-array.s(timeout = 300)

There is still -mpower8-vector in the compile options, which makes unaligned stores inexpensive. Is this really run on a power6 CPU?

In order to bulk-disable tests that rely on peeling, would something like a global check (e.g. target_vect_unaligned, akin to target_vect_int etc.) make sense? It could be used to flag and disable specific tests depending on the CPU the vect suite is run on.
[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #1 from rdapp at linux dot vnet.ibm.com ---
Could you provide the vectorizer dump (-fdump-tree-vect-details)? The generated assembly would also be interesting, as well as the exact command line used for building (from the test suite logs).

I compiled --with-cpu-64=power4 yet cannot reproduce the FAIL. For me, the compile options set by the test suite are

  -maltivec -mvsx -mno-allow-movmisalign -ftree-vectorize -fno-vect-cost-model
  -fno-common -O2 -fdump-tree-vect-details
  --param vect-max-version-for-alias-checks=0 -lm -m32
[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
Ah, npeel is set by vect_peeling_hash_get_lowest_cost although the corresponding dr is not used afterwards. It should be safe to get rid of the npeel parameter since we use the returned peeling's npeel anyway. I think the same is true for body_cost_vec, but it is not used afterwards, so it doesn't cause problems.

The following fixes the regression for me:

index 5103ba1..257be41 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1347,7 +1347,6 @@ vect_peeling_hash_get_lowest_cost (_vect_peel_info **slot,
 static struct _vect_peel_extended_info
 vect_peeling_hash_choose_best_peeling (hash_table *peeling_htab,
                                        loop_vec_info loop_vinfo,
-                                       unsigned int *npeel,
                                        stmt_vector_for_cost *body_cost_vec)
 {
   struct _vect_peel_extended_info res;
@@ -1371,7 +1370,6 @@ vect_peeling_hash_choose_best_peeling (hash_table *peeling_htab,
       res.outside_cost = 0;
     }

-  *npeel = res.peel_info.npeel;
   *body_cost_vec = res.body_cost_vec;
   return res;
 }
@@ -1812,7 +1810,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
          unless aligned.  So we try to choose the best possible peeling from
          the hash table.  */
       peel_for_known_alignment = vect_peeling_hash_choose_best_peeling
-        (&peeling_htab, loop_vinfo, &npeel, &body_cost_vec);
+        (&peeling_htab, loop_vinfo, &body_cost_vec);
     }

   /* Compare costs of peeling for known and unknown alignment.  */
[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #6 from rdapp at linux dot vnet.ibm.com ---
Created attachment 41715
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41715&action=edit
Tentative patch

Removed the npeel function argument and also removed body_cost_vec along with the corresponding release () calls. I'm not so sure about the semantics of vecs in general: are we leaking memory by not releasing the body_cost_vecs contained in the various _vect_peel_extended_infos? If body_cost_vec were RAII-like, there would have been no need to release () it before, so I assume it is not.
[Bug libgomp/78468] [7 regression] libgomp.c/reduction-10.c and many more FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468

rdapp at linux dot vnet.ibm.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rdapp at linux dot vnet.ibm.com

--- Comment #12 from rdapp at linux dot vnet.ibm.com ---
(Writing from a colleague's machine ...)

Thinking about it, I'm quite sure the dynamic allocation must come from here:

> #pragma omp parallel for reduction(*:y[:p4]) reduction(|:a[:p5])
                                          ^^                   ^^

I.e. the bug should go away if you replace either p4 or p5 or both with their numeric values.

Dominik ^_^ ^_^
[Bug tree-optimization/77283] [7 Regression] Revision 238005 disables loop unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77283

--- Comment #11 from rdapp at linux dot vnet.ibm.com ---
Any progress on this? The patch fixes the problem for s390x (no performance regressions), but without it we see the regression in SPEC2006's libquantum all the time; I guess the same is true for PowerPC? Any chance of it going into 7.0?

For s390x, disabling the path-splitting pass does not introduce performance regressions either and fixes libquantum, but that would only be a very last resort.
[Bug tree-optimization/77283] [7 Regression] Revision 238005 disables loop unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77283

--- Comment #15 from rdapp at linux dot vnet.ibm.com ---
The updated patch fixes libquantum on s390, so PR77366 might indeed be too simplified to check for that, but it was unrolled before r238005. Addressing libquantum is more important, of course.
[Bug middle-end/83069] [8 Regression] internal compiler error: in from_gcov_type, at profile-count.h:676
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83069

rdapp at linux dot vnet.ibm.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rdapp at linux dot vnet.ibm.com

--- Comment #11 from rdapp at linux dot vnet.ibm.com ---
416.gamess fails on S390 as well since r254888. I didn't immediately understand what the

  if (freq_max < 16)
    freq_max = 16;

part of the patch is supposed to achieve. When freq_max < 16, tmp will later be larger than BLOCK_INFO (bb)->frequency, which wasn't the case before. In the worst case we approach max_count by 4 bits every time. Why do we need a minimum freq_max? To ensure some scaling even for low frequencies?
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #19 from rdapp at linux dot vnet.ibm.com ---
(In reply to rguent...@suse.de from comment #18)
> The match.pd patch is slowly assuming shape...

Two things, however, I'm still unsure about:

1. An existing rule in match.pd checks for overflow of the combined constant in (a +- CST1) +- CST2 and does not perform the simplification when an overflow occurs. My case looks quite similar, except that CST1 and CST2 are/might be unsigned ints/longs. I assume this is because non-overflow cannot be proved by the C front end, which therefore internally uses an unsigned representation?

  (cast) (a +- CST1) +- CST2

We can prove the non-overflow of (a +- CST1) via VRP, but an overflow check of (CST1 +- CST2) would fail since ((UINT_MAX - 1) + 1) does overflow. So, in general, should we care about overflow of the combined operation at all after having established that the inner operation does not overflow? If the overflow is well-defined and we overflow, the result is still valid. If the overflow is undefined, we may do anything, in particular perform this optimization, which wouldn't even be too unexpected from a user's perspective. Wouldn't any overflow in the combined constant occur anyway, even without combining? I think there are only two cases to discern, regardless of overflow:

- abs(CST1 +- CST2) < abs(CST1): we can simplify to (cast)(a +- (CST1 +- CST2))
- else: we can simplify to (cast)(a) +- (CST1 +- CST2)

2. Is there an idiomatic/correct way to check a VR_RANGE for overflow? Does it suffice to check whether the range includes +-INF or +-INF(OVF)? I suspect other, naive methods like checking whether min < max will fail, since the ranges are canonicalized.
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

rdapp at linux dot vnet.ibm.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|6.0                         |unknown

--- Comment #21 from rdapp at linux dot vnet.ibm.com ---
(In reply to amker from comment #20)
> IIUC we can simply handle signed/unsigned type differently. Given a has
> int/uint type.
> For unsigned case: (unsigned long long)(a + cst1) + cst2 and a is unsigned int.
> It can be simplified into (unsigned long long)a + cst3 iff a + cst1 doesn't
> overflow/wrap. Since unsigned type can wrap, simplify (unsigned long
> long)cst1 + cst2 into cst3 is always safe.
> For signed case: (long long)(a + cst) + cst2 and a is signed int.
> It can be simplified into (long long)a + cst3 iff (long long)cst1 + cst2
> doesn't overflow. We don't need to prove (a+cst) doesn't overflow.

OK, so the stipulation seems to be: assume no signed overflow if the operation was already present, but don't introduce a potentially new overflow by combining constants that previously would not have overflowed. Makes sense.

Your cases can be improved a little, as Richard also pointed out: when abs(cst3) < abs(cst1), the combined operation can be done in the inner type (and since the signs match, the overflow proof still holds).
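To make the inner-type/outer-type distinction concrete, here is a sketch (made-up functions, not from the patch), assuming VRP has already proved that the inner addition does not wrap:

/* abs(10 - 7) < abs(10): the combined constant shrinks, so a + 3 cannot
   wrap either and the addition can stay in the inner (unsigned int) type.  */
unsigned long long g1 (unsigned int a)       /* requires: a + 10 does not wrap */
{
  return (unsigned long long) (a + 10) - 7;  /* -> (unsigned long long) (a + 3) */
}

/* abs(7 + 10) > abs(7): a + 17 might wrap even though a + 7 does not, so
   the combined constant has to be added in the wider type.  */
unsigned long long g2 (unsigned int a)       /* requires: a + 7 does not wrap */
{
  return (unsigned long long) (a + 7) + 10;  /* -> (unsigned long long) a + 17 */
}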
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #24 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38775
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38775&action=edit
Updated patch and test, match.pd/VRP

(In reply to amker from comment #23)
> Hmm, not sure if I mis-understood your point.
> I think there is no *overflow* concept for a VR_RANGE (or a variable),
> overflow is always a semantic concept for an (arithmetic) operation. Take
> below loop as an example:
>
> unsigned int i;
> for (i = 4; i != 2; i += 2)
>   {
>     //
>   }
>
> Variable i has VR_RANGE [0, MAX_VALUE - 1], but it does overflow in the loop.

Yes, a variable by itself cannot overflow, only an operation can. VRP, for example, has a function extract_range_from_binary_expr_1 that performs some overflow checking internally. It does so by comparing the minima and maxima of the ranges of both operands. I assumed an overflow there would somehow be propagated into the resulting ranges, hence the question of how to access the cast expression in match.pd. I guess I was thinking too complicated; getting the ranges of both operands and checking that their common minimum is smaller than the maximum should suffice for now.

Attached is the current version of the patch; no regressions on s390x and x86-64. The test could be made a bit more specific, I suppose; currently it counts the number of "gimple_simplified" occurrences.
[Bug middle-end/69526] ivopts candidate strangeness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #25 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38875
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38875&action=edit
Updated patch and test, match.pd/VRP

Some cleanups and fixes; the patch can now handle more cases in the inner type. Also added more test cases.
[Bug middle-end/77366] New: Rev. 2ac4967 prevents loop unrolling for s390
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77366

            Bug ID: 77366
           Summary: Rev. 2ac4967 prevents loop unrolling for s390
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at linux dot vnet.ibm.com
                CC: krebbel at gcc dot gnu.org
  Target Milestone: ---
              Host: s390
            Target: s390

Since 2ac4967f49a70e1bd0bb28a142324f527dac3743 the following loop is no longer unrolled (-O3 -funroll-loops):

void foo(unsigned int size, unsigned int *state)
{
  unsigned int i;

  for (i = 0; i < size; i++)
    {
      if (*state & 1)
        {
          *state ^= 1;
        }
    }
}

The revision adds an additional check that allows the split-paths pass to duplicate a basic block (it returned NULL before the patch), which in turn causes check_simple_exit() in loop-iv.c to not find a proper niter_desc. This seems to be due to

  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, exit_bb))

failing, i.e. check_simple_exit() does not see that the exit condition is checked in every iteration.