[Bug middle-end/69526] New: ivopts candidate strangeness

2016-01-28 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

Bug ID: 69526
   Summary: ivopts candidate strangeness
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at linux dot vnet.ibm.com
  Target Milestone: ---
Target: s390, x86

While inspecting loop generation code on s390, I saw ivopts choose an IV
candidate that does something peculiar. On x86 the same candidate is never
chosen because of different cost estimations, yet the behavior can be
reproduced on x86 by "forcing" GCC to use the same candidate as on s390. I did
this to confirm my suspicions and will show the x86 version here:

This source

void v(unsigned long *in, unsigned long *out, unsigned int n)
{
  int i;

  for (i = 0; i < n; i++)
  {
out[i] = in[i];
  }
}

results in the following assembly when ivopts candidate 7 is used:

v:
  testl %edx, %edx
  je .L1
  leal -1(%rdx), %eax
  leaq 8(,%rax,8), %rcx
  xorl %eax, %eax
.L3:
  movq (%rdi,%rax), %rdx
  movq %rdx, (%rsi,%rax)
  addq $8, %rax
  cmpq %rcx, %rax
  jne .L3
.L1:
  rep ret

Should the following be happening?

  leal -1(%rdx), %eax
  leaq 8(,%rax,8), %rcx

i.e. %eax = n - 1
 %rcx = 8 * (%eax + 1) = 8 * n

The pattern can already be observed in ivopts' GIMPLE:

<bb>:
_15 = n_5(D) + 4294967295;
_2 = (sizetype) _15;
_1 = _2 + 1;
_24 = _1 * 8;

Why do we need the - 1 and subsequent + 1 when %eax is zeroed afterwards
anyway? Granted, this exact situation won't ever be observed on x86 since
another ivopts candidate is chosen there, but on s390 it amounts to three
instructions.

If I see it correctly, the n - 1 comes from estimating the number of loop
iterations, while the +1 is then correctly added by cand_value_at() because the
loop counter is incremented before the exit test. Perhaps this is intended
behavior and there is nothing wrong with it?

Regards
 Robin

[Bug middle-end/69526] ivopts candidate strangeness

2016-01-28 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #2 from rdapp at linux dot vnet.ibm.com ---
Ok, that's sensible but why is the - 1 necessary in the first place?

n_5 - 1 can only underflow if n_5 == 0, which is checked beforehand by

testl %edx, %edx

Unfortunately, this check is in a previous basic block and is therefore not
taken into account by ivopts (because, strictly speaking, it has no influence
on the induction variable)?

[Bug middle-end/69526] ivopts candidate strangeness

2016-01-28 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
I still don't quite get why the "n - 1" is needed. Do we need it so that we can
have an exit condition like

 if (i != n-1) or 
 if (i <= n-1)?

Am I missing something really obvious here?

[Bug middle-end/69526] ivopts candidate strangeness

2016-01-28 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #7 from rdapp at linux dot vnet.ibm.com ---
(In reply to amker from comment #6)

> It comes from loop niter analysis, as in may_eliminate_iv, we have:
> 
> (gdb) call debug_generic_expr(desc->niter)
> n_5(D) + 4294967295

and this is correct? I.e. the number of iterations is n - 1? I'd naively expect
  desc->niter = n_5(D)

(Again, it might be necessary for some reason that escapes me currently.)

[Bug middle-end/69526] ivopts candidate strangeness

2016-03-31 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #10 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38144
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38144&action=edit
Tentative patch for VRP and loop-doloop

Meanwhile I found the time to implement a pattern for VRP which seems to do the
job on tree level. Although the patch is larger now (and not many cases are
covered yet), I agree that it fits better in VRP than in ivopts and might even
catch more cases than before.

The fix on the RTL level is nonetheless necessary since doloop calculates the
number of iterations independently of the preceding code and also adds a +1.
I cleaned up the fix a little by introducing niterp1_expr (niter plus 1), but
there's still room for improvement.

No regressions so far, improvement ideas welcome.

[Bug middle-end/69526] ivopts candidate strangeness

2016-05-20 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #13 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38535&action=edit
VRP/match.pd patch

Found some time again and hacked together a fix for match.pd and VRP. The patch
exports a VRP function no_range_overflow that is called from match.pd via
tree.c. I realize that VRP hasn't exported any functions so far, so if there's
a better/more idiomatic way to do it, please tell me. Further improvements or
"code smell" hints are very welcome.

Currently this should fix cases like

long foo(int i)
{
  return (long) (i + 1) - 2;
}

Commutativity of a PLUS_EXPR is not yet exploited, and I'm relying on
TYPE_OVERFLOW_UNDEFINED so far. The overflow checking could also be done more
elegantly, I suppose.

This diff is extracted from the full patch, which also includes the RTL-level
fix; I haven't included that here since it has not changed. I hope the
splitting didn't introduce new bugs, but there are no regressions on s390x for
the full patch.

[Bug middle-end/69526] ivopts candidate strangeness

2016-06-07 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #15 from rdapp at linux dot vnet.ibm.com ---
Thanks for the suggestions. The omission of the inner op was actually more or
less on purpose, as I intended to capture the
(convert @inner)
in order to access @inner's value range as a whole rather than re-calculate
what VRP already figured out.

Is there a simple method to access @inner when
capturing

(outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2)
                   ^----------------------------------^
                                 @inner

or, even better, both @inner as well as @0 and @1 at the same time? (Apart
from looking through use stmts)

In my case VRP determines @0's range as ~[0,0] and @inner's as [0,INT_MAX-1].
VRP probably canonicalized the anti-range to a normal range and performed other
simplifications in order to arrive at [0,INT_MAX-1]. If I cannot get @inner's
VRP info with the capture above, would there be another way to obtain it?

The TREE_OVERFLOW/const_binop code is copied from the (A +- CST) +- CST -> A +
CST pattern and the result of const_binop is used in the final simplification.
The overflow check could of course be done via wi::add/sub but wouldn't I just
replicate what int_const_binop_1 is doing on top of it when I need the result
anyway?

[Bug middle-end/69526] ivopts candidate strangeness

2016-06-07 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #17 from rdapp at linux dot vnet.ibm.com ---
(In reply to Marc Glisse from comment #16)
> (In reply to rdapp from comment #15)
> > Is there a simple method to access @inner when
> > capturing
> > (outer_op (convert (inner_op SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2)
> >                    ^----------------------------------^
> >                                  @inner
> > or, even better, both @inner as well as @0 and @1 at the same time? (Apart
> > from looking through use stmts)
> 
> (outer_op (convert (inner_op@3 SSA_NAME@0 INTEGER_CST@1)) INTEGER_CST@2))
> 
> Does @3 do what you want here? (I didn't follow the discussion)

looks good, thanks :)

[Bug tree-optimization/80925] [8 Regression] vect peeling failures

2017-05-31 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #3 from rdapp at linux dot vnet.ibm.com ---
Strange, my tests didn't show new failures on Power7. I'll have a look, perhaps
the build settings were wrong.

[Bug tree-optimization/80925] [8 Regression] vect peeling failures

2017-05-31 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
I quickly built trunk without bootstrap on power7 BE
("--enable-languages="c,c++,fortran" --disable-multilib --disable-bootstrap")
and still get no new fails.  Do I need other build parameters?  Meanwhile I got
access to a power8 LE machine and will check there.

r248678 definitely introduced fails but r248680 should actually get rid of
them.  I can't confirm for every one of these but [6/6] was made specifically
to address fails on power.  Are all of them still present after r248680?

[Bug tree-optimization/80925] [8 Regression] vect peeling failures

2017-06-01 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #7 from rdapp at linux dot vnet.ibm.com ---
I could reproduce the fails on a power8 machine now.
Looking at the vect-28.c FAIL now - the loop to be vectorized is:

  for (i = 0; i < N; i++)
{
  ia[i+off] = 5;
}

It still gets vectorized but not peeled anymore because the costs without
peeling equal the costs with peeling (for unknown alignment).  The cost of an
unaligned store is the same (1) as that of a regular store, so this is to be
expected.

At first sight, the situation is similar for vect-87.c, vect-88.c and maybe
most of the fails with '"Vectorizing an unaligned access" 0'.

How should we deal with this?  If the cost function is correct as it is and
unaligned stores are not slower at all, I don't think we should be peeling. 
What is expected for real workloads and unaligned loads/stores?

[Bug tree-optimization/80925] [8 Regression] vect peeling failures

2017-06-02 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #9 from rdapp at linux dot vnet.ibm.com ---
I built --with-cpu=power7 and still see TARGET_EFFICIENT_UNALIGNED_VSX == true
in the backend which causes unaligned stores to have costs of 1.  On my power7
system, TARGET_EFFICIENT_UNALIGNED_VSX is never true.

I stepped through rs6000_option_override_internal and
TARGET_EFFICIENT_UNALIGNED_VSX is properly unset via the options implied by
--with-cpu=power7 (or power6).

Afterwards, however, it is overwritten again by TARGET_P8_VECTOR which, in
turn, is set automatically by the vector test suite depending on the current
CPU:

  [...]
  } elseif [check_p8vector_hw_available] {
lappend DEFAULT_VECTCFLAGS "-mpower8-vector" 

Therefore, whenever the vector tests are run on a power8 CPU,
TARGET_EFFICIENT_UNALIGNED_VSX == 1, no matter the --with-cpu. This would also
explain why I didn't see the fails on my machine: there, all vect tests are
only run with -maltivec, which doesn't override TARGET_EFFICIENT_UNALIGNED_VSX.

So, the way the vect test suite is currently set up, this is kind of expected.

[Bug tree-optimization/80925] [8 Regression] vect peeling failures

2017-06-06 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80925

--- Comment #16 from rdapp at linux dot vnet.ibm.com ---
(In reply to seurer from comment #14)
> spawn -ignore SIGHUP /home/seurer/gcc/build/gcc-test/gcc/xgcc
> -B/home/seurer/gcc/build/gcc-test/gcc/
> /home/seurer/gcc/gcc-test/gcc/testsuite/gcc.dg/vect/vect-33-big-array.c
> -fno-diagnostics-show-caret -fdiagnostics-color=never -maltivec
> -mpower8-vector -ftree-vectorize -fno-vect-cost-model -fno-common -O2
> -fdump-tree-vect-details -S -o vect-33-big-array.s
> PASS: gcc.dg/vect/vect-33-big-array.c (test for excess errors)
> PASS: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "vectorized
> 1 loops" 1
> FAIL: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "Vectorizing
> an unaligned access" 0
> FAIL: gcc.dg/vect/vect-33-big-array.c scan-tree-dump-times vect "Alignment
> of access forced using peeling" 1
> Executing on host: /home/seurer/gcc/build/gcc-test/gcc/xgcc
> -B/home/seurer/gcc/build/gcc-test/gcc/
> /home/seurer/gcc/gcc-test/gcc/testsuite/gcc.dg/vect/vect-33-big-array.c 
> -fno-diagnostics-show-caret -fdiagnostics-color=never  -flto
> -ffat-lto-objects -maltivec -mpower8-vector -ftree-vectorize
> -fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -S   -o
> vect-33-big-array.s(timeout = 300)

There is still -mpower8-vector in the compile options, making unaligned stores
inexpensive.  Is this really run on a power6 CPU?

In order to bulk-disable tests that rely on peeling, would something like a
global check (e.g. target_vect_unaligned, akin to target_vect_int etc.) make
sense? This could be used to flag and disable specific tests depending on the
CPU the vect suite is run on.

[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test

2017-07-10 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #1 from rdapp at linux dot vnet.ibm.com ---
Could you provide the vectorizer dump (-fdump-tree-vect-details)? The generated
assembly might also be interesting as well as the exact command line for
building (in the test suite logs).  I compiled --with-cpu-64=power4 yet cannot
reproduce the FAIL.  For me, the compile options set by the test suite are

-maltivec -mvsx -mno-allow-movmisalign -ftree-vectorize -fno-vect-cost-model
-fno-common -O2 -fdump-tree-vect-details --param
vect-max-version-for-alias-checks=0 -lm -m32

[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test

2017-07-11 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #5 from rdapp at linux dot vnet.ibm.com ---
Ah, npeel is set by vect_peeling_hash_get_lowest_cost although the
corresponding dr is not used afterwards.  It should be safe to get rid of the
npeel parameter since we use the returned peeling's npeel anyway.  I think the
same is true for body_cost_vec, but it's not used afterwards so it doesn't
cause problems.

The following fixes the regression for me:

index 5103ba1..257be41 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -1347,7 +1347,6 @@ vect_peeling_hash_get_lowest_cost (_vect_peel_info **slot,
 static struct _vect_peel_extended_info
 vect_peeling_hash_choose_best_peeling (hash_table *peeling_htab,
                                        loop_vec_info loop_vinfo,
-                                       unsigned int *npeel,
                                        stmt_vector_for_cost *body_cost_vec)
 {
   struct _vect_peel_extended_info res;
@@ -1371,7 +1370,6 @@ vect_peeling_hash_choose_best_peeling (hash_table *peeling_htab,
       res.outside_cost = 0;
     }

-  *npeel = res.peel_info.npeel;
   *body_cost_vec = res.body_cost_vec;
   return res;
 }
@@ -1812,7 +1810,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
         unless aligned.  So we try to choose the best possible peeling from
         the hash table.  */
       peel_for_known_alignment = vect_peeling_hash_choose_best_peeling
-        (&peeling_htab, loop_vinfo, &npeel, &body_cost_vec);
+        (&peeling_htab, loop_vinfo, &body_cost_vec);
     }

   /* Compare costs of peeling for known and unknown alignment. */

[Bug target/81362] [8.0 regression] FAIL: gcc.dg/vect/no-vfa-vect-57.c execution test

2017-07-11 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81362

--- Comment #6 from rdapp at linux dot vnet.ibm.com ---
Created attachment 41715
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41715&action=edit
Tentative patch

Removed the npeel function argument and also removed body_cost_vec and the
corresponding release ()es.  I'm not so sure about the semantics of vecs in
general.  Are we leaking memory when not releasing the body_cost_vecs contained
in the various _vect_peel_extended_infos?  If body_cost_vec were RAII-like,
there would have been no need to release () it before, so I assume it is not.

[Bug libgomp/78468] [7 regression] libgomp.c/reduction-10.c and many more FAIL

2016-11-24 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78468

rdapp at linux dot vnet.ibm.com changed:

   What|Removed |Added

 CC||rdapp at linux dot vnet.ibm.com

--- Comment #12 from rdapp at linux dot vnet.ibm.com ---
(Writing from a colleague's machine ...)

Thinking about it, I'm quite sure the dynamic allocation must come from here:

>  #pragma omp parallel for reduction(*:y[:p4]) reduction(|:a[:p5])
                                           ^^                   ^^

I.e. the bug should go away if you replace either p4 or p5 (or both) with
numeric values.

Dominik ^_^  ^_^

[Bug tree-optimization/77283] [7 Regression] Revision 238005 disables loop unrolling

2017-01-12 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77283

--- Comment #11 from rdapp at linux dot vnet.ibm.com ---
Any progress on this? The patch fixes the problem for s390x (no performance
regressions), but without it we see the regression in SPEC2006's libquantum all
the time; I guess the same is true for PowerPC. Any chance for it to go into
7.0?

For s390x, disabling the path-splitting pass does not introduce performance
regressions either and fixes libquantum but that would only be a very last
resort.

[Bug tree-optimization/77283] [7 Regression] Revision 238005 disables loop unrolling

2017-01-12 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77283

--- Comment #15 from rdapp at linux dot vnet.ibm.com ---
The updated patch fixes libquantum on s390 so PR77366 might indeed be to
simplified to check for that, but it was unrolled before r238005. Addressing
libquantum is more important, of course.

[Bug middle-end/83069] [8 Regression] internal compiler error: in from_gcov_type, at profile-count.h:676

2017-11-28 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83069

rdapp at linux dot vnet.ibm.com changed:

   What|Removed |Added

 CC||rdapp at linux dot vnet.ibm.com

--- Comment #11 from rdapp at linux dot vnet.ibm.com ---
416.gamess fails on S390 as well since r254888.

I didn't immediately get what the

  if (freq_max < 16)
    freq_max = 16;

part of the patch is supposed to achieve. When freq_max < 16, tmp will later be
larger than BLOCK_INFO (bb)->frequency, which wasn't the case before. In the
worst case we approach max_count by 4 bits every time.
Why do we need a minimum freq_max? To ensure some scaling even for low
frequencies?

[Bug middle-end/69526] ivopts candidate strangeness

2016-06-23 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #19 from rdapp at linux dot vnet.ibm.com ---
(In reply to rguent...@suse.de from comment #18)
>

The match.pd patch is slowly taking shape... However, there are two things I'm
still unsure about:

1.
An existing rule in match.pd checks for overflow of the combined constant
in (a +- CST1) +- CST2 and does not perform the simplification when an overflow
occurs. My case looks quite similar; however, CST1 and CST2 are/might be
unsigned ints/longs. I assume this is because non-overflow cannot be proved by
the C frontend, which therefore internally uses an unsigned representation?

(cast) (a +- CST1) +- CST2

We can prove the non-overflow of (a +- CST1) via VRP but an overflow check of
(CST1 +- CST2) would fail since ((UINT_MAX - 1) + 1) does overflow.

So, in general, should we care about overflow of the combined operation at all
after having established that the inner operation does not overflow?
If the overflow is well-defined and we overflow, the result should be valid. If
the overflow is undefined, we could do anything, in particular optimize this,
which wouldn't even be too unexpected from a user's perspective. Wouldn't any
overflow in the combined constant occur anyway, even without combining?

I think there are only two cases to discern, regardless of overflow:

 - abs(CST1 +- CST2) < abs(CST1): then we can simplify to (cast)(a +- (CST1 +-
CST2))

 - else: we can simplify to (cast)(a) +- (CST1 +- CST2)

2.
Is there an idiomatic/correct way to check a VR_RANGE for overflow? Does it
suffice to check whether the range includes +-INF or +-INF(OVF)? I suspect
other, naive methods like checking whether min < max will fail, since the
ranges are canonicalized.

[Bug middle-end/69526] ivopts candidate strangeness

2016-06-23 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

rdapp at linux dot vnet.ibm.com changed:

   What|Removed |Added

Version|6.0 |unknown

--- Comment #21 from rdapp at linux dot vnet.ibm.com ---
(In reply to amker from comment #20)
> IIUC we can simply handle signed/unsigned type differently.  Given a has
> int/uint type. 
> For unsigned case: (unsigned long long)(a + cst1) + cst2  and a is unsigned
> int.
> It can be simplified into (unsigned long long)a + cst3 iff a + cst1 doesn't
> overflow/wrap.  Since unsigned type can wrap, simplify (unsigned long
> long)cst1 + cst2 into cst3 is always safe.
> For signed case: (long long)(a + cst) + cst2  and a is signed int.
> It can be simplified into (long long)a + cst3 iff (long long)cst1 + cst2
> doesn't overflow.  We don't need to prove (a+cst) doesn't overflow.

ok, the stipulation seems to be: assume no signed overflow if the operation was
already present, but don't provoke a potentially new overflow by combining
constants that previously would not have overflowed. Makes sense. Your cases
can be improved a little, as Richard also pointed out: when abs(cst3) <
abs(cst1), the combined operation can be done in the inner type (and the signs
match, so the overflow proof still holds).

[Bug middle-end/69526] ivopts candidate strangeness

2016-06-27 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #24 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38775
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38775&action=edit
Updated patch and test, match.pd/VRP

(In reply to amker from comment #23)
> Hmm, not sure if I mis-understood your point.
> I think there is no *overflow* concept for a VR_RANGE (or a variable),
> overflow is always a semantic concept for an (arithmetic) operation.  Take
> below loop as an example:
> 
> unsigned int i;
> for (i = 4; i != 2; i += 2)
>   {
> //
>   }
> 
> Variable i has VR_RANGE [0, MAX_VALUE - 1], but it does overflow in the loop.

Yes, a variable by itself cannot overflow, only an operation can. VRP, for
example, has a function extract_range_from_binary_expr_1 that performs some
overflow checking internally. It does so by comparing the minima and maxima of
the respective ranges of both operands. I assumed an overflow there would
somehow be propagated into the ranges, hence the question of how to access the
cast expression in match.pd.

I guess I was overcomplicating things: getting the range of both operands and
checking whether their common minimum is smaller than the maximum should
suffice for now.

Attached is the current version of the patch, no regressions on s390x and
x86-64. The test could be made a bit more specific I suppose, currently it
counts the number of "gimple_simplified"s.

[Bug middle-end/69526] ivopts candidate strangeness

2016-07-11 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69526

--- Comment #25 from rdapp at linux dot vnet.ibm.com ---
Created attachment 38875
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38875&action=edit
Updated patch and test, match.pd/VRP

Some cleanups and fixes, the patch can now handle more cases in the inner type.
Also added more test cases.

[Bug middle-end/77366] New: Rev. 2ac4967 prevents loop unrolling for s390

2016-08-24 Thread rdapp at linux dot vnet.ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77366

Bug ID: 77366
   Summary: Rev. 2ac4967 prevents loop unrolling for s390
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at linux dot vnet.ibm.com
CC: krebbel at gcc dot gnu.org
  Target Milestone: ---
  Host: s390
Target: s390

Since 2ac4967f49a70e1bd0bb28a142324f527dac3743 the following loop is not being
unrolled anymore (-O3 -funroll-loops):

void
foo(unsigned int size, unsigned int *state)
{
  unsigned int i;

  for (i = 0; i < size; i++)
    {
      if (*state & 1)
        {
          *state ^= 1;
        }
    }
}

The revision adds an additional check that allows the split-paths pass to
duplicate a basic block (it returned NULL before the patch), which in turn
causes check_simple_exit() in loop-iv.c to not find a proper niter_desc. This
seems to be due to

if (!dominated_by_p (CDI_DOMINATORS, loop->latch, exit_bb))

failing, i.e. check_simple_exit() doesn't see that the exit condition is
checked every iteration.