https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So if in addition to this patch we do

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        (revision 250386)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -7376,7 +7377,7 @@ vect_transform_loop (loop_vec_info loop_
   /* Version the loop first, if required, so the profitability check
      comes first.  */

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (check_profitability || LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       vect_loop_versioning (loop_vinfo, th, check_profitability);
       check_profitability = false;

thus always doing the profitability check by versioning, which means not
sharing the epilogue loop with the scalar execution (plus reliably executing
the cost model check first), we get down to 212s from 250s.  This might be
solely because we do not completely peel the versioned copy, as we are not
able to analyze its number of iterations (despite the dominating > 7 check).

  _33 = (unsigned int) _1;
  if (_33 > 7)
    goto <bb 18>; [80.01%] [count: INV]
  else
    goto <bb 34>; [19.99%] [count: INV]

  <bb 34> [3.00%] [count: INV]:

  <bb 35> [16.99%] [count: INV] loop 6 header:
  # m_23 = PHI <1(34), m_449(36)>
...
  m_449 = m_23 + 1;
  if (_1 < m_449)
    goto <bb 38>; [17.65%] [count: INV]
  else
    goto <bb 36>; [82.35%] [count: INV]

  <bb 36> [13.99%] [count: INV] loop 6 latch:
  goto <bb 35>; [100.00%] [count: INV]

we're probably confused by the casting here; inferring a range for _1 from
just the above would result in [INT_MIN, 7] only (good enough I guess).
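
For readers less familiar with loop versioning, here is a rough source-level
sketch of the shape being discussed: the profitability check runs first and
picks between a private scalar copy and the vector loop with its own
epilogue.  The names sum_scalar and sum_versioned, and the VF and
PROFITABLE_MIN values, are made up for illustration; this is not what the
vectorizer emits for the testcase.

```c
#include <stddef.h>

/* Hypothetical values for illustration only; the real vectorization
   factor and threshold are chosen by the vectorizer.  */
#define VF 4
#define PROFITABLE_MIN 7

/* Scalar copy of the loop, used only when the check fails.  */
static long
sum_scalar (const int *a, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Versioned form: the cost model check comes first and the scalar
   path does not share the epilogue loop with the vector path.  */
static long
sum_versioned (const int *a, size_t n)
{
  if (n <= PROFITABLE_MIN)
    return sum_scalar (a, n);

  long s = 0;
  size_t i = 0;
  /* Stands in for the vector body: VF elements per iteration.  */
  for (; i + VF <= n; i += VF)
    s += (long) a[i] + a[i + 1] + a[i + 2] + a[i + 3];
  /* Vector epilogue: at most VF - 1 iterations in this sketch.  */
  for (; i < n; i++)
    s += a[i];
  return s;
}
```

In this sketch the epilogue trip count is bounded by construction, whereas
the scalar copy's trip count is only known through the guarding check.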

We peel the vector epilogue because:

Loop 8 iterates at most 5 times.
Loop 8 likely iterates at most 5 times.
Estimating sizes for loop 8
 BB: 29, after_exit: 0
  size:   0 _372 = (integer(kind=8)) m_375;
  size:   1 _371 = _372 * stride.88_115;
...
  size:   1 _332 = _349 + _333;
  size:   1 m_331 = m_375 + 1;
  size:   2 if (_1 < m_331)
   Exit condition will be eliminated in last copy.
 BB: 30, after_exit: 1
size: 41-0, last_iteration: 41-2
  Loop size: 41
  Estimated size after unrolling: 162

that is, we determine that no stmts will be optimized away due to propagating
constants, but then apply our usual 2/3 optimistic heuristic, leading to
that estimate (162 is below max-completely-peeled-insns, which is 200).
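
The 162 can be reproduced by hand.  The following is a sketch of the
arithmetic, assuming the dump pairs "41-0" and "41-2" read as total size
minus the size expected to be eliminated; the struct and function names are
made up here and are not the ones in the GCC sources.

```c
/* Hypothetical mirror of the size record printed in the dump.  */
struct loop_size_sketch
{
  int overall;                    /* "size: 41-0"           -> 41 */
  int eliminated_by_peeling;      /* "size: 41-0"           -> 0  */
  int last_iteration;             /* "last_iteration: 41-2" -> 41 */
  int last_iteration_eliminated;  /* "last_iteration: 41-2" -> 2  */
};

/* Estimate the size after peeling NUNROLL full copies plus the last
   iteration, then apply the optimistic 2/3 factor.  */
static long
estimate_unrolled_size (const struct loop_size_sketch *s, long nunroll)
{
  long insns = nunroll * (s->overall - s->eliminated_by_peeling);
  insns += s->last_iteration - s->last_iteration_eliminated;
  insns = insns * 2 / 3;
  return insns > 0 ? insns : 1;
}
```

With the dump's numbers and 5 iterations this gives (5 * 41 + 39) * 2 / 3 =
244 * 2 / 3 = 162 in integer arithmetic.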

For small trip count loops the advantage of peeling (irrespective of size)
is a better branch predictor hit rate.

There's quite a mistake in the cost modeling of peeling for alignment, but
even with that fixed we end up with a nopeel inside-cost of 14 and a best peel
inside-cost of 13 (we manage to align one load).  Now the comparison doesn't
take into account the outside cost at all, which is 59 vs 115, but it's hard
to combine both in a sensible way ...
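
To get a feel for the numbers, here is one naive way to combine them.  It is
only a sketch, not what the vectorizer does: it assumes the outside cost is
paid once and the inside cost once per vector iteration, and it assumes 59
is the nopeel outside cost and 115 the peel outside cost.  The function
names are made up for illustration.

```c
/* Naive linear model: outside cost once, inside cost per iteration.  */
static long
total_cost (long outside, long inside, long niters)
{
  return outside + inside * niters;
}

/* Smallest iteration count at which peeling is no worse than not
   peeling under the model above; -1 if it never catches up.  */
static long
break_even_iters (long out_nopeel, long in_nopeel,
                  long out_peel, long in_peel)
{
  long saving = in_nopeel - in_peel;   /* saved per iteration */
  long extra = out_peel - out_nopeel;  /* extra one-time cost */
  if (extra <= 0)
    return 0;
  if (saving <= 0)
    return -1;
  return (extra + saving - 1) / saving;  /* round up */
}
```

Under those assumptions break_even_iters (59, 14, 115, 13) is 56, so peeling
would need 56 vector iterations just to break even, far more than a small
trip count loop executes.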
