https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81303
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So if in addition to this patch we do

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c        (revision 250386)
+++ gcc/tree-vect-loop.c        (working copy)
@@ -7376,7 +7377,7 @@ vect_transform_loop (loop_vec_info loop_
   /* Version the loop first, if required, so the profitability check
      comes first.  */

-  if (LOOP_REQUIRES_VERSIONING (loop_vinfo))
+  if (check_profitability || LOOP_REQUIRES_VERSIONING (loop_vinfo))
     {
       vect_loop_versioning (loop_vinfo, th, check_profitability);
       check_profitability = false;

and thus always do the profitability check by versioning, which means not
sharing the epilogue loop with the scalar execution (plus reliably executing
the cost model check first), we get down to 212s from 250s.

This might be solely because we do not completely peel the versioned copy,
as we are not able to analyze its number of iterations (despite the
dominating > 7 check).

  _33 = (unsigned int) _1;
  if (_33 > 7)
    goto <bb 18>; [80.01%] [count: INV]
  else
    goto <bb 34>; [19.99%] [count: INV]

  <bb 34> [3.00%] [count: INV]:

  <bb 35> [16.99%] [count: INV] loop 6 header:
  # m_23 = PHI <1(34), m_449(36)>
...
  m_449 = m_23 + 1;
  if (_1 < m_449)
    goto <bb 38>; [17.65%] [count: INV]
  else
    goto <bb 36>; [82.35%] [count: INV]

  <bb 36> [13.99%] [count: INV] loop 6 latch:
  goto <bb 35>; [100.00%] [count: INV]

We're probably confused by the casting here, and inferring a range from just
the above for _1 would result in [INT_MIN, 7] only (good enough I guess).

We peel the vector epilogue because:

Loop 8 iterates at most 5 times.
Loop 8 likely iterates at most 5 times.
Estimating sizes for loop 8
 BB: 29, after_exit: 0
  size:   0 _372 = (integer(kind=8)) m_375;
  size:   1 _371 = _372 * stride.88_115;
...
  size:   1 _332 = _349 + _333;
  size:   1 m_331 = m_375 + 1;
  size:   2 if (_1 < m_331)
   Exit condition will be eliminated in last copy.
 BB: 30, after_exit: 1
size: 41-0, last_iteration: 41-2
  Loop size: 41
  Estimated size after unrolling: 162

That is, we determine that no stmts will be optimized away due to
propagating constants but then apply our usual 2/3 optimistic heuristic,
leading to that estimate (max-completely-peeled-insns is 200).

For small trip count loops the advantage of peeling (irrespective of size)
is a better branch predictor hit rate.

There's quite a mistake in the cost modeling of peeling for alignment, but
even with that fixed we end up with a nopeel inside-cost of 14 and a best
peel inside-cost of 13 (we manage to align one load). Now it doesn't take
into account outside cost at all, which is 59 vs 115, but it's hard to
combine both in a sensible way ...
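
The 162 in the size dump can be reproduced by hand. What follows is only a
sketch, assuming the estimate is "five full copies plus one last copy without
the two exit-test insns, scaled by 2/3"; that assumption happens to match the
numbers printed above. The struct and function names are made up for this
illustration and are not taken from the GCC sources.

```c
#include <assert.h>

/* Hypothetical mirror of the dump line "size: 41-0, last_iteration: 41-2".  */
struct loop_size_sketch {
    long overall;                    /* insns in one iteration (41) */
    long eliminated_by_peeling;      /* insns removed in every copy (0) */
    long last_iteration;             /* insns in the last iteration (41) */
    long last_iteration_eliminated;  /* insns removed in the last copy (2) */
};

/* Assumed shape of the estimate: NUNROLL full copies, one reduced last
   copy, and then the optimistic 2/3 scaling mentioned in the comment.  */
static long estimate_unrolled_size(const struct loop_size_sketch *s,
                                   long nunroll)
{
    assert(nunroll >= 0);
    long insns = nunroll * (s->overall - s->eliminated_by_peeling);
    insns += s->last_iteration - s->last_iteration_eliminated;
    return insns * 2 / 3;  /* integer division, rounds down */
}
```

With the dump's values and 5 iterations this gives 5 * 41 + 39 = 244, and
244 * 2 / 3 = 162, which stays below the max-completely-peeled-insns limit
of 200, so the epilogue is peeled.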
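
To get a feel for how much the ignored outside cost could matter, one naive
way to combine the two numbers is total = outside + niters * inside. This is
purely an illustrative assumption, not what the vectorizer does, and it also
assumes that 59 is the nopeel outside cost and 115 the peel outside cost (the
comment does not say which is which). All names below are hypothetical.

```c
#include <assert.h>

/* Hypothetical pair of costs as quoted in the comment.  */
struct vect_cost_sketch {
    long inside;   /* cost per vector iteration */
    long outside;  /* one-time cost outside the loop */
};

/* Naive linear model: one-time cost plus per-iteration cost.  */
static long total_cost(struct vect_cost_sketch c, long niters)
{
    assert(niters >= 0);
    return c.outside + niters * c.inside;
}

/* Smallest iteration count at which peeling is strictly cheaper than not
   peeling under the linear model, or -1 if it never is.  */
static long first_iters_where_peel_wins(struct vect_cost_sketch nopeel,
                                        struct vect_cost_sketch peel)
{
    long gain_per_iter = nopeel.inside - peel.inside;
    long extra_setup = peel.outside - nopeel.outside;

    if (extra_setup < 0)
        return 0;   /* already cheaper before the first iteration */
    if (gain_per_iter <= 0)
        return -1;  /* no per-iteration gain, never catches up */
    return extra_setup / gain_per_iter + 1;
}
```

Under these assumptions (nopeel = 14/59, peel = 13/115) the two variants tie
at 56 vector iterations and peeling only wins from 57 on, so a decision based
on inside cost alone can pick the more expensive variant for short loops.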