https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119351
--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Looking at it some more, I think the loop is valid to vectorize. But we don't
seem to vectorize the reduction jumping back to the outer loop:

;; basic block 384, loop depth 3, count 8598980 (estimated locally, freq 26.9637), maybe hot
;; prev block 383, next block 458, flags: (NEW, REACHABLE, VISITED)
;; pred: 387 [94.5% (guessed)] count:8598980 (estimated locally, freq 26.9637) (TRUE_VALUE,EXECUTABLE)
# RANGE [irange] int [1, +INF]
_1643 = ci_y_1924 + 1;
_3519 = vect_vec_iv_.4957_3520 + { 4, 4, 4, 4 };
# PT = nonlocal escaped null
vectp.4959_3503 = vectp.4959_3504 + 16;
ivtmp_3495 = ivtmp_3496 + 4;
_3493 = (unsigned int) ivtmp_3495;
next_mask_3470 = .WHILE_ULT (_3493, _3494, { 0, 0, 0, 0 });
if (next_mask_3470 == { 0, 0, 0, 0 })
  goto <bb 844>; [5.50%]
else
  goto <bb 458>; [94.50%]
;; succ: 844 [5.5% (guessed)] count:472944 (estimated locally, freq 1.4830) (TRUE_VALUE,EXECUTABLE)
;;       458 [94.5% (guessed)] count:8126036 (estimated locally, freq 25.4807) (FALSE_VALUE,EXECUTABLE)

;; basic block 458, loop depth 3, count 8126036 (estimated locally, freq 25.4807), maybe hot
;; prev block 384, next block 844, flags: (NEW, REACHABLE, VISITED)
;; pred: 384 [94.5% (guessed)] count:8126036 (estimated locally, freq 25.4807) (FALSE_VALUE,EXECUTABLE)
goto <bb 387>; [100.00%]
;; succ: 387 [always] count:8126036 (estimated locally, freq 25.4807) (FALLTHRU,DFS_BACK,EXECUTABLE)

;; basic block 844, loop depth 2, count 472944 (estimated locally, freq 1.4830), maybe hot
;; prev block 458, next block 840, flags: (NEW, VISITED)
;; pred: 384 [5.5% (guessed)] count:472944 (estimated locally, freq 1.4830) (TRUE_VALUE,EXECUTABLE)
# RANGE [irange] int [1, +INF]
# _3469 = PHI <_1643(384)>
_3538 = niters.4954_3792;
_3536 = (intD.10) _3538;
tmp.4955_3537 = ci_y_1923 + _3536;
if (_3538 == niters.4954_3792)
  goto <bb 385>; [25.00%]
else
  goto <bb 840>; [75.00%]

where

  _1643 = ci_y_1924 + 1;

has stayed scalar, and so the LCSSA code inserts a PHI here:

  # _3469 = PHI <_1643(384)>

which is unused, as BB 384 is considered the main exit. So it assumes that if
you exit from 384 -> 844 you've done all iterations, and so it just uses
niters + ci_y_1923, i.e. it just adds the number of iterations to ci_y.

So I don't think that's wrong. I could use some help here richi in whether this
loop *is* valid to vectorize or not.

I've not yet been able to create a small reproducer but the loop looks like:

(gdb) p debug_loop (loop, 3)
loop_48 (header = 387, latch = 458, finite_p upper_bound 2147483647 likely_upper_bound 2147483647 iterations by profile: 8.347976 (unreliable, maybe flat) entry count:1081571 (estimated locally, freq 3.3915))
{
  bb_384 (preds = {bb_387 }, succs = {bb_385 bb_458 })
  {
    <bb 384> [local count: 9554422]:
    _1643 = ci_y_1924 + 1;
    if (_1643 == _1649)
      goto <bb 385>; [5.50%]
    else
      goto <bb 458>; [94.50%]
  }
  bb_458 (preds = {bb_384 }, succs = {bb_387 })
  {
    <bb 458> [local count: 9028929]:
    goto <bb 387>; [100.00%]
  }
  bb_387 (preds = {bb_458 bb_386 }, succs = {bb_384 bb_388 })
  {
    <bb 387> [local count: 10110500]:
    # ci_y_1924 = PHI <_1643(458), ci_y_1923(386)>
    _1651 = _1650 + ci_y_1924;
    _1652 = _1651 + 1;
    _1653 = (long unsigned int) _1652;
    _1655 = _1653 * 4;
    _1656 = _1654 + _1655;
    # VUSE <.MEM_1725>
    _1657 = *_1656;
    if (_1657 <= ci_1918)
      goto <bb 384>; [94.50%]
    else
      goto <bb 388>; [5.50%]
  }
}

(gdb) p debug_bb_n_slim (388)
;; basic block 388, loop depth 1
;;  pred:       387
# ci_x_1703 = PHI <ci_x_1920(387)>
# _2327 = PHI <_1651(387)>
# ci_y_2179 = PHI <ci_y_1924(387)>
if (_333 != 0)
  goto <bb 62>; [50.00%]
else
  goto <bb 61>; [50.00%]
;;  succ:       61
;;              62

(gdb) p debug_bb_n_slim (385)
;; basic block 385, loop depth 2
;;  pred:       384
_1646 = ci_x_1920 + 1;
;;  succ:       386

(gdb) p debug_bb_n_slim (386)
;; basic block 386, loop depth 2
;;  pred:       382
;;              385
# ci_x_1920 = PHI <ci_x_1919(382), _1646(385)>
# ci_y_1923 = PHI <ci_y_1922(382), 0(385)>
_1650 = _1649 * ci_x_1920;
;;  succ:       387

(gdb) p debug (pre_header)
<edge (386 -> 387)>

So the reduction values look sane and the vector code looks sane; I'll instead
focus first on values that change during the loop.
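
For readers following along, here is a hand-written scalar reading of the loop_48 dump in C. It is only a sketch and not a reproducer for this PR. The names `inner_loop`, `struct inner_result`, `a`, `n` and `ci` are hypothetical stand-ins for the pointer `_1654`, the bound `_1649` and the limit `ci_1918`. The `int` element type is an assumption based on the `* 4` stride, the `assert` is a sanity check added for the sketch, and the outer loop is not modeled because its exit condition is not in the dump.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: which exit was taken and the live-out ci_y.  */
struct inner_result
{
  bool early_exit;   /* true: left via bb 387 -> bb 388 */
  int ci_y;          /* value of ci_y on the exit edge */
};

/* Scalar model of loop_48 only (header bb 387, latch bb 458).
   Assumption: A points to 4-byte ints, N models _1649, CI models ci_1918.  */
struct inner_result
inner_loop (const int *a, int n, int ci_x, int ci_y, int ci)
{
  assert (n > 0 && ci_y >= 0);   /* assumption of this sketch */

  int base = n * ci_x;           /* _1650 = _1649 * ci_x_1920 */
  for (;;)
    {
      /* bb 387: _1657 = *(_1654 + (unsigned long) (_1650 + ci_y + 1) * 4) */
      int v = a[(unsigned long) (base + ci_y + 1)];
      if (v > ci)
        /* Early exit to bb 388: ci_y_2179 = PHI <ci_y_1924(387)> */
        return (struct inner_result) { true, ci_y };

      /* bb 384: _1643 = ci_y_1924 + 1 */
      ci_y = ci_y + 1;
      if (ci_y == n)
        /* Counted exit to bb 385, where the outer loop resets ci_y to 0 */
        return (struct inner_result) { false, ci_y };
    }
}
```

In this model the counted exit always leaves with `ci_y == n`, which is the value the vectorized main exit would recompute as the starting `ci_y` plus the iteration count (`tmp.4955_3537 = ci_y_1923 + _3536`), while the early exit leaves with the index of the element that failed the `<=` test.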