https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120
--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:65a3849eb46df2fbac6b41ff78dae13c85387f9e

commit r16-5780-g65a3849eb46df2fbac6b41ff78dae13c85387f9e
Author: Tamar Christina <[email protected]>
Date:   Sun Nov 30 07:29:50 2025 +0000

vect: support vectorization of early break forced live IVs as scalar

Consider this simple loop

  long long arr[1024];

  long long *f()
  {
    int i;
    for (i = 0; i < 1024; i++)
      if (arr[i] == 42)
        break;
    return arr + i;
  }

where today we generate this at -O3:

.L2:
        add     v29.4s, v29.4s, v25.4s
        add     v28.4s, v28.4s, v26.4s
        cmp     x2, x1
        beq     .L9
.L6:
        ldp     q30, q31, [x1], 32
        cmeq    v30.2d, v30.2d, v27.2d
        cmeq    v31.2d, v31.2d, v27.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x3, d31
        cbz     x3, .L2

which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
scalar one and two vector ones, one counting up and one counting down
(PR115120), and has forced unrolling due to an increase in VF because of
the mismatch in modes between the IVs and the loop body (PR119860).

This patch fixes all three of these issues and we now generate:

.L2:
        add     w2, w2, 2
        cmp     w2, 1024
        beq     .L13
.L5:
        ldr     q31, [x1]
        add     x1, x1, 16
        cmeq    v31.2d, v31.2d, v30.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x0, d31
        cbz     x0, .L2

or with SVE:

.L3:
        add     x1, x1, x3
        whilelo p7.d, w1, w2
        b.none  .L11
.L4:
        ld1d    z30.d, p7/z, [x0, x1, lsl 3]
        cmpeq   p7.d, p7/z, z30.d, z31.d
        b.none  .L3

which shows that the new scalar IV is efficiently merged with the loop
control one by IVopts.

To accomplish this the patch reworks how we handle "forced live"
inductions with regard to vectorization.  Prior to this change, when
vectorizing a loop with an early break, any induction variables would be
forced live.  Forcing live means that even though the values aren't used
inside the loop we must preserve them so that when we start the scalar
loop we can pass in the correct initial values.

However this had several side effects:

1. We must be able to vectorize the induction.

2. The induction variable participates in VF determination.  This would
   often lead to a higher VF than would normally have been needed, so the
   vector loops become less profitable.

3. IVcanon on constant loop iterations inserts a downward-counting IV in
   addition to the upward one in order to support things like doloops.
   Normally this duplicate IV is removed by IVopts, but IVopts doesn't
   understand vector inductions.  As such we end up with 3 IVs.

This patch fixes all three of these by instead creating a new scalar IV
that is adjusted within the loop and using it to update all the IV
statements outside the loop.  We re-use vect_update_ivs_after_vectorizer
for all exits now and put in a dummy value representing the IV that is to
be generated later.  To do this we delay the call to
vect_update_ivs_after_vectorizer until after the skip_epilogue edge is
created, and vect_update_ivs_after_vectorizer now updates all out-of-loop
usages of IVs, not just those on the merge edge to the scalar loop.  This
not only generates better code, but removes the need to fix up the
"forced live" scalar IVs later on.

This new scalar IV is then materialized in
vect_update_ivs_after_vectorizer_for_early_breaks.

When PFA using masks skips iterations, we now roll the PFA IV into the
new scalar IV by adjusting the first iteration back to start - niters_peel
and then taking MAX <scal_iv, 0> to correctly handle the first iteration.
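
As a rough illustration only (not the vectorizer's actual output), the
idea can be pictured in plain C for the example loop above, assuming
VF == 2; the names f_sketch and scal_iv are invented for this sketch.

  long long arr[1024];

  long long *f_sketch (void)
  {
    int scal_iv = 0;                    /* the single scalar IV */

    for (int i = 0; i < 1024; i += 2)   /* vectorized body, VF == 2 */
      {
        /* Stands in for the vector compare and early-break test.  */
        if (arr[i] == 42 || arr[i + 1] == 42)
          break;                        /* early exit: scal_iv == i */
        scal_iv += 2;                   /* advanced once per vector iteration */
      }

    /* Out-of-loop IV uses are rewritten in terms of scal_iv; the scalar
       loop, if entered, resumes from it and finds the exact element.  */
    int i = scal_iv;
    for (; i < 1024; i++)
      if (arr[i] == 42)
        break;

    return arr + i;
  }

With PFA using masks the same scalar IV would conceptually start at
-niters_peel and be clamped with MAX <scal_iv, 0> on the first iteration,
as described above.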
Because we are now re-using vect_update_ivs_after_vectorizer we have an
issue with UB clamping on non-linear inductions.  At the moment, when
doing early-exit updating, I just ignore the possibility of UB since, if
the main exit is OK, the early exit is one iteration behind the main one
and so should be OK as well.  Things however get complicated with PEELED
loops.

gcc/ChangeLog:

        PR tree-optimization/115120
        PR tree-optimization/119577
        PR tree-optimization/119860
        * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for
        nonlinear mult induction and early break.
        (vect_update_ivs_after_vectorizer): Support early break exits.
        (vect_do_peeling): Support scalar IVs.
        * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early
        break.
        (vect_update_nonlinear_iv): Use `unsigned_type_for` such that the
        function works for both vector and scalar types.
        (vectorizable_induction, vectorizable_live_operation): Remove
        vector early break IV code.
        (vect_update_ivs_after_vectorizer_for_early_breaks): New.
        (vect_transform_loop): Support new scalar IV for early break.
        * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early
        break IVs.
        * tree-vect-stmts.cc (vect_stmt_relevant_p): Mark early break IVs
        as completely unused rather than used_only_live.  They no longer
        contribute to the vector loop and so should not be analyzed.
        (can_vectorize_live_stmts): Remove vector early break IV code.
        * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
        (class loop_vec_info): Add early_break_niters_var.

gcc/testsuite/ChangeLog:

        PR tree-optimization/115120
        PR tree-optimization/119577
        PR tree-optimization/119860
        * gcc.dg/vect/vect-early-break_39.c: Update.
        * gcc.dg/vect/vect-early-break_139.c: New testcase.
        * gcc.target/aarch64/sve/peel_ind_10.c: Update.
        * gcc.target/aarch64/sve/peel_ind_11.c: Update.
        * gcc.target/aarch64/sve/peel_ind_12.c: Update.
        * gcc.target/aarch64/sve/peel_ind_5.c: Update.
        * gcc.target/aarch64/sve/peel_ind_6.c: Update.
        * gcc.target/aarch64/sve/peel_ind_7.c: Update.
        * gcc.target/aarch64/sve/peel_ind_9.c: Update.
        * gcc.target/aarch64/sve/pr119351.c
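
As a minimal, invented illustration of the `unsigned_type_for` note in the
ChangeLog above (plain C, not the actual tree-vect-loop.cc code): for a
multiplicative non-linear induction, advancing the IV by k steps as
init * step^k can overflow a signed type, so the arithmetic is done in the
matching unsigned type, where wrap-around is well defined.

  #include <stdint.h>

  /* Hypothetical helper, not a GCC function: advance a multiplicative IV
     by k steps using unsigned arithmetic to avoid signed-overflow UB.  */
  int32_t advance_mult_iv (int32_t init, int32_t step, unsigned int k)
  {
    uint32_t v = (uint32_t) init;
    while (k--)
      v *= (uint32_t) step;     /* unsigned multiply wraps, no UB */
    return (int32_t) v;         /* convert back to the IV's signed type */
  }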
