https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119351
--- Comment #20 from GCC Commits <cvs-commit at gcc dot gnu.org> --- The master branch has been updated by Tamar Christina <tnfch...@gcc.gnu.org>: https://gcc.gnu.org/g:46ccce1de686c1b437eff43431dc20d20d4687c0 commit r15-9518-g46ccce1de686c1b437eff43431dc20d20d4687c0 Author: Tamar Christina <tamar.christ...@arm.com> Date: Wed Apr 16 13:09:05 2025 +0100 middle-end: Fix incorrect codegen with PFA and VLS [PR119351] The following example: #define N 512 #define START 2 #define END 505 int x[N] __attribute__((aligned(32))); int __attribute__((noipa)) foo (void) { for (signed int i = START; i < END; ++i) { if (x[i] == 0) return i; } return -1; } generates incorrect code with fixed length SVE because for early break we need to know which value to start the scalar loop with if we take an early exit. Historically this means that we take the first element of every induction. this is because there's an assumption in place, that even with masked loops the masks come from a whilel* instruction. As such we reduce using a BIT_FIELD_REF <, 0>. When PFA was added this assumption was correct for non-masked loop, however we assumed that PFA for VLA wouldn't work for now, and disabled it using the alignment requirement checks. We also expected VLS to PFA using scalar loops. However as this PR shows, for VLS the vectorizer can, and does in some circumstances choose to peel using masks by masking the first iteration of the loop with an additional alignment mask. When this is done, the first elements of the predicate can be inactive. In this example element 1 is inactive based on the calculated misalignment. hence the -1 value in the first vector IV element. When we reduce using BIT_FIELD_REF we get the wrong value. This patch updates it by creating a new scalar PHI that keeps track of whether we are the first iteration of the loop (with the additional masking) or whether we have taken a loop iteration already. The generated sequence: pre-header: bb1: i_1 = <number of leading inactive elements> header: bb2: i_2 = PHI <i_1(bb1), 0(latch)> ⦠early-exit: bb3: i_3 = iv_step * i_2 + PHI<vector-iv> Which eliminates the need to do an expensive mask based reduction. This fixes gromacs with one OpenMP thread. But with > 1 there is still an issue. gcc/ChangeLog: PR tree-optimization/119351 * tree-vectorizer.h (LOOP_VINFO_MASK_NITERS_PFA_OFFSET, LOOP_VINFO_NON_LINEAR_IV): New. (class _loop_vec_info): Add mask_skip_niters_pfa_offset and nonlinear_iv. * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize them. (vect_analyze_scalar_cycles_1): Record non-linear inductions. (vectorizable_induction): If early break and PFA using masking create a new phi which tracks where the scalar code needs to start... (vectorizable_live_operation): ...and generate the adjustments here. (vect_use_loop_mask_for_alignment_p): Reject non-linear inductions and early break needing peeling. gcc/testsuite/ChangeLog: PR tree-optimization/119351 * gcc.target/aarch64/sve/peel_ind_10.c: New test. * gcc.target/aarch64/sve/peel_ind_10_run.c: New test. * gcc.target/aarch64/sve/peel_ind_5.c: New test. * gcc.target/aarch64/sve/peel_ind_5_run.c: New test. * gcc.target/aarch64/sve/peel_ind_6.c: New test. * gcc.target/aarch64/sve/peel_ind_6_run.c: New test. * gcc.target/aarch64/sve/peel_ind_7.c: New test. * gcc.target/aarch64/sve/peel_ind_7_run.c: New test. * gcc.target/aarch64/sve/peel_ind_8.c: New test. * gcc.target/aarch64/sve/peel_ind_8_run.c: New test. * gcc.target/aarch64/sve/peel_ind_9.c: New test. * gcc.target/aarch64/sve/peel_ind_9_run.c: New test.