https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:65a3849eb46df2fbac6b41ff78dae13c85387f9e

commit r16-5780-g65a3849eb46df2fbac6b41ff78dae13c85387f9e
Author: Tamar Christina <[email protected]>
Date:   Sun Nov 30 07:29:50 2025 +0000

    vect: support vectorization of early break forced live IVs as scalar

    Consider this simple loop

    long long arr[1024];
    long long *f()
    {
        int i;
        for (i = 0; i < 1024; i++)
          if (arr[i] == 42)
            break;
        return arr + i;
    }

    where today we generate this at -O3:

    .L2:
            add     v29.4s, v29.4s, v25.4s
            add     v28.4s, v28.4s, v26.4s
            cmp     x2, x1
            beq     .L9
    .L6:
            ldp     q30, q31, [x1], 32
            cmeq    v30.2d, v30.2d, v27.2d
            cmeq    v31.2d, v31.2d, v27.2d
            addhn   v31.2s, v31.2d, v30.2d
            fmov    x3, d31
            cbz     x3, .L2

    which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
    scalar one and two vector ones, one counting up and one counting down
    (PR115120).  It also has a forced unrolling due to an increase in VF
    because of the mismatch in modes between the IVs and the loop body
    (PR119860).

    This patch fixes all three of these issues and we now generate:

    .L2:
            add     w2, w2, 2
            cmp     w2, 1024
            beq     .L13
    .L5:
            ldr     q31, [x1]
            add     x1, x1, 16
            cmeq    v31.2d, v31.2d, v30.2d
            umaxp   v31.4s, v31.4s, v31.4s
            fmov    x0, d31
            cbz     x0, .L2

    or with sve

    .L3:
            add     x1, x1, x3
            whilelo p7.d, w1, w2
            b.none  .L11
    .L4:
            ld1d    z30.d, p7/z, [x0, x1, lsl 3]
            cmpeq   p7.d, p7/z, z30.d, z31.d
            b.none  .L3

    which shows that the new scalar IV is efficiently merged with the loop
    control one based on IVopts.

    To accomplish this the patch reworks how we handle "forced live inductions"
    with regard to vectorization.

    Prior to this change, when we vectorized a loop with early break, any
    induction variables would be forced live.  Forcing live means that even
    though the values aren't used inside the loop we must preserve them so that
    when we start the scalar loop we can pass the correct initial values.

    However this had several side-effects:

    1. We must be able to vectorize the induction.
    2. The induction variable participates in VF determination.  This would
       often lead to a higher VF than would normally have been needed.  As such
       the vector loops become less profitable.
    3. IVcanon on constant loop iterations inserts a downward counting IV in
       addition to the upwards one in order to support things like doloops.
       Normally this duplicate IV is removed by IVopts, but IVopts doesn't
       understand vector inductions.  As such we end up with 3 IVs.
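
    As a rough illustration of point 2, the arithmetic can be sketched as
    below.  This is a simplified model and not the vectorizer's code: the
    helper names lanes and vf_for are hypothetical, and it assumes that VF is
    simply the largest lane count over all types that must be vectorized.

```c
/* Simplified model (an assumption, not GCC's implementation): with a
   fixed vector width, each element type yields a lane count, and the
   VF is the largest lane count over all vectorized types.  */

/* Number of lanes of ELEM_BITS-wide elements in a VECTOR_BITS register.  */
unsigned lanes(unsigned vector_bits, unsigned elem_bits)
{
    return vector_bits / elem_bits;
}

/* VF for a loop whose vectorized statements use the N element widths
   listed in ELEM_BITS.  */
unsigned vf_for(unsigned vector_bits, const unsigned *elem_bits, unsigned n)
{
    unsigned vf = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned l = lanes(vector_bits, elem_bits[i]);
        if (l > vf)
            vf = l;
    }
    return vf;
}
```

    With 128-bit vectors, the long long data alone gives 2 lanes, but adding a
    vectorized int IV gives 4 lanes, so the data has to be processed as two
    vectors per iteration.  That is consistent with the ldp of two q registers
    in the first listing and the single ldr in the second.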

    This patch fixes all three of these by choosing instead to create a new
    scalar IV that's adjusted within the loop and to update all the IV
    statements outside the loop by using this new IV.
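
    The idea can be modelled by hand in plain C.  This is only a sketch of the
    resulting loop shape, not what the vectorizer emits: the inner lane loop
    stands in for one vector compare, VF is assumed to be 2 (two 64-bit lanes
    in a 128-bit vector), and find_42 is a hypothetical name.

```c
#define N 1024
#define VF 2   /* assumed: two 64-bit lanes per 128-bit vector */

long long arr[N];

/* Hand-written model of the early-break loop with a single scalar IV.
   The IV `i` steps by VF, serves as the loop-control counter, and is
   the only value the code after the loop needs.  */
long long *find_42(void)
{
    int i = 0;                      /* the scalar IV */

    for (; i + VF <= N; i += VF) {
        int any = 0;
        /* Stands in for one vector compare over VF lanes.  */
        for (int lane = 0; lane < VF; lane++)
            any |= (arr[i + lane] == 42);
        if (any)
            break;                  /* early exit: i is the block start */
    }

    /* Scalar loop resumes from the scalar IV to find the exact element.  */
    for (; i < N; i++)
        if (arr[i] == 42)
            break;

    return arr + i;
}
```

    No vector IV is needed here: every use of i outside the vector loop is
    computed from the one scalar counter.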

    We re-use vect_update_ivs_after_vectorizer for all exits now and put in a
    dummy value representing the IV that is to be generated later.

    To do this we delay the call to vect_update_ivs_after_vectorizer until after
    the skip_epilogue edge is created, and vect_update_ivs_after_vectorizer now
    updates all out-of-loop usages of IVs and not just those in the merge edge
    to the scalar loop.  This not only generates better code, but negates the
    need to fix up the "forced live" scalar IVs later on.

    This new scalar IV is then materialized in
    vect_update_ivs_after_vectorizer_for_early_breaks.  When doing PFA using
    masks by skipping iterations we now roll up the PFA IV into the new scalar
    IV by adjusting the first iteration back from start - niters_peel and then
    taking the MAX <scal_iv, 0> to correctly handle the first iteration.
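
    One way to picture the masked-peeling adjustment is the scalar model
    below.  It is a sketch under assumptions, not the code the patch
    generates: find_masked_pfa and max_int are hypothetical names, start is
    taken to be 0, VF is assumed to be 4, and masked-off lanes are modelled by
    skipping negative indices.

```c
#define VF 4   /* assumed vectorization factor for this sketch */

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Returns the index of the first element of A[0..N) equal to KEY, or N
   if there is none.  NITERS_PEEL is the number of leading lanes masked
   off in the first vector iteration (0 <= NITERS_PEEL < VF).  */
int find_masked_pfa(const long long *a, int n, int niters_peel, long long key)
{
    /* The scalar IV starts at start - niters_peel, with start == 0.  */
    int scal_iv = -niters_peel;

    for (; scal_iv + VF <= n; scal_iv += VF) {
        int any = 0;
        for (int lane = 0; lane < VF; lane++) {
            int idx = scal_iv + lane;
            if (idx < 0)
                continue;           /* lane masked off by peeling */
            any |= (a[idx] == key);
        }
        if (any)
            break;                  /* early exit */
    }

    /* MAX <scal_iv, 0>: an exit in the first iteration leaves a negative
       IV, which must be clamped before the scalar loop resumes.  */
    int i = max_int(scal_iv, 0);
    for (; i < n; i++)
        if (a[i] == key)
            break;
    return i;
}
```

    Without the clamp, a match found in the first (partially masked) iteration
    would restart the scalar loop at a negative index.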

    Because we are now re-using vect_update_ivs_after_vectorizer we have an
    issue with UB clamping on non-linear inductions.

    At the moment when doing early exit updating I just ignore the possibility
    of UB since if the main exit is OK, the early exit is one iteration behind
    the main one and so should be OK.

    Things however get complicated with PEELED loops.

    gcc/ChangeLog:

            PR tree-optimization/115120
            PR tree-optimization/119577
            PR tree-optimization/119860
            * tree-vect-loop-manip.cc (vect_can_advance_ivs_p): Check for
            nonlinear mult induction and early break.
            (vect_update_ivs_after_vectorizer): Support early break exits.
            (vect_do_peeling): Support scalar IVs.
            * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Support early
            break.
            (vect_update_nonlinear_iv): Use `unsigned_type_for` such that
            function works for both vector and scalar types.
            (vectorizable_induction, vectorizable_live_operation): Remove
            vector early break IV code.
            (vect_update_ivs_after_vectorizer_for_early_breaks): New.
            (vect_transform_loop): Support new scalar IV for early break.
            * tree-vect-slp.cc (vect_analyze_slp): Remove SLP build for early
            break IVs.
            * tree-vect-stmts.cc (vect_stmt_relevant_p): No longer mark early
            break IVs as completely unused rather than used_only_live.  They no
            longer contribute to the vector loop and so should not be analyzed.
            (can_vectorize_live_stmts): Remove vector early break IV code.
            * tree-vectorizer.h (LOOP_VINFO_EARLY_BRK_NITERS_VAR): New.
            (class loop_vec_info): Add early_break_niters_var.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/115120
            PR tree-optimization/119577
            PR tree-optimization/119860
            * gcc.dg/vect/vect-early-break_39.c: Update.
            * gcc.dg/vect/vect-early-break_139.c: New testcase.
            * gcc.target/aarch64/sve/peel_ind_10.c: Update.
            * gcc.target/aarch64/sve/peel_ind_11.c: Update.
            * gcc.target/aarch64/sve/peel_ind_12.c: Update.
            * gcc.target/aarch64/sve/peel_ind_5.c: Update.
            * gcc.target/aarch64/sve/peel_ind_6.c: Update.
            * gcc.target/aarch64/sve/peel_ind_7.c: Update.
            * gcc.target/aarch64/sve/peel_ind_9.c: Update.
            * gcc.target/aarch64/sve/pr119351.c