Hi Juzhe,
>
> Hi, Tamar.
>
> This is an amazing auto-vectorization flow.
>
> I am thinking about whether RVV can also get benefits from this optimization.
> IMHO, RVV should also be using this flow.
>
> So, to allow RVV (a target that uses len as loop control and mask as flow
> control), I am not sure whether we can do this (feel free to correct me if
> I am wrong):
>
> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +   vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
>
> Maybe it can be:
>
> if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>   {
>     if (masked_loop_p)
>       vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
>     else
>       /* Sketch: the vect_record_loop_len arguments here are filled in
>          illustratively.  */
>       vect_record_loop_len (loop_vinfo, lens, ncopies, truth_type, 1);
>   }
>
Yeah, that should be the only change required; I started this patch before
the loop_len change made it in and just rebased recently 😊
>
> + tree cond = gimple_assign_lhs (new_stmt);
> + if (masked_loop_p)
> +   {
> +     tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> +                                     truth_type, 0);
> +     cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
> +                              &cond_gsi);
> +   }
> +
> + tree t = fold_build2 (NE_EXPR, boolean_type_node, cond,
> +                       build_zero_cst (truth_type));
>
> From my understanding, you are using final_mask = loop_mask (WHILE_ULT)
> && control_mask (comparison), and then testing final_mask using NE_EXPR.
> Am I right?
Yeah, that's right. It's creating the mask for partial iterations. The only
other constraint is being able to reduce a boolean mask using inclusive OR,
but that's optional and is only needed if one side of the comparison
produces more than one copy (so it's only checked then).
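
As a rough scalar model of that test (illustrative only; the helper and its
names are mine, not the patch's):

#include <stdbool.h>

/* final_mask = loop_mask & control_mask, reduced with inclusive OR;
   the result feeds the NE_EXPR test against the all-zero mask.  */
static bool
exit_taken_p (const bool *loop_mask, const bool *cond_mask, int nlanes)
{
  bool any = false;
  for (int i = 0; i < nlanes; i++)
    any |= loop_mask[i] & cond_mask[i];  /* one lane of final_mask  */
  return any;                            /* nonzero => take the early exit  */
}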
>
> For RVV, I am thinking about whether we can have a good way to do this
> testing. I am not sure whether we can have something like
> LEN_TEST_MASK_NE (loop_len, control_mask...)
>
Hmm, is just the vect_record_loop_len change not enough? (I haven't followed
the masking implementation in RVV in detail.) But I assume it follows the
general principle that AND'ing an operation with a mask creates a masked
operation?
That is to say, I thought LOOP_LEN was only for the loop control, which
doesn't change here.
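
In other words, I'd expect something like this scalar model to hold
(illustrative only, and based on my assumption about the len-based control):

#include <stdbool.h>

/* loop_len gates which lanes are live (loop control); the comparison
   still drives the early exit (flow control) within those lanes.  */
static bool
exit_taken_p_len (const unsigned *a, unsigned x, int loop_len)
{
  for (int i = 0; i < loop_len; i++)
    if (a[i] > x)      /* a control-mask lane is set  */
      return true;     /* the early exit is needed    */
  return false;
}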
> I am not saying that we should support "early break" auto-vectorization for
> RVV (loop_len && control_mask).
> I am just writing some comments, trying to figure out how I can adapt your
> work for RVV in the future.
>
Yes, happy to help; the more uses it gets, the more bugs I can fix 😊
Cheers,
Tamar
> Thanks.
>
>
> [email protected]
>
> From: Li, Pan2
> Date: 2023-06-28 22:21
> To: [email protected]
> Subject: FW: [PATCH v5 0/19] Support early break/return auto-vectorization
> FYI.
>
> -----Original Message-----
> From: Gcc-patches <[email protected]>
> On Behalf Of Tamar Christina via Gcc-patches
> Sent: Wednesday, June 28, 2023 9:41 PM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected]
> Subject: [PATCH v5 0/19] Support early break/return auto-vectorization
>
> Hi All,
>
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch
> optab; this includes both fully masked and non-masked targets.
>
> Depending on the operation, the vectorizer may also require support for
> boolean mask reductions using inclusive OR. This is however only checked
> when the comparison would produce multiple statements.
>
> Concretely, the kinds of loops supported are of the form:
>
>   for (int i = 0; i < N; i++)
>     {
>       <statements1>
>       if (<condition>)
>         {
>           ...
>           <action>;
>         }
>       <statements2>
>     }
>
> where <action> can be:
> - break
> - return
> - goto
>
> Any number of statements can be used before the <action> occurs.
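>
> For instance, a loop of this shape, with <action> being a return, falls in
> this class (an illustrative example using the vect_a/N from the codegen
> example further down, not a test from the testsuite):
>
>   int first_above (unsigned x)
>   {
>     for (int i = 0; i < N; i++)
>       if (vect_a[i] > x)
>         return i;
>     return -1;
>   }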
>
> Since this is an initial version for GCC 14 it has the following
> limitations and features:
>
> - Only fixed sized iterations and buffers are supported. That is to say any
>   vectors loaded or stored must be to statically allocated arrays with
>   known sizes. N must also be known. This limitation is because our primary
>   target for this optimization is SVE. For VLA SVE we can't easily do
>   cross-page iteration checks, and the result is likely to also not be
>   beneficial. For that reason we punt support for variable buffers till we
>   have First-Faulting support in GCC.
> - Any stores in <statements1> should not be to the same objects as in
>   <condition>. Loads are fine as long as they don't have the possibility
>   to alias. More concretely, we block RAW dependencies when the
>   intermediate value can't be separated from the store, or the store
>   itself can't be moved (see the sketch after this list).
> - The number of loop iterations must be known; this is just a temporary
>   limitation that I intend to address in GCC 14 itself as follow-on
>   patches.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are
>   supported.
> - Any number of loop early exits are supported.
> - The early exit must be before the natural loop exit/latch. The
>   vectorizer is designed in a way that propagates PHI nodes downwards; as
>   such, supporting this inverted control flow is hard.
> - No support for epilogue vectorization. The only epilogue supported is
>   the scalar final one. Epilogue vectorization would also not be
>   profitable.
> - Early breaks are only supported for inner loop vectorization.
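>
> As a sketch of the RAW restriction above (a hypothetical shape; it is
> flagged when the intermediate value can't be separated from the store):
>
>   for (int i = 0; i < N; i++)
>     {
>       vect_a[i] = x + i;   /* store into the object the condition reads  */
>       if (vect_a[i] > x)   /* RAW dependency on the store above  */
>         break;
>     }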
>
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
>
> With the help of IPA and LTO this still gets hit quite often. During
> bootstrap it hit rather frequently. Additionally TSVC s332, s481 and s482
> all pass now, since these are tests for early exit vectorization support.
>
> This implementation does not support completely handling the early break
> inside the vector loop itself; instead it adds checks such that if we know
> we have to exit in the current iteration, we branch to scalar code to
> actually do the final VF iterations, which handles all the code in
> <action>.
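>
> As a rough scalar model of this strategy (illustrative only; it reuses the
> test4 example further down with a hypothetical VF of 4):
>
>   unsigned test4_model (unsigned x)
>   {
>     enum { VF = 4 };
>     int i = 0;
>     for (; i + VF <= N; i += VF)
>       {
>         int exit = 0;
>         for (int l = 0; l < VF; l++)   /* the vectorized exit check  */
>           exit |= vect_a[i + l] > x;
>         if (exit)
>           break;                       /* branch to the scalar code  */
>         for (int l = 0; l < VF; l++)   /* the vectorized statements  */
>           {
>             vect_b[i + l] = x + i + l;
>             vect_a[i + l] = x;
>           }
>       }
>     for (; i < N; i++)   /* scalar code: redoes at most VF iterations
>                             and executes <action> itself  */
>       {
>         vect_b[i] = x + i;
>         if (vect_a[i] > x)
>           break;
>         vect_a[i] = x;
>       }
>     return 0;
>   }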
>
> niters analysis and the majority of the vectorizer code that hardcoded
> single_exit have been updated to use a new function, vec_loop_iv, which
> returns the exit the vectorizer wants to use as the main IV exit.
>
> For niters, this exit is what determines the overall iteration count, as
> that is the O(iters) for the loop.
>
> For the scalar loop we know that whatever exit you take you have to
> perform at most VF iterations. For vector code we only care about the
> state of fully performed iterations and reset the scalar code to the
> (partially) remaining loop.
>
> This new version of the patch does the majority of the work in a newly
> rewritten loop peeling. This new function maintains LCSSA all the way
> through and no longer requires the touch-up functions the vectorizer used
> to incrementally adjust them later on. This means that aside from IV
> updates and guard edge updates the early exit code is identical to the
> single exit cases.
>
> When the loop is peeled during the copying I have to go to great lengths
> to keep the dominators up to date. All exits from the first loop are
> rewired to the loop header of the second loop, but this can change the
> immediate dominator.
>
> The dominators can change again when we wire in the loop guard, as such
> peeling now returns a list of dominators that need to be updated if a new
> guard edge is added.
>
> For the loop peeling we rewrite the loop from:
>
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>               -------3<------
>  early exit  |       |      |
>              v       v      | latch
>              7       4----->6
>              |       |
>              |       v
>              |       8
>              |       |
>              |       v
>               ------>5
>
> into
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>               -------3<------
>  early exit  |       |      |
>              v       v      | latch
>              7       4----->6
>              |       |
>              |       v
>              |       8
>              |       |
>              |       v
>              |   New Header
>              |      ---
>               ----->|x|
>                      9
>                      |
>                      v
>               ------10<------
>  early exit  |       |      |
>              v       v      | latch
>             14      11---->13
>              |       |
>              |       v
>              |      12
>              |       |
>              |       v
>               -----> 5
>
> That is to say, the first vector loop executes as long as the early exit
> isn't needed. Once the exit is taken, the scalar code will perform at most
> VF extra iterations. The exact number depends on peeling, the iteration
> start, and which exit was taken (natural or early). For this scalar loop,
> all early exits are treated the same.
>
> When we vectorize, we move any statement not related to the early break
> itself that would be incorrect to execute before the break (i.e. has side
> effects) to after the break. If this is not possible we decline to
> vectorize.
>
> This means that we check at the start of iterations whether we are going
> to exit or not. During the analysis phase we check whether we are allowed
> to do this moving of statements. Note also that we only move the scalar
> statements, and do so after peeling but just before we start transforming
> statements.
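>
> For the test4 example below, that means the vect_b[i] store (a side effect
> that sits before the break) is conceptually moved to after the exit check
> in the vector loop:
>
>   if (<any lane of vect_a[i...] > x>)   /* exit check done first        */
>     goto scalar_code;
>   vect_b[i...] = x + i...;              /* store moved after the check  */
>   vect_a[i...] = x;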
>
> Codegen:
>
> For example:
>
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4(unsigned x)
> {
>   unsigned ret = 0;
>   for (int i = 0; i < N; i++)
>     {
>       vect_b[i] = x + i;
>       if (vect_a[i] > x)
>         break;
>       vect_a[i] = x;
>     }
>   return ret;
> }
>
> We generate for Adv. SIMD:
>
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
>
> and for SVE:
>
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
>
> On the workloads this work is based on we see a 2-3x performance uplift
> using this patch.
>
> Follow up plan:
> - Boolean vectorization has several shortcomings. I've filed PR110223 with
>   the bigger ones that cause vectorization to fail with this patch.
> - SLP support. This is planned for GCC 15, as for the majority of the
>   cases building SLP itself fails. This means I'll need to spend time
>   making this more robust first. Additionally it requires:
>   * Adding support for vectorizing CFG (gconds)
>   * Support for the CFG to differ between vector and scalar loops.
>   Both of these would be disruptive to the tree, and I suspect I'll be
>   handling fallout from this patch for a while. So I plan to work on the
>   surrounding building blocks first for the remainder of the year.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Also ran across various workloads and no issues.
>
> When closer to acceptance I will run on other targets as well and clean up
> related testsuite fallouts there.
>
> --- inline copy of patch --
>
> --