On Tue, 11 Nov 2025, Christopher Bazley wrote:
>
> On 07/11/2025 13:53, Richard Biener wrote:
> > On Thu, 6 Nov 2025, Christopher Bazley wrote:
> >
> >> On 05/11/2025 12:25, Richard Biener wrote:
> >>> On Tue, 4 Nov 2025, Christopher Bazley wrote:
> >>>
> >>>> On 28/10/2025 13:29, Richard Biener wrote:
> >>>>>> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
> >>>>>> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> >>>>>> +   Masking is only required for the tail, therefore NULL_TREE is returned for
> >>>>>> +   every value of INDEX except the last.  Insert any set-up statements before
> >>>>>> +   GSI.  */
> >>>>> I think it might happen that some vectors are fully masked, say for
> >>>>> a conversion from double to int with V2DImode vs. V4SImode: when we
> >>>>> have 5 lanes, the conversion likely expects 4 V2DImode inputs to
> >>>>> produce 2 V4SImode outputs, but the 4th V2DImode input has no active
> >>>>> lanes at all.
> >>>>>
> >>>>> But maybe you handle this situation differently, I'll see.
> >>>> You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI
> >>>> inactive, and floor(3DI / 2DI) = 1 of 2DI fully masked) to 2 of
> >>>> V4SI = 8SI (8SI - 5SI = 3SI inactive, and floor(3SI / 4SI) = 0 of
> >>>> V4SI fully masked).
> >>>>
> >>>> I don't think that the "1 of 2DI is fully masked" would ever happen,
> >>>> though, because a group of 5DI would be split long before the
> >>>> vectoriser attempts to materialise masks. The only reason that a
> >>>> group of 5DI might be allowed to survive that long would be if the
> >>>> number of subparts of the natural vector type (the one currently
> >>>> being tried by vect_slp_region) were at least 5, a factor of 5, or
> >>>> both. No such vector types exist.
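
To restate that lane arithmetic in executable form (purely an illustration,
with made-up names; this is not code from the series):

#include <stdio.h>

/* If a group of GROUP_SIZE active scalar lanes has to be spread over
   NVECTORS vectors of VEC_LANES lanes each (NVECTORS may be dictated by the
   other side of a conversion), count the inactive padding lanes and the
   trailing vectors that end up with no active lane at all.  */
static void
count_masked (unsigned group_size, unsigned nvectors, unsigned vec_lanes)
{
  unsigned inactive = nvectors * vec_lanes - group_size;
  unsigned fully_masked = inactive / vec_lanes;
  printf ("%u lanes in %u x V%u: %u inactive lane(s), %u fully masked\n",
          group_size, nvectors, vec_lanes, inactive, fully_masked);
}

int
main (void)
{
  count_masked (5, 4, 2);  /* input side:  3 inactive, 1 of V2DI fully masked */
  count_masked (5, 2, 4);  /* output side: 3 inactive, 0 of V4SI fully masked */
  return 0;
}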
> >>>>
> >>>> For example, consider this translation unit:
> >>>>
> >>>> #include <stdint.h>
> >>>>
> >>>> void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
> >>>> {
> >>>>   (*si)[0] = (*di)[0];
> >>>>   (*si)[1] = (*di)[1];
> >>>>   (*si)[2] = (*di)[2];
> >>>>   (*si)[3] = (*di)[3];
> >>>>   (*si)[4] = (*di)[4];
> >>>> }
> >>>>
> >>>> Is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
> >>>> --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable)
> >>>> as:
> >>>>
> >>>> convert:
> >>>> .LFB0:
> >>>>         .cfi_startproc
> >>>>         ldp     q30, q31, [x0]   ; vector load the first four lanes
> >>>>         ptrue   p7.d, vl2        ; enable two lanes for vector stores
> >>>>         add     x2, x1, 8
> >>>>         ldr     x0, [x0, 32]     ; load the fifth lane
> >>>>         st1w    z30.d, p7, [x1]  ; store least-significant 32 bits of the first two lanes
> >>>>         st1w    z31.d, p7, [x2]  ; store least-significant 32 bits of lanes 3 and 4
> >>>>         str     w0, [x1, 16]     ; store least-significant 32 bits of fifth lane
> >>>>         ret
> >>>>         .cfi_endproc
> >>>>
> >>>> The slp2 dump shows:
> >>>>
> >>>> note: Starting SLP discovery for
> >>>> note: (*si_13(D))[0] = _2;
> >>>> note: (*si_13(D))[1] = _4;
> >>>> note: (*si_13(D))[2] = _6;
> >>>> note: (*si_13(D))[3] = _8;
> >>>> note: (*si_13(D))[4] = _10;
> >>>> note: Created SLP node 0x4bd9e00
> >>>> note: starting SLP discovery for node 0x4bd9e00
> >>>> note: get vectype for scalar type (group size 5): uint32_t
> >>>> note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
> >>>> note: vectype: vector([4,4]) unsigned int
> >>>> note: nunits = [4,4]
> >>>> missed: Build SLP failed: unrolling required in basic block SLP
> >>>>
> >>>> This fails the check in vect_record_nunits because the group size of 5
> >>>> may be larger than the number of subparts of vector([4,4]) unsigned int
> >>>> (which could be as low as 4) and 5 is never an integral multiple of
> >>>> [4,4].
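
The poly-int reasoning behind that check can be spelt out with a toy
fixed-width model; nothing below is the real GCC API, just the arithmetic,
treating vector([4,4]) as 4 + 4*x lanes for an unknown runtime x >= 0:

#include <stdbool.h>
#include <stdio.h>

int
main (void)
{
  const unsigned group_size = 5;
  bool known_ge = true;       /* nunits >= group_size for every sampled x?  */
  bool known_multiple = true; /* group_size % nunits == 0 for every sampled x?  */

  /* Sampling a few values of x is enough here, because we only need to
     show that both properties fail for at least one x.  */
  for (unsigned x = 0; x <= 5; x++)
    {
      unsigned nunits = 4 + 4 * x; /* subparts of vector([4,4]) unsigned int */
      if (nunits < group_size)
        known_ge = false;
      if (group_size % nunits != 0)
        known_multiple = false;
    }

  /* Prints known_ge=0 known_multiple=0: the group of 5 neither always fits
     in one vector nor always fills whole vectors, hence "unrolling required
     in basic block SLP".  */
  printf ("known_ge=%d known_multiple=%d\n", known_ge, known_multiple);
  return 0;
}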
> >>>>
> >>>> The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
> >>> I had the impression the intent of this series is to _not_ split the
> >>> groups in this case. On x86 with V2DImode / V4SImode (aka SSE2)
> >> Not exactly. Richard Sandiford did tell me (months ago) that this task is
> >> about trying to avoid splitting, but I think that is not the whole story.
> >> Richard's initial example of a function that is not currently vectorised,
> >> but could be with tail-predication, was:
> >>
> >> void
> >> foo (char *x, int n)
> >> {
> >>   x[0] += 1;
> >>   x[1] += 2;
> >>   x[2] += 1;
> >>   x[3] += 2;
> >>   x[4] += 1;
> >>   x[5] += 2;
> >> }
> >>
> >> A group of 6QI such as that shown in the function above would not need
> >> to be split because each lane is only one byte wide, not a double word
> >> (unlike in your example of a conversion from 5DF to 5SI). A group of 6QI
> >> can always be stored in one vector of type VNx16QI, because VNx16QI's
> >> minimum number of lanes is 16.
> >>
> >> ptrue p7.b, vl6
> >> ptrue p6.b, all
> >> ld1b z31.b, p7/z, [x0] ; one predicated load
> >> adrp x1, .LC0
> >> add x1, x1, :lo12:.LC0
> >> ld1rqb z30.b, p6/z, [x1]
> >> add z30.b, z31.b, z30.b
> >> st1b z30.b, p7, [x0] ; one predicated store
> >> ret
> >>
> >> If some target architecture provides both VNx8DF and VNx8SI then your
> >> example conversion wouldn't result in a split either, because the group
> >> size of 5 would certainly be smaller than the number of subparts of
> >> vector([8,8]) double and the fact that 5 is not an integral multiple of
> >> [8,8] would be irrelevant. (SVE doesn't provide either type in
> >> implementations that I'm aware of.)
> >>
> >> However, I believe it could also be beneficial to be able to vectorise
> >> functions with more than a small number of operations in them (e.g., 26
> >> instead of 6 operations):
> >>
> >> void
> >> foo (char *x, int n)
> >> {
> >>   x[0] += 1;
> >>   x[1] += 2;
> >>   x[2] += 1;
> >>   x[3] += 2;
> >>   x[4] += 1;
> >>   x[5] += 2;
> >>   x[6] += 1;
> >>   x[7] += 2;
> >>   x[8] += 1;
> >>   x[9] += 2;
> >>   x[10] += 1;
> >>   x[11] += 2;
> >>   x[12] += 1;
> >>   x[13] += 2;
> >>   x[14] += 1;
> >>   x[15] += 2;
> >>   x[16] += 1;
> >>   x[17] += 2;
> >>   x[18] += 1;
> >>   x[19] += 2;
> >>   x[20] += 1;
> >>   x[21] += 2;
> >>   x[22] += 1;
> >>   x[23] += 2;
> >>   x[24] += 1;
> >>   x[25] += 2;
> >> }
> >>
> >> Admittedly, such cases are probably rarer than small groups in real code.
> >>
> >> In such cases, even a group of byte-size operations might need to be
> >> split in order to be vectorised, e.g. a group of 26QI additions could be
> >> vectorised with VNx16QI as 16QI + 10QI. A mask would be generated for
> >> both groups:
> > Note you say "split" and mean you have two vector operations in the end.
> > But what I refer to with "split" is the split into two different SLP graphs;
> > usually, even with BB vectorization, a single SLP node can happily
> > represent multiple vectors (with the same vector type) when necessary
> > to fill all lanes.
>
> Thanks for clarifying that.
>
> My original concept of splitting was probably based on something Richard
> Sandiford said about the desirability of using Advanced SIMD instead of SVE to
> vectorise the part of a large group that does not require tail-predication. At
> that time, I was not aware that the target backend could automatically
> generate Advanced SIMD instructions for WHILE_ULT operations in which the mask
> has all bits set. I therefore assumed that it would be necessary to split such
> groups. Splitting is also a natural consequence of the existing control flow
> in the vect_analyze_slp_instance function.
>
> > But to agree to that we still might want to do some splitting, at least
> > on x86 where we have multiple vector sizes (and thus types for the
> > same element type), your first example with 6 lanes could be split
> > into a V4QImode subgraph and a V2QImode subgraph. I don't think
> > x86 has V2QImode, but just make that V4DImode and V2DImode. A
> > variant without the need for splitting would be using V2DImode
> > (with three vectors) or a variant using V4DImode and masking
> > for the second vector.
>
> Is your concern that adding !known_ge (nunits.min, group_size) to the
> conjunction in the vect_analyze_slp_instance function prevents splitting of BB
> SLP groups known to be smaller than the minimum number of lanes of any of the
> chosen vector types? Can such groups really be usefully split?
If we have 5 lanes and V8DI we can either not split and mask or
split to V4DI and a scalar (a masked V2DI should be left unvectorized).
IIRC you adjusted get_vectype_for_scalar_type to get you V8DI for
the 5 lane case while I think currently the function refuses to return
a vector with more lanes than requested.
That said, my overall concern is with how your changes work on an ISA
with multiple fixed vector lengths.
While we iterate over modes as you say above, the actual mode isn't
really enforced for BB vectorization since the number of requested
lanes is allowed to override it. So I'm not sure how we get to
a proper choice with the above V8DI vs. V4DI + scalar, or, with 7
lanes, V8DI vs. 2xV4DI.  Which is why my gut feeling tells me
that some analysis on the unsplit SLP graph is probably necessary
to arrive at an optimal vector mode / split configuration.
But I think at this point we want to get the basics working without
breaking too much of what currently works fine.
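
To make that configuration space concrete, a throwaway sketch of the shapes
involved (made-up helper, fixed-width lanes, no cost model implied):

#include <stdio.h>

/* Enumerate how GROUP_SIZE lanes could be covered when a fixed-length ISA
   offers vectors of VEC_LANES lanes; the tail then needs masking, a smaller
   vector type, or scalar code.  */
static void
report (unsigned group_size, unsigned vec_lanes)
{
  unsigned full = group_size / vec_lanes;
  unsigned tail = group_size % vec_lanes;
  printf ("%u lanes with V%u: %u full vector(s) + %u tail lane(s)\n",
          group_size, vec_lanes, full, tail);
}

int
main (void)
{
  report (5, 8); /* 0 full + 5 tail lanes: one masked V8DI */
  report (5, 4); /* 1 full V4DI + 1 tail lane: scalar, or a masked vector */
  report (7, 8); /* 0 full + 7 tail lanes: one masked V8DI */
  report (7, 4); /* 1 full V4DI + 3 tail lanes: 2xV4DI with the 2nd masked? */
  return 0;
}
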
> Let's suppose the group size is 6 and the natural vector type (for the current
> iteration of the outer loop) is V8DI.
>
> Previously, this example would have failed the following test (condition
> true):
>
> if (!max_nunits.is_constant (&const_max_nunits)
>     || const_max_nunits > group_size)
>
> which would have resulted in "Build SLP failed: store group size not a
> multiple of the vector size in basic block SLP" and vect_analyze_slp_instance
> returning false, instead of the group being split.
>
> Any split would only occur when the next iteration of the outer loop selects
> V4DI, for which !known_ge (nunits.min, group_size) is true with my changes to
> the function (because 4 < 6). Consequently, the BB SLP block would still be
> entered, and the const_max_nunits > group_size test would be repeated. This
> time it would pass (condition false) because 4 <= 6, therefore "SLP discovery
> succeeded but node needs splitting" and the group could be split into V4DImode
> and V2DImode as you described.
>
> > Your AdvSIMD substitution for the larger case could be done by
> > splitting the graph and choosing AdvSIMD for the half that does
> > not need predication but SVE for the other half.
> That's what the current implementation does.
> > That said, as long as the vector type is the same for each
> > part covering distinct lanes there is no need for splitting.
> > What I'd like to understand is whether your implementation of
> > the masking assumes that, if masking is required
> > (we padded lanes), there is exactly
> > one hardware vector for each SLP node.  Below you say
> > that's an "invariant", so that's a yes?
>
> The vect_analyze_slp_instance function only creates a new SLP instance for BB
> vectorisation with an unrolling factor not equal to one if the minimum number
> of lanes for all of the vector types is sufficient to store the whole group.
> That implies that there is exactly one hardware vector.
>
> The vect_get_num_copies function also relies on that assumption:
>
>   vf *= SLP_TREE_LANES (node);
>   tree vectype = SLP_TREE_VECTYPE (node);
>   if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
>     return 1;
>
> Otherwise, exact_div would fail in a callee, vect_get_num_vectors.
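
A fixed-width model of that logic, purely for illustration (the real code
works on poly-ints, and assuming vf starts at 1 for BB vectorisation, so
vf is just the lane count here):

#include <stdio.h>
#include <stdlib.h>

/* If one vector can hold all the lanes, a single copy is enough; otherwise
   the lane count must divide into whole vectors exactly (the exact_div
   requirement), or the group should have been rejected or split earlier.  */
static unsigned
num_copies (unsigned lanes, unsigned nunits)
{
  if (nunits >= lanes)      /* models known_ge (TYPE_VECTOR_SUBPARTS (..), vf) */
    return 1;
  if (lanes % nunits != 0)  /* models exact_div failing */
    {
      fprintf (stderr, "should have been rejected or split earlier\n");
      exit (1);
    }
  return lanes / nunits;
}

int
main (void)
{
  printf ("%u\n", num_copies (5, 8));  /* 1: a single, possibly masked, vector */
  printf ("%u\n", num_copies (16, 8)); /* 2: whole vectors, no masking needed */
  return 0;
}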
>
> My current implementation of the vect_slp_get_bb_mask function returns
> NULL_TREE (i.e. 'no mask') for all vectors of an SLP node with multiple
> vectors:
>
>   /* vect_get_num_copies only allows a partial vector if it is the only
>      vector.  */
>   if (nvectors > 1)
>     return NULL_TREE;
>
> That means the tail of such a node would not be masked correctly if it needs
> to be masked at all. Even if that guard were removed, the following statement
> would also need to be made more complex to handle cases in which the group
> size is not also the number of active lanes:
>
> tree end_index = build_int_cst (cmp_type, group_size);
Yes, I know.  Which is why I wonder about the overall design of how to
[not] represent the padding lanes.
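
For the record, in a purely fixed-width world the end index for vector
number INDEX would just be the clamped remainder of the group; a toy sketch
(made-up helper name, ignoring the VLA/poly-int side the series actually has
to handle):

#include <stdio.h>

/* Fixed-width sketch: the mask for vector number INDEX of an SLP node with
   GROUP_SIZE active lanes and NUNITS lanes per vector should end at the
   number of active lanes remaining for that vector, i.e. GROUP_SIZE minus
   INDEX * NUNITS, clamped to [0, NUNITS].  */
static unsigned
end_index_for_vector (unsigned group_size, unsigned nunits, unsigned index)
{
  long remaining = (long) group_size - (long) (index * nunits);
  if (remaining < 0)
    remaining = 0;
  if (remaining > (long) nunits)
    remaining = nunits;
  return (unsigned) remaining;
}

int
main (void)
{
  /* 26 lanes with 16 lanes per vector: vector 0 ends at 16, vector 1 at 10,
     matching the .WHILE_ULT (0, 16, ...) and .WHILE_ULT (0, 10, ...) masks
     in the dump quoted below.  */
  printf ("%u %u\n", end_index_for_vector (26, 16, 0),
          end_index_for_vector (26, 16, 1));
  return 0;
}
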
Richard.
> > I'm not sure that
> > will work out for all cases in the end. I'm fine with
> > requiring this initially but please keep in mind that we'd
> > want to lift this restriction without re-doing most of what
> > you do in a different way.
> >
> > Richard.
> >
> >> void foo (char * x, int n)
> >> {
> >>   char * vectp.14;
> >>   vector([16,16]) char * vectp_x.13;
> >>   vector([16,16]) char vect__34.12;
> >>   vector([16,16]) char vect__33.11;
> >>   char * vectp.10;
> >>   vector([16,16]) char * vectp_x.9;
> >>   char * vectp.8;
> >>   vector([16,16]) char * vectp_x.7;
> >>   vector([16,16]) char vect__2.6;
> >>   vector([16,16]) char vect__1.5;
> >>   char * vectp.4;
> >>   vector([16,16]) char * vectp_x.3;
> >>   vector([16,16]) <signed-boolean:1> slp_mask_82;
> >>   vector([16,16]) <signed-boolean:1> slp_mask_86;
> >>   vector([16,16]) <signed-boolean:1> slp_mask_89;
> >>   vector([16,16]) <signed-boolean:1> slp_mask_93;
> >>
> >>   <bb 2> [local count: 1073741824]:
> >>   vectp.4_81 = x_54(D);
> >>   slp_mask_82 = .WHILE_ULT (0, 16, { 0, ... });
> >>   vect__1.5_83 = .MASK_LOAD (vectp.4_81, 8B, slp_mask_82, { 0, ... });
> >>   vect__2.6_84 = vect__1.5_83 + { 1, 2, ... };
> >>   vectp.8_85 = x_54(D);
> >>   slp_mask_86 = .WHILE_ULT (0, 16, { 0, ... });
> >>   .MASK_STORE (vectp.8_85, 8B, slp_mask_86, vect__2.6_84);
> >>   vectp.10_88 = x_54(D) + 16;
> >>   slp_mask_89 = .WHILE_ULT (0, 10, { 0, ... });
> >>   vect__33.11_90 = .MASK_LOAD (vectp.10_88, 8B, slp_mask_89, { 0, ... });
> >>   vect__34.12_91 = vect__33.11_90 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, ... };
> >>   vectp.14_92 = x_54(D) + 16;
> >>   slp_mask_93 = .WHILE_ULT (0, 10, { 0, ... });
> >>   .MASK_STORE (vectp.14_92, 8B, slp_mask_93, vect__34.12_91);
> >>   return;
> >>
> >> }
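
As a reading aid for the dump: .WHILE_ULT (BASE, LIMIT, ...) materialises a
predicate in which lane I is active iff BASE + I < LIMIT, so slp_mask_89 and
slp_mask_93 enable only the first 10 lanes.  A scalar model, purely for
illustration:

#include <stdio.h>

/* Scalar model of .WHILE_ULT (BASE, LIMIT, ...) for a vector of NUNITS
   lanes: lane I of the resulting predicate is active iff BASE + I < LIMIT.  */
static void
while_ult (unsigned base, unsigned limit, unsigned nunits, unsigned char *mask)
{
  for (unsigned i = 0; i < nunits; i++)
    mask[i] = (base + i < limit);
}

int
main (void)
{
  unsigned char mask[16];

  /* slp_mask_89 above, at 128-bit SVE (16 byte lanes): prints
     1111111111000000, i.e. only the first 10 lanes are active.  */
  while_ult (0, 10, 16, mask);
  for (unsigned i = 0; i < 16; i++)
    printf ("%d", mask[i]);
  printf ("\n");
  return 0;
}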
> >>
> >> If advantageous, the AArch64 backend later substitutes Advanced SIMD
> >> instructions for the group that uses a variable-length vector type with
> >> a mask of a known regular length:
> >>
> >>         mov     x1, x0
> >>         mov     w2, 513
> >>         ptrue   p6.b, all
> >>         ldr     q29, [x0]             ; first load is replaced with Advanced SIMD
> >>         mov     z28.h, w2
> >>         add     z28.b, z29.b, z28.b   ; first add is done using SVE (z29.b aliases q29)
> >>         mov     x3, 10
> >>         whilelo p7.b, xzr, x3
> >>         adrp    x2, .LC0
> >>         add     x2, x2, :lo12:.LC0
> >>         ld1rqb  z30.b, p6/z, [x2]
> >>         str     q28, [x1], 16         ; first store is replaced with Advanced SIMD (q28 aliases z28.b)
> >>         ld1b    z31.b, p7/z, [x1]     ; second load is predicated SVE
> >>         add     z30.b, z31.b, z30.b   ; second add is also done using SVE
> >>         st1b    z30.b, p7, [x1]       ; second store is predicated SVE
> >>         ret
> >>
> >> With -msve-vector-bits=128 the GIMPLE produced by the vectoriser doesn't
> >> specify any masks at all, but instead splits the group of 26 into
> >> 16 + 8 + 2:
> >>
> >> void foo (char * x, int n)
> >> {
> >>   char * vectp.20;
> >>   vector(2) char * vectp_x.19;
> >>   vector(2) char vect__50.18;
> >>   vector(2) char vect__49.17;
> >>   char * vectp.16;
> >>   vector(2) char * vectp_x.15;
> >>   char * vectp.14;
> >>   vector(8) char * vectp_x.13;
> >>   vector(8) char vect__34.12;
> >>   vector(8) char vect__33.11;
> >>   char * vectp.10;
> >>   vector(8) char * vectp_x.9;
> >>   char * vectp.8;
> >>   vector(16) char * vectp_x.7;
> >>   vector(16) char vect__2.6;
> >>   vector(16) char vect__1.5;
> >>   char * vectp.4;
> >>   vector(16) char * vectp_x.3;
> >>
> >>   <bb 2> [local count: 1073741824]:
> >>   vectp.4_81 = x_54(D);
> >>   vect__1.5_82 = MEM <vector(16) char> [(char *)vectp.4_81];
> >>   vect__2.6_84 = vect__1.5_82 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
> >>   vectp.4_83 = vectp.4_81 + 10;
> >>   vectp.8_85 = x_54(D);
> >>   MEM <vector(16) char> [(char *)vectp.8_85] = vect__2.6_84;
> >>   vectp.10_87 = x_54(D) + 16;
> >>   vect__33.11_88 = MEM <vector(8) char> [(char *)vectp.10_87];
> >>   vect__34.12_90 = vect__33.11_88 + { 1, 2, 1, 2, 1, 2, 1, 2 };
> >>   vectp.10_89 = x_54(D) + 34;
> >>   vectp.14_91 = x_54(D) + 16;
> >>   MEM <vector(8) char> [(char *)vectp.14_91] = vect__34.12_90;
> >>   vectp.16_93 = x_54(D) + 24;
> >>   vect__49.17_94 = MEM <vector(2) char> [(char *)vectp.16_93];
> >>   vect__50.18_96 = vect__49.17_94 + { 1, 2 };
> >>   vectp.16_95 = x_54(D) + 48;
> >>   _49 = MEM[(char *)x_54(D) + 24B];
> >>   _50 = _49 + 1;
> >>   _51 = MEM[(char *)x_54(D) + 25B];
> >>   _52 = _51 + 2;
> >>   vectp.20_97 = x_54(D) + 24;
> >>   MEM <vector(2) char> [(char *)vectp.20_97] = vect__50.18_96;
> >>   return;
> >>
> >> }
> >>
> >> The AArch64 backend still uses SVE if available though:
> >>
> >>         adrp    x1, .LC0
> >>         ldr     d29, [x0, 16]     ; load the middle 8 bytes using Advanced SIMD
> >>         ptrue   p7.b, vl16        ; this SVE mask is actually for 2 lanes, when interpreted as doubles later!
> >>         ldr     q27, [x0]         ; load the first 16 bytes using Advanced SIMD
> >>         index   z30.d, #1, #1
> >>         ldr     d28, [x1, #:lo12:.LC0]
> >>         adrp    x1, .LC1
> >>         ldr     q26, [x1, #:lo12:.LC1]
> >>         add     v28.8b, v29.8b, v28.8b     ; add the middle 8 bytes using Advanced SIMD (v29.8b aliases d29)
> >>         add     x1, x0, 24                 ; offset to the last two bytes [x0, 24] and [x0, 25]
> >>         add     v26.16b, v27.16b, v26.16b  ; add the first 16 bytes using Advanced SIMD (v27.16b aliases q27)
> >>         str     d28, [x0, 16]     ; store the middle 8 bytes using Advanced SIMD
> >>         str     q26, [x0]         ; store the first 16 bytes using Advanced SIMD
> >>         ld1b    z31.d, p7/z, [x1] ; load the last two bytes using SVE
> >>         add     z30.b, z31.b, z30.b
> >>         st1b    z30.d, p7, [x1]   ; store the last two bytes using SVE
> >>         ret
> >>
> >> So you see there is only a loose relationship between GIMPLE vector
> >> types and instructions chosen by the backend.
> >>
> >>> you'd have three V2DImode vectors, the last with one masked lane and two
> >>> V4SImode vectors, the last with three masked lanes.
> >>> The 2nd V2DImode -> V4SImode (2nd because two output vectors)
> >>> conversion expects two V2DImode inputs because it uses two-to-one
> >>> vector pack instructions. But the 2nd V2DImode input does not exist.
> >> I'm not familiar with other CPU architectures, but I suspect they are
> >> neither helped nor hindered by my change.
> >>
> >>> That said, downthread you have comments that only a single vector
> >>> element is supported when using masked operation (I don't remember
> >>> exactly where). So you are hoping that the group splitting provides
> >>> you with a fully "leaf" situation here?
> >> I think it's an invariant.
> >>> Keep in mind that splitting is not always a good option, like with
> >>>
> >>> a[0] = b[0];
> >>> a[1] = b[2];
> >>> a[2] = b[1];
> >>> a[3] = b[3];
> >>>
> >>> we do not split along V2DImode boundaries, but having 2xV2DImode
> >>> allows us to handle the loads efficiently with shuffling. Similar
> >>> situations may arise when there are vector parts.
> >>>
> >>> That said, if you think the current limitation to leafs does not
> >>> restrict us design-wise then it's an OK initial limitation.
> >> Thanks!
> >>
> >>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)