On Thu, 6 Nov 2025, Christopher Bazley wrote:
>
> On 05/11/2025 12:25, Richard Biener wrote:
> > On Tue, 4 Nov 2025, Christopher Bazley wrote:
> >
> >> On 28/10/2025 13:29, Richard Biener wrote:
> >>>> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
> >>>> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> >>>> +   Masking is only required for the tail, therefore NULL_TREE is returned for
> >>>> +   every value of INDEX except the last.  Insert any set-up statements before
> >>>> +   GSI.  */
> >>> I think it might happen that some vectors are fully masked, say for
> >>> a conversion from double to int and V2DImode vs. V4SImode when we
> >>> have 5 lanes the conversion likely expects 4 V2DImode inputs to
> >>> produce 2 V4SImode outputs, but the 4th V2DImode input has no active
> >>> lanes at all.
> >>>
> >>> But maybe you handle this situation differently, I'll see.
> >> You hypothesise a conversion from 4 of V2DI = 8DI (8DI - 5DI = 3DI inactive,
> >> and floor(3DI / 2DI) = 1 of V2DI fully masked) to 2 of V4SI = 8SI (8SI - 5SI =
> >> 3SI inactive and floor(3SI / 4SI) = 0 of V4SI fully masked).
> >>
> >> I don't think that the "1 of V2DI is fully masked" would ever happen though,
> >> because a group of 5DI would be split long before the vectoriser attempts to
> >> materialise masks.  The only reason that a group of 5DI might be allowed to
> >> survive that long would be if the number of subparts of the natural vector
> >> type (the one currently being tried by vect_slp_region) were at least 5, a
> >> factor of 5, or both.  No such vector types exist.
> >>
> >> For example, consider this translation unit:
> >>
> >> #include <stdint.h>
> >>
> >> void convert(const uint64_t (*const di)[5], uint32_t (*const si)[5])
> >> {
> >>   (*si)[0] = (*di)[0];
> >>   (*si)[1] = (*di)[1];
> >>   (*si)[2] = (*di)[2];
> >>   (*si)[3] = (*di)[3];
> >>   (*si)[4] = (*di)[4];
> >> }
> >>
> >> It is compiled (with -O2 -ftree-vectorize -march=armv9-a+sve
> >> --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable) as:
> >>
> >> convert:
> >> .LFB0:
> >>         .cfi_startproc
> >>         ldp     q30, q31, [x0]    ; vector load the first four lanes
> >>         ptrue   p7.d, vl2         ; enable two lanes for vector stores
> >>         add     x2, x1, 8
> >>         ldr     x0, [x0, 32]      ; load the fifth lane
> >>         st1w    z30.d, p7, [x1]   ; store least-significant 32 bits of the first two lanes
> >>         st1w    z31.d, p7, [x2]   ; store least-significant 32 bits of lanes 3 and 4
> >>         str     w0, [x1, 16]      ; store least-significant 32 bits of fifth lane
> >>         ret
> >>         .cfi_endproc
> >>
> >> The slp2 dump shows:
> >>
> >> note: Starting SLP discovery for
> >> note: (*si_13(D))[0] = _2;
> >> note: (*si_13(D))[1] = _4;
> >> note: (*si_13(D))[2] = _6;
> >> note: (*si_13(D))[3] = _8;
> >> note: (*si_13(D))[4] = _10;
> >> note: Created SLP node 0x4bd9e00
> >> note: starting SLP discovery for node 0x4bd9e00
> >> note: get vectype for scalar type (group size 5): uint32_t
> >> note: get_vectype_for_scalar_type: natural type for uint32_t (ignoring group size 5): vector([4,4]) unsigned int
> >> note: vectype: vector([4,4]) unsigned int
> >> note: nunits = [4,4]
> >> missed: Build SLP failed: unrolling required in basic block SLP
> >>
> >> This fails the check in vect_record_nunits because the group size of 5
> >> may be larger than the number of subparts of vector([4,4]) unsigned int
> >> (which could be as low as 4) and 5 is never an integral multiple of [4,4].
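> >>
> >> In poly-int terms, the condition amounts to something like this (an
> >> illustrative sketch of the logic described above, not the actual
> >> vect_record_nunits source):
> >>
> >>   /* Sketch only.  With nunits = [4,4] (i.e. 4 + 4*x lanes) and a group
> >>      size of 5, neither test holds: 5 is never a multiple of 4 + 4*x,
> >>      and for x == 0 the vector has only 4 lanes, so the group might not
> >>      fit in a single vector.  */
> >>   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);  /* [4,4] */
> >>   unsigned HOST_WIDE_INT group_size = 5;
> >>   bool fits = multiple_p (group_size, nunits)
> >>               || known_le (group_size, nunits);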
> >>
> >> The vectoriser therefore splits the group of 5SI into 4SI + 1SI:
> > I had the impression the intent of this series is to _not_ split the
> > groups in this case. On x86 with V2DImode / V4SImode (aka SSE2)
> Not exactly. Richard Sandiford did tell me (months ago) that this task is
> about trying to avoid splitting, but I think that is not the whole story.
> Richard's initial example of a function that is not currently vectorised, but
> could be with tail-predication, was:
>
> void
> foo (char *x, int n)
> {
>   x[0] += 1;
>   x[1] += 2;
>   x[2] += 1;
>   x[3] += 2;
>   x[4] += 1;
>   x[5] += 2;
> }
>
> A group of 6QI such as that shown in the function above would not need to be
> split because each lane is only one byte wide, not a double word (unlike in
> your example of a conversion from 5DF to 5SI). A group of 6QI can always be
> stored in one vector of type VNx16QI, because VNx16QI's minimum number of
> lanes is 16.
>
> ptrue p7.b, vl6
> ptrue p6.b, all
> ld1b z31.b, p7/z, [x0] ; one predicated load
> adrp x1, .LC0
> add x1, x1, :lo12:.LC0
> ld1rqb z30.b, p6/z, [x1]
> add z30.b, z31.b, z30.b
> st1b z30.b, p7, [x0] ; one predicated store
> ret
>
> If some target architecture provides both VNx8DF and VNx8SI then your example
> conversion wouldn't result in a split either because the group size of 5 would
> certainly be smaller than the number of subparts of vector([8,8]) double and
> the fact that 5 is not an integral multiple of [8,8] would be irrelevant. (SVE
> doesn't provide either type in implementations that I'm aware of.)
>
> However, I believe it could also be beneficial to be able to vectorise
> functions with more than a small number of operations in them (e.g., 26
> instead of 6 operations):
>
> void
> foo (char *x, int n)
> {
>   x[0] += 1;
>   x[1] += 2;
>   x[2] += 1;
>   x[3] += 2;
>   x[4] += 1;
>   x[5] += 2;
>   x[6] += 1;
>   x[7] += 2;
>   x[8] += 1;
>   x[9] += 2;
>   x[10] += 1;
>   x[11] += 2;
>   x[12] += 1;
>   x[13] += 2;
>   x[14] += 1;
>   x[15] += 2;
>   x[16] += 1;
>   x[17] += 2;
>   x[18] += 1;
>   x[19] += 2;
>   x[20] += 1;
>   x[21] += 2;
>   x[22] += 1;
>   x[23] += 2;
>   x[24] += 1;
>   x[25] += 2;
> }
>
> Admittedly, such cases are probably rarer than small groups in real code.
>
> In such cases, even a group of byte-size operations might need to be split in
> order to be vectorised. e.g., a group of 26QI additions could be vectorised
> with VNx16QI as 16QI + 10QI. A mask would be generated for both groups:
Note you say "split" and mean you have two vector operations in the end.
But with "split" I refer to the split into two different SLP graphs;
usually, even with BB vectorization, a single SLP node can happily
represent multiple vectors (with the same vector type) when necessary
to fill all lanes.

That said, I agree we still might want to do some splitting: at least
on x86, where we have multiple vector sizes (and thus multiple types
for the same element type), your first example with 6 lanes could be
split into a V4QImode subgraph and a V2QImode subgraph.  I don't think
x86 has V2QImode, but just make that V4DImode and V2DImode.  A
variant without the need for splitting would be using V2DImode
(with three vectors) or a variant using V4DImode and masking
for the second vector.

Your AdvSIMD substitution for the larger case could be done by
splitting the graph and choosing AdvSIMD for the half that does
not need predication but SVE for the other half.

That said, as long as the vector type is the same for each
part covering distinct lanes there is no need for splitting.
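
To spell out the lane coverage for that 6-lane case (an illustration
only, using uint64_t elements so that the V4DImode/V2DImode types apply):

  /* Six DI lanes; the coverings discussed above:
     (a) split: a V4DImode subgraph (lanes 0-3) plus a V2DImode
         subgraph (lanes 4-5);
     (b) no split, one type: three V2DImode vectors covering lanes
         0-1, 2-3 and 4-5;
     (c) no split, masking: two V4DImode vectors, the second with its
         upper two lanes inactive.  */
  void
  foo (unsigned long long *x)
  {
    x[0] += 1;
    x[1] += 2;
    x[2] += 1;
    x[3] += 2;
    x[4] += 1;
    x[5] += 2;
  }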

What I'd like to understand is whether your implementation of the
masking assumes that, if masking is required (we padded lanes), there
is exactly one hardware vector for each SLP node.  Below you say
that's an "invariant", so that's a yes?  I'm not sure that
will work out for all cases in the end.  I'm fine with
requiring this initially, but please keep in mind that we'd
want to lift this restriction without re-doing most of what
you do in a different way.

Richard.
> void foo (char * x, int n)
> {
>   char * vectp.14;
>   vector([16,16]) char * vectp_x.13;
>   vector([16,16]) char vect__34.12;
>   vector([16,16]) char vect__33.11;
>   char * vectp.10;
>   vector([16,16]) char * vectp_x.9;
>   char * vectp.8;
>   vector([16,16]) char * vectp_x.7;
>   vector([16,16]) char vect__2.6;
>   vector([16,16]) char vect__1.5;
>   char * vectp.4;
>   vector([16,16]) char * vectp_x.3;
>   vector([16,16]) <signed-boolean:1> slp_mask_82;
>   vector([16,16]) <signed-boolean:1> slp_mask_86;
>   vector([16,16]) <signed-boolean:1> slp_mask_89;
>   vector([16,16]) <signed-boolean:1> slp_mask_93;
>
>   <bb 2> [local count: 1073741824]:
>   vectp.4_81 = x_54(D);
>   slp_mask_82 = .WHILE_ULT (0, 16, { 0, ... });
>   vect__1.5_83 = .MASK_LOAD (vectp.4_81, 8B, slp_mask_82, { 0, ... });
>   vect__2.6_84 = vect__1.5_83 + { 1, 2, ... };
>   vectp.8_85 = x_54(D);
>   slp_mask_86 = .WHILE_ULT (0, 16, { 0, ... });
>   .MASK_STORE (vectp.8_85, 8B, slp_mask_86, vect__2.6_84);
>   vectp.10_88 = x_54(D) + 16;
>   slp_mask_89 = .WHILE_ULT (0, 10, { 0, ... });
>   vect__33.11_90 = .MASK_LOAD (vectp.10_88, 8B, slp_mask_89, { 0, ... });
>   vect__34.12_91 = vect__33.11_90 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, ... };
>   vectp.14_92 = x_54(D) + 16;
>   slp_mask_93 = .WHILE_ULT (0, 10, { 0, ... });
>   .MASK_STORE (vectp.14_92, 8B, slp_mask_93, vect__34.12_91);
>   return;
>
> }
>
> If advantageous, the AArch64 backend later substitutes Advanced SIMD
> instructions for the group that uses a variable-length vector type with a
> mask of a known regular length:
>
>         mov     x1, x0
>         mov     w2, 513
>         ptrue   p6.b, all
>         ldr     q29, [x0]            ; first load is replaced with Advanced SIMD
>         mov     z28.h, w2
>         add     z28.b, z29.b, z28.b  ; first add is done using SVE (z29.b aliases q29)
>         mov     x3, 10
>         whilelo p7.b, xzr, x3
>         adrp    x2, .LC0
>         add     x2, x2, :lo12:.LC0
>         ld1rqb  z30.b, p6/z, [x2]
>         str     q28, [x1], 16        ; first store is replaced with Advanced SIMD (q28 aliases z28.b)
>         ld1b    z31.b, p7/z, [x1]    ; second load is predicated SVE
>         add     z30.b, z31.b, z30.b  ; second add is also done using SVE
>         st1b    z30.b, p7, [x1]      ; second store is predicated SVE
>         ret
>
> With -msve-vector-bits=128 the GIMPLE produced by the vectoriser doesn't
> specify any masks at all, but instead splits the group of 26 into 16 + 8 + 2:
>
> void foo (char * x, int n)
> {
>   char * vectp.20;
>   vector(2) char * vectp_x.19;
>   vector(2) char vect__50.18;
>   vector(2) char vect__49.17;
>   char * vectp.16;
>   vector(2) char * vectp_x.15;
>   char * vectp.14;
>   vector(8) char * vectp_x.13;
>   vector(8) char vect__34.12;
>   vector(8) char vect__33.11;
>   char * vectp.10;
>   vector(8) char * vectp_x.9;
>   char * vectp.8;
>   vector(16) char * vectp_x.7;
>   vector(16) char vect__2.6;
>   vector(16) char vect__1.5;
>   char * vectp.4;
>   vector(16) char * vectp_x.3;
>
>   <bb 2> [local count: 1073741824]:
>   vectp.4_81 = x_54(D);
>   vect__1.5_82 = MEM <vector(16) char> [(char *)vectp.4_81];
>   vect__2.6_84 = vect__1.5_82 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };
>   vectp.4_83 = vectp.4_81 + 10;
>   vectp.8_85 = x_54(D);
>   MEM <vector(16) char> [(char *)vectp.8_85] = vect__2.6_84;
>   vectp.10_87 = x_54(D) + 16;
>   vect__33.11_88 = MEM <vector(8) char> [(char *)vectp.10_87];
>   vect__34.12_90 = vect__33.11_88 + { 1, 2, 1, 2, 1, 2, 1, 2 };
>   vectp.10_89 = x_54(D) + 34;
>   vectp.14_91 = x_54(D) + 16;
>   MEM <vector(8) char> [(char *)vectp.14_91] = vect__34.12_90;
>   vectp.16_93 = x_54(D) + 24;
>   vect__49.17_94 = MEM <vector(2) char> [(char *)vectp.16_93];
>   vect__50.18_96 = vect__49.17_94 + { 1, 2 };
>   vectp.16_95 = x_54(D) + 48;
>   _49 = MEM[(char *)x_54(D) + 24B];
>   _50 = _49 + 1;
>   _51 = MEM[(char *)x_54(D) + 25B];
>   _52 = _51 + 2;
>   vectp.20_97 = x_54(D) + 24;
>   MEM <vector(2) char> [(char *)vectp.20_97] = vect__50.18_96;
>   return;
> }
>
> The AArch64 backend still uses SVE if available though:
>
>         adrp    x1, .LC0
>         ldr     d29, [x0, 16]        ; load the middle 8 bytes using Advanced SIMD
>         ptrue   p7.b, vl16           ; this SVE mask is actually for 2 lanes, when interpreted as doubles later!
>         ldr     q27, [x0]            ; load the first 16 bytes using Advanced SIMD
>         index   z30.d, #1, #1
>         ldr     d28, [x1, #:lo12:.LC0]
>         adrp    x1, .LC1
>         ldr     q26, [x1, #:lo12:.LC1]
>         add     v28.8b, v29.8b, v28.8b    ; add the middle 8 bytes using Advanced SIMD (v29.8b aliases d29)
>         add     x1, x0, 24           ; offset to the last two bytes [x0,24] and [x0,25]
>         add     v26.16b, v27.16b, v26.16b ; add the first 16 bytes using Advanced SIMD (v27.16b aliases q27)
>         str     d28, [x0, 16]        ; store the middle 8 bytes using Advanced SIMD
>         str     q26, [x0]            ; store the first 16 bytes using Advanced SIMD
>         ld1b    z31.d, p7/z, [x1]    ; load the last two bytes using SVE
>         add     z30.b, z31.b, z30.b
>         st1b    z30.d, p7, [x1]      ; store the last two bytes using SVE
>         ret
>
> So you see there is only a loose relationship between GIMPLE vector types and
> instructions chosen by the backend.
>
> > you'd have three V2DImode vectors, the last with one masked lane and two
> > V4SImode vectors, the last with three masked lanes.
> > The 2nd V2DImode -> V4SImode (2nd because two output vectors)
> > conversion expects two V2DImode inputs because it uses two-to-one
> > vector pack instructions. But the 2nd V2DImode input does not exist.
>
> I'm not familiar with other CPU architectures, but I suspect they are neither
> helped nor hindered by my change.
>
> > That said, downthread you have comments that only a single vector
> > element is supported when using masked operation (I don't remember
> > exactly where). So you are hoping that the group splitting provides
> > you with a fully "leaf" situation here?
> I think it's an invariant.
> > Keep in mind that splitting is not always a good option, like with
> >
> > a[0] = b[0];
> > a[1] = b[2];
> > a[2] = b[1];
> > a[3] = b[3];
> >
> > we do not split along V2DImode boundaries but having 2xV2DImode
> > allows to handle the loads efficiently with shuffling. Similar
> > situations may arise when there's vector parts.
> >
> > That said, if you think the current limitation to leafs does not
> > restrict us design-wise then it's an OK initial limitation.
> Thanks!
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)