Richard Biener <rguent...@suse.de> writes:
> On Thu, 8 May 2025, Pengfei Li wrote:
>
>> This patch improves the auto-vectorization for loops with known small
>> trip counts by enabling the use of subvectors - bit fields of original
>> wider vectors. A subvector must have the same vector element type as the
>> original vector and enough bits for all vector elements to be processed
>> in the loop. Using subvectors is beneficial because machine instructions
>> operating on narrower vectors usually show better performance.
>> 
>> To enable this optimization, this patch introduces a new target hook.
>> This hook allows the vectorizer to query the backend for a suitable
>> subvector type given the original vector type and the number of elements
>> to be processed in the small-trip-count loop. The target hook also has a
>> could_trap parameter to say if the subvector is allowed to have more
>> bits than needed.
>> 
>> This optimization is currently enabled for AArch64 only. Below example
>> shows how it uses AdvSIMD vectors as subvectors of SVE vectors for
>> higher instruction throughput.
>> 
>> Consider this loop operating on an array of 16-bit integers:
>> 
>>      for (int i = 0; i < 5; i++) {
>>        a[i] = a[i] < 0 ? -a[i] : a[i];
>>      }
>> 
>> Before this patch, the generated AArch64 code would be:
>> 
>>      ptrue   p7.h, vl5
>>      ptrue   p6.b, all
>>      ld1h    z31.h, p7/z, [x0]
>>      abs     z31.h, p6/m, z31.h
>>      st1h    z31.h, p7, [x0]
>
> p6.b has all lanes active - why is the abs then not
> simply unmasked?

There is no unpredicated abs for SVE.  The predicate has to be there,
and so expand introduces one even when the gimple stmt is unconditional.

>> After this patch, it is optimized to:
>> 
>>      ptrue   p7.h, vl5
>>      ld1h    z31.h, p7/z, [x0]
>>      abs     v31.8h, v31.8h
>>      st1h    z31.h, p7, [x0]
>
> Help me decipher this - I suppose z31 and v31 "overlap" in the
> register file?  And z31 is a variable-length vector but
> z31.8h is a 8 element fixed length vector?  How can we

v31.8h, but otherwise yes.

> end up with just 8 elements here?  From the upper interation
> bound?

Yeah.

> I'm not sure why you need any target hook here.  It seems you
> do already have suitable vector modes so why not just ask
> for a suitable vector?  Is it because you need to have
> that register overlap guarantee (otherwise you'd get
> a move)?

Yeah, the optimisation only makes sense for overlaid vector registers.

> Why do we not simply use fixed-length SVE here in the first place?

Fixed-length SVE is restricted to cases where the exact runtime length
is known: the compile-time length is both a minimum and a maximum.
In contrast, the code above would work even for 256-bit SVE.

> To me doing this in this way in the vectorizer looks
> somewhat out-of-place.
>
> That said, we already have unmasked ABS in the IL:
>
>   vect__1.6_15 = .MASK_LOAD (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, ... }, { 0, ... });
>   vect__2.7_16 = ABSU_EXPR <vect__1.6_15>;
>   vect__3.8_17 = VIEW_CONVERT_EXPR<vector([8,8]) short int>(vect__2.7_16);
>   .MASK_STORE (&a, 16B, { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, ... }, vect__3.8_17); [tail call]
>
> so what's missing here?  I suppose having a constant masked ABSU here
> would allow RTL expansion to select a fixed-size mode?
>
> And the vectorizer could simply use the existing
> related_vector_mode hook instead?

I agree it's a bit awkward.  The problem is that we want conflicting
things.  On the one hand, it would make conceptual sense to use SVE
instructions to provide conditional optabs for Advanced SIMD vector modes.
E.g. SVE's LD1W could act as a predicated load for an Advanced SIMD
int32x4_t vector.  The main problem with that is that Advanced SIMD's
native boolean vector type is an integer vector of 0s and -1s, rather
than an SVE predicate.  For some (native Advanced SIMD) operations we'd
want one type of boolean, for some (SVE emulating Advanced SIMD)
operations we'd want the other type of boolean.

The patch goes the other way and treats using Advanced SIMD as an
optimisation for SVE loops.

related_vector_mode suffers from the same problem.  If we ask for a
vector mode of >=5 halfwords for a load or store, we want the SVE mode,
since that can be conditional on an SVE predicate.  But if we ask for
a vector mode of >=5 halfwords for an integer absolute operation,
we want the Advanced SIMD mode.  So I suppose the new hook is effectively
providing context.  Perhaps we could do that using an extra parameter to
related_vector_mode, if that seems better.

It's somewhat difficult to recover this information after vectorisation,
since like you say, the statements are often unconditional and operate
on all lanes.

Thanks,
Richard

Reply via email to