Tamar Christina <[email protected]> writes:
> Hi All,
>
> In this patch series I'm adding support for zero extending using permutes
> instead of requiring multi-step decomposition.
>
> This codegen has the benefit of needing fewer instructions and having much
> higher throughput than uxtl. We previously replaced pairs of uxtl/uxtl2s with
> ZIPs to increase throughput, but latency was still an issue due to the
> dependency chain created.
>
> To fix this we can now use TBLs.  The indexes are loaded outside of the loop
> and can be shared amongst zero extends of the same type.  The additional
> benefit of this is that if the values are being permuted after the extensions,
> they will be simplified and merged, leading to better overall code.
>
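FWIW, for anyone following along, the index constants are the interesting
part: each destination lane takes one byte from the source and fills the
remaining bytes with an out-of-range index, which TBL turns into zero.
A rough intrinsics equivalent for a .B->.S extend of the low four bytes
would be something like the below (untested sketch, little-endian lane
layout assumed, names made up):

  #include <arm_neon.h>

  /* Zero-extend the low four bytes of SRC into four 32-bit lanes using a
     single TBL.  The 0xff entries are out-of-range indices and therefore
     select zero, which provides the zero extension.  */
  uint32x4_t
  zext_lo_u8_u32 (uint8x16_t src)
  {
    static const uint8_t idx[16]
      = { 0, 0xff, 0xff, 0xff, 1, 0xff, 0xff, 0xff,
          2, 0xff, 0xff, 0xff, 3, 0xff, 0xff, 0xff };
    return vreinterpretq_u32_u8 (vqtbl1q_u8 (src, vld1q_u8 (idx)));
  }

The patch of course materialises these as literal pool constants (hence
the .LC scans in the test) rather than going through intrinsics.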
> e.g. an LD4 can be replaced with LDR since the permutes being performed for
> the extensions can be merged with the load permutes. The way LOAD_LANES is
> currently implemented means this can't be done in GCC yet, but I'm aiming for
> this in GCC 15.
>
> I've additionally only added support for non-VLA.  The problem with VLA is
> that the index registers are hard or impossible to construct.
>
> On Adv. SIMD we use -1 to indicate an out-of-range index so we can transform
> the two-register TBL into a one-register one.  However on e.g. a byte array,
> on VLA 255 would be a valid entry at VL > 2048 bits, which means that's
> already not a transformation we can make.
>
> Secondly the actual mask looks something like { x, x, x, n, n+1, x, x, x, n+2, n+3 }
> and while I think I can represent this in vect_perm_builder, I couldn't think
> of any really efficient VLA way to build such masks; it would require a lot
> of setup code.
>
> Lastly I don't think this transformation would make much sense for SVE, as SVE
> has loads and converts that can do multi-step conversions.  For instance the
> loop below is already pretty good for SVE (though it misses that the load
> could do more than one step, presumably because a single extend is only
> merged in RTL).
>
> While I tried hard, for these reasons I don't support VLA, which I hope is ok.
Yeah, I agree we can skip VLA.  For .B->.D, we could in principle use
a TBL .B with a mask created by:

  INDEX Zmask.D, #0, #1

and use INCD to generate subsequent vectors.  We could then mask the
result with:

  AND Zresult.D, Zresult.D, #0xff

which still saves one level of unpacks.  But like you say, it doesn't
scale to VL>2048 bits, and setting up each index, although relatively
simple individually, is going to mount up.
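For the record, in intrinsics terms that sequence would be roughly the
following (untested sketch, the function name is made up, and it assumes
the byte indices stay below 256, i.e. VL <= 2048 bits):

  #include <arm_sve.h>

  /* Zero-extend bytes BASE, BASE+1, ... of SRC into the doubleword lanes
     of the result.  Viewed as bytes, the INDEX result puts the wanted
     byte index in the low byte of each doubleword; the TBL then also
     copies byte 0 of SRC into the upper bytes of each lane, which the
     AND clears.  */
  svuint64_t
  zext_b_to_d (svuint8_t src, uint64_t base)
  {
    svuint8_t sel = svreinterpret_u8_u64 (svindex_u64 (base, 1));
    svuint8_t perm = svtbl_u8 (src, sel);
    return svand_n_u64_x (svptrue_b64 (),
                          svreinterpret_u64_u8 (perm), 0xff);
  }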
> Concretely on AArch64 this changes:
>
> void test4(unsigned char *x, long long *y, int n) {
> for(int i = 0; i < n; i++) {
> y[i] = x[i];
> }
> }
>
> from generating:
>
> .L4:
> ldr q30, [x4], 16
> add x3, x3, 128
> zip1 v1.16b, v30.16b, v31.16b
> zip2 v30.16b, v30.16b, v31.16b
> zip1 v2.8h, v1.8h, v31.8h
> zip1 v0.8h, v30.8h, v31.8h
> zip2 v1.8h, v1.8h, v31.8h
> zip2 v30.8h, v30.8h, v31.8h
> zip1 v26.4s, v2.4s, v31.4s
> zip1 v29.4s, v0.4s, v31.4s
> zip1 v28.4s, v1.4s, v31.4s
> zip1 v27.4s, v30.4s, v31.4s
> zip2 v2.4s, v2.4s, v31.4s
> zip2 v0.4s, v0.4s, v31.4s
> zip2 v1.4s, v1.4s, v31.4s
> zip2 v30.4s, v30.4s, v31.4s
> stp q26, q2, [x3, -128]
> stp q28, q1, [x3, -96]
> stp q29, q0, [x3, -64]
> stp q27, q30, [x3, -32]
> cmp x4, x5
> bne .L4
>
> and instead we get:
>
> .L4:
> add x3, x3, 128
> ldr q23, [x4], 16
> tbl v5.16b, {v23.16b}, v31.16b
> tbl v4.16b, {v23.16b}, v30.16b
> tbl v3.16b, {v23.16b}, v29.16b
> tbl v2.16b, {v23.16b}, v28.16b
> tbl v1.16b, {v23.16b}, v27.16b
> tbl v0.16b, {v23.16b}, v26.16b
> tbl v22.16b, {v23.16b}, v25.16b
> tbl v23.16b, {v23.16b}, v24.16b
> stp q5, q4, [x3, -128]
> stp q3, q2, [x3, -96]
> stp q1, q0, [x3, -64]
> stp q22, q23, [x3, -32]
> cmp x4, x5
> bne .L4
>
> Which results in up to 40% performance uplift on certain workloads.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
> (TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.
LGTM, but...
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
> return false;
> }
>
> +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
> +   permutes, when available, are beneficial when doing more than a one-step
> +   conversion.  */
> +
> +static bool
> +aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> +{
> +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit
> +     shuffles into a 128 bit vector type.  So for now reject it.  */
> +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> +                GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> +    return false;
...I think we should explicitly reject VLA vectors here, since it isn't
a documented part of the interface that the vectors must be constant sized
(and IMO that's a good thing).
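Something along these lines is what I had in mind (untested sketch):

  /* Reject VLA vectors; the permute sequences are only supported for
     fixed-length vectors.  */
  if (!TYPE_VECTOR_SUBPARTS (in_type).is_constant ()
      || !TYPE_VECTOR_SUBPARTS (out_type).is_constant ())
    return false;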
Also, it would be good to have SVE tests with -msve-vector-bits=512 or
something.
Thanks,
Richard
> +
> +  auto bitsize_in = element_precision (in_type);
> +  auto bitsize_out = element_precision (out_type);
> +
> +  /* We don't want to use the permutes for a single widening step because
> +     we're picking there between zip and tbl sequences with the same
> +     throughput and latencies.  However the zip doesn't require a mask and
> +     uses fewer registers, so we prefer that.  */
> +  return (bitsize_out / bitsize_in) > 2;
> +}
> +
> /* Return 1 if pseudo register should be created and used to hold
> GOT address for PIC code. */
>
> @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
> #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
> #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
> aarch64_conditional_operation_is_expensive
> +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> + aarch64_use_permute_for_promotion
> #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
> aarch64_empty_mask_is_expensive
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test1(unsigned short *x, double *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + unsigned short a = x[i*4+0];
> + unsigned short b = x[i*4+1];
> + unsigned short c = x[i*4+2];
> + unsigned short d = x[i*4+3];
> + y[i] = (double)a + (double)b + (double)c + (double)d;
> + }
> +}
> +
> +void test2(unsigned char *x, double *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + unsigned short a = x[i*4+0];
> + unsigned short b = x[i*4+1];
> + unsigned short c = x[i*4+2];
> + unsigned short d = x[i*4+3];
> + y[i] = (double)a + (double)b + (double)c + (double)d;
> + }
> +}
> +
> +void test3(unsigned short *x, double *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + unsigned int a = x[i];
> + y[i] = (double)a;
> + }
> +}
> +
> +void test4(unsigned short *x, long long *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + y[i] = x[i];
> + }
> +}
> +
> +void test5(unsigned int *x, long long *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + y[i] = x[i];
> + }
> +}
> +
> +void test6(unsigned char *x, long long *y, int n) {
> + for(int i = 0; i < (n & -8); i++) {
> + y[i] = x[i];
> + }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */