On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on g...@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly
> used pattern: reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of
> operands 1 and 2 and reduce them while widening along the way up to SImode.
> So the inputs are V16QImode and the output is V4SImode.
>
> I've tried out a few different strategies for that; the one I settled on is
> to emit:
> UABDL2 tmp.8h, op1.16b, op2.16b
> UABAL  tmp.8h, op1.8b, op2.8b
> UADALP op3.4s, tmp.8h
>
> To work through the semantics, let's say operands 1 and 2 are:
> op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
> op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute
> differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]),
>       ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute
> differences, widens them and accumulates them into the V8HImode tmp from the
> previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into
> the V4SImode op3:
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
>
> Remember, according to [1] the exact reduction sequence doesn't matter
> (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example:
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short
> (3 instructions), and in the future we can potentially look to optimise
> multiple occurrences of these into something even faster (for example
> accumulating into H registers for longer before doing a single UADALP at
> the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular
> sequence, please let me know or, even better, propose a patch to
> parametrise the generation sequence by code (or the appropriate RTX cost).
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two
> steps and performing V4SI arithmetic on them.
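To make the walkthrough above concrete, here is a scalar C model of what the UABDL2/UABAL/UADALP sequence computes. This is purely illustrative (the function name and the tmp/op1/op2/op3 naming just mirror the description above); it is not code from the patch.

#include <stdint.h>
#include <stdlib.h>

/* Scalar model of:
     UABDL2 tmp.8h, op1.16b, op2.16b
     UABAL  tmp.8h, op1.8b,  op2.8b
     UADALP op3.4s, tmp.8h  */
static void
usad_v16qi_model (uint32_t op3[4], const uint8_t op1[16], const uint8_t op2[16])
{
  uint16_t tmp[8];

  /* UABDL2: widening absolute differences of the upper eight bytes.  */
  for (int i = 0; i < 8; i++)
    tmp[i] = abs (op1[i + 8] - op2[i + 8]);

  /* UABAL: accumulate the widened absolute differences of the lower
     eight bytes into tmp.  */
  for (int i = 0; i < 8; i++)
    tmp[i] += abs (op1[i] - op2[i]);

  /* UADALP: pairwise widen tmp and accumulate into the V4SI op3.  */
  for (int i = 0; i < 4; i++)
    op3[i] += (uint32_t) tmp[2 * i] + tmp[2 * i + 1];
}

Each element of op3 ends up holding its old value plus four of the sixteen absolute differences, matching the op3 expansion spelled out above.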
> So, for the code:
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about an 8% improvement on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
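A standalone runtime check in the same spirit as the new usad-run.c execute test could look like the following. This is only a sketch, not the actual contents of the test added by the patch; it simply compares the vectorisable loop above against a branchy scalar computation of the same sum.

#include <stdlib.h>

#define N 16

unsigned char pix1[N], pix2[N];

/* The vectorisable reduction from the example above.  */
__attribute__ ((noinline)) int
foo (void)
{
  int i_sum = 0;
  for (int i = 0; i < N; i++)
    i_sum += __builtin_abs (pix1[i] - pix2[i]);
  return i_sum;
}

int
main (void)
{
  int expected = 0;

  for (int i = 0; i < N; i++)
    {
      pix1[i] = (unsigned char) (i * 17);
      pix2[i] = (unsigned char) (255 - i * 3);

      /* Scalar reference: accumulate the absolute difference without
         relying on __builtin_abs.  */
      int d = pix1[i] - pix2[i];
      expected += d < 0 ? -d : d;
    }

  if (foo () != expected)
    abort ();
  return 0;
}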
You don't say it explicitly here, but I presume the mid-end takes care of
zeroing the accumulator register before the loop (i.e. op3 in your sequence
in aarch64-simd.md)?

If so, looks good to me. Ok for trunk.

By the way, now that you have the patterns, presumably you could also wire
them up in arm_neon.h.

Thanks for the patch!

James

>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>	* config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>	UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>	UNSPEC_UADALP values.
>	* config/aarch64/iterators.md (ABAL): New int iterator.
>	(ABDL2): Likewise.
>	(ADALP): Likewise.
>	(sur): Add mappings for the above.
>	* config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>	New define_insn.
>	(aarch64_<sur>abal<mode>_4): Likewise.
>	(aarch64_<sur>adalp<mode>_3): Likewise.
>	(<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>	* gcc.c-torture/execute/ssad-run.c: New test.
>	* gcc.c-torture/execute/usad-run.c: Likewise.
>	* gcc.target/aarch64/ssadv16qi.c: Likewise.
>	* gcc.target/aarch64/usadv16qi.c: Likewise.
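On the arm_neon.h remark above: the sequence the expansion emits can already be expressed with existing ACLE intrinsics. The sketch below is only an illustration of that correspondence (the helper name usadv16qi is made up here; it is not something the patch adds):

#include <arm_neon.h>

/* Unsigned sum of absolute differences of two 16-byte vectors,
   accumulated into a V4SI accumulator, using the same
   UABDL2 -> UABAL -> UADALP sequence as the expansion.  */
static inline uint32x4_t
usadv16qi (uint32x4_t acc, uint8x16_t a, uint8x16_t b)
{
  uint16x8_t tmp = vabdl_high_u8 (a, b);                   /* UABDL2 */
  tmp = vabal_u8 (tmp, vget_low_u8 (a), vget_low_u8 (b));  /* UABAL  */
  return vpadalq_u16 (acc, tmp);                           /* UADALP */
}

A final vaddvq_u32 on the accumulator then gives the scalar sum, matching the addv/umov pair in the generated code shown earlier in the thread.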