On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on g...@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly
> used pattern: reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of
> operands 1 and 2 and reduce them while widening along the way up to SImode.
> So the inputs are V16QImode and the output is V4SImode.
>
> I've tried out a few different strategies for that; the one I settled on is
> to emit:
> UABDL2 tmp.8h, op1.16b, op2.16b
> UABAL  tmp.8h, op1.8b, op2.8b
> UADALP op3.4s, tmp.8h
>
> To work through the semantics, let's say operands 1 and 2 are:
> op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
> op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute
> differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]),
>       ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute
> differences, widens them and accumulates them into the V8HImode tmp from the
> previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into
> the V4SImode op3:
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
>
> Remember, according to [1] the exact reduction sequence doesn't matter
> (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example:
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short
> (3 instructions), and in the future we can potentially look to optimise
> multiple occurrences of these into something even faster (for example
> accumulating into H registers for longer before doing a single UADALP at
> the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular
> sequence, please let me know or, even better, propose a patch to
> parametrise the generation sequence by code (or the appropriate RTX cost).
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two
> steps and performing V4SI arithmetic on them.
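To make the walkthrough above concrete, here is a scalar C model of what the UABDL2/UABAL/UADALP sequence computes. This is purely illustrative (the function name and the tmp/op1/op2/op3 naming just mirror the description above); it is not code from the patch.

#include <stdint.h>
#include <stdlib.h>

/* Scalar model of:
     UABDL2 tmp.8h, op1.16b, op2.16b
     UABAL  tmp.8h, op1.8b,  op2.8b
     UADALP op3.4s, tmp.8h  */
static void
usad_v16qi_model (uint32_t op3[4], const uint8_t op1[16], const uint8_t op2[16])
{
  uint16_t tmp[8];

  /* UABDL2: widening absolute differences of the upper eight bytes.  */
  for (int i = 0; i < 8; i++)
    tmp[i] = abs (op1[i + 8] - op2[i + 8]);

  /* UABAL: accumulate the widened absolute differences of the lower
     eight bytes into tmp.  */
  for (int i = 0; i < 8; i++)
    tmp[i] += abs (op1[i] - op2[i]);

  /* UADALP: pairwise widen tmp and accumulate into the V4SI op3.  */
  for (int i = 0; i < 4; i++)
    op3[i] += (uint32_t) tmp[2 * i] + tmp[2 * i + 1];
}

Each element of op3 ends up holding its old value plus four of the sixteen absolute differences, matching the op3 expansion spelled out above.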
> So, for the code:
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about an 8% improvement on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
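A standalone runtime check in the same spirit as the new usad-run.c execute test could look like the following. This is only a sketch, not the actual contents of the test added by the patch; it simply compares the vectorisable loop above against a branchy scalar computation of the same sum.

#include <stdlib.h>

#define N 16

unsigned char pix1[N], pix2[N];

/* The vectorisable reduction from the example above.  */
__attribute__ ((noinline)) int
foo (void)
{
  int i_sum = 0;
  for (int i = 0; i < N; i++)
    i_sum += __builtin_abs (pix1[i] - pix2[i]);
  return i_sum;
}

int
main (void)
{
  int expected = 0;

  for (int i = 0; i < N; i++)
    {
      pix1[i] = (unsigned char) (i * 17);
      pix2[i] = (unsigned char) (255 - i * 3);

      /* Scalar reference: accumulate the absolute difference without
         relying on __builtin_abs.  */
      int d = pix1[i] - pix2[i];
      expected += d < 0 ? -d : d;
    }

  if (foo () != expected)
    abort ();
  return 0;
}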
You don't say it explicitly here, but I presume the mid-end takes care of
zeroing the accumulator register before the loop (i.e. op3 in your sequence
in aarch64-simd.md)?

If so, looks good to me. Ok for trunk.

By the way, now that you have the patterns, presumably you could also wire
them up in arm_neon.h.

Thanks for the patch!

James

>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>	* config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>	UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>	UNSPEC_UADALP values.
>	* config/aarch64/iterators.md (ABAL): New int iterator.
>	(ABDL2): Likewise.
>	(ADALP): Likewise.
>	(sur): Add mappings for the above.
>	* config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>	New define_insn.
>	(aarch64_<sur>abal<mode>_4): Likewise.
>	(aarch64_<sur>adalp<mode>_3): Likewise.
>	(<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>	* gcc.c-torture/execute/ssad-run.c: New test.
>	* gcc.c-torture/execute/usad-run.c: Likewise.
>	* gcc.target/aarch64/ssadv16qi.c: Likewise.
>	* gcc.target/aarch64/usadv16qi.c: Likewise.
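On the arm_neon.h remark above: the sequence the expansion emits can already be expressed with existing ACLE intrinsics. The sketch below is only an illustration of that correspondence (the helper name usadv16qi is made up here; it is not something the patch adds):

#include <arm_neon.h>

/* Unsigned sum of absolute differences of two 16-byte vectors,
   accumulated into a V4SI accumulator, using the same
   UABDL2 -> UABAL -> UADALP sequence as the expansion.  */
static inline uint32x4_t
usadv16qi (uint32x4_t acc, uint8x16_t a, uint8x16_t b)
{
  uint16x8_t tmp = vabdl_high_u8 (a, b);                   /* UABDL2 */
  tmp = vabal_u8 (tmp, vget_low_u8 (a), vget_low_u8 (b));  /* UABAL  */
  return vpadalq_u16 (acc, tmp);                           /* UADALP */
}

A final vaddvq_u32 on the accumulator then gives the scalar sum, matching the addv/umov pair in the generated code shown earlier in the thread.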