On Thu, 21 Jan 2021, Richard Sandiford wrote:
> Joel Hutton via Gcc-patches <[email protected]> writes:
> > Hi all,
> >
> > This patch removes support for the widening subtract operation in the
> > aarch64 backend as it is causing a performance regression.
> >
> > In the following example:
> >
> > #include <stdint.h>
> > extern void wdiff( int16_t d[16], uint8_t *restrict pix1,
> >                    uint8_t *restrict pix2)
> > {
> >   for( int y = 0; y < 4; y++ )
> >   {
> >     for( int x = 0; x < 4; x++ )
> >       d[x + y*4] = pix1[x] - pix2[x];
> >     pix1 += 16;
> >     pix2 += 16;
> >   }
> > }
> > The widening minus pattern is recognized and substituted, but cannot be
> > used due to the input vector type chosen in slp vectorization. This results
> > in an attempt to do an 8 byte->8 short widening subtract operation, which
> > is not supported.
> >
> > The issue is documented in PR 98772.
>
> IMO removing just the sub patterns is too arbitrary. Like you say,
> the PR affects all widening instructions, but it happens that the
> first encountered regression used subtraction.
>
> I think we should try to fix the PR instead. The widening operations
> can only be used for SLP if the group_size is at least double the
> number of elements in the vectype, so one idea (not worked through)
> is to make the vect_build_slp_tree family of routines undo pattern
> recognition for widening operations if the group size is less than that.
>
> Richi would know better than me though.
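
(For reference, a minimal C sketch of the constraint described in the
quoted paragraph above; the helper and its names are hypothetical, not
actual GCC internals.  The idea is that a hi/lo widening pair consumes
one full vector of narrow elements and produces two vectors of wide
elements, so the SLP group must cover at least twice the vectype's
element count.)

#include <stdbool.h>

/* Hypothetical helper, not GCC code: a widening operation is only
   usable for SLP when the group size is at least double the number
   of elements in the vectype.  E.g. with v8hi as the vectype, a
   group of at least 16 scalar stmts is needed to fill both the lo
   and the hi widened halves.  */
static bool
widen_op_usable_for_slp_p (unsigned int group_size,
                           unsigned int nunits_in_vectype)
{
  return group_size >= 2 * nunits_in_vectype;
}
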
Why should the widen ops be only usable with such constraints?
As I read md.texi they for example do v8qi -> v8hi operations
(where the v8qi is either the lo or hi part of a v16qi vector).
The dumps show we use a VF of 4, which gives us two v8hi
vectors and one v16qi, which vectorizable_conversion should
be able to handle by emitting hi/lo widened subtracts.
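
(Concretely, the hi/lo widened subtracts are usubl/usubl2.  Below is a
hand-written arm_neon.h sketch of that shape, an illustration only and
not the vectorizer's actual output, assuming the two v16qi operands
have already been assembled as the ins sequence further down does.)

#include <arm_neon.h>
#include <stdint.h>

/* Illustration only: given two gathered v16qi operands, a usubl on
   the low halves plus a usubl2 on the high halves yields the two
   v8hi results, which are then stored with a single stp.  */
static inline void
widen_sub_v16qi (int16_t d[16], uint8x16_t a, uint8x16_t b)
{
  uint16x8_t lo = vsubl_u8 (vget_low_u8 (a), vget_low_u8 (b)); /* usubl  */
  uint16x8_t hi = vsubl_high_u8 (a, b);                        /* usubl2 */
  vst1q_s16 (d,     vreinterpretq_s16_u16 (lo));
  vst1q_s16 (d + 8, vreinterpretq_s16_u16 (hi));
}
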
Of course the dumps also show we fail vector construction
because of the permute.  If, as I say, you force strided code-gen
we get optimal code:
wdiff:
.LFB0:
        .cfi_startproc
        lsl     w3, w3, 4
        ldr     s0, [x1]
        ldr     s1, [x2]
        sxtw    x4, w3
        ldr     s3, [x1, w3, sxtw]
        add     x5, x2, x4
        ldr     s2, [x2, w3, sxtw]
        add     x1, x1, x4
        add     x2, x1, x4
        add     x4, x5, x4
        ins     v0.s[1], v3.s[0]
        ldr     s4, [x5, w3, sxtw]
        ins     v1.s[1], v2.s[0]
        ldr     s5, [x1, w3, sxtw]
        ldr     s2, [x4, w3, sxtw]
        ldr     s3, [x2, w3, sxtw]
        ins     v0.s[2], v5.s[0]
        ins     v1.s[2], v4.s[0]
        ins     v0.s[3], v3.s[0]
        ins     v1.s[3], v2.s[0]
        usubl   v2.8h, v0.8b, v1.8b
        usubl2  v0.8h, v0.16b, v1.16b
        stp     q2, q0, [x0]
        ret
from
void wdiff( short d[16], unsigned char *restrict pix1,
            unsigned char *restrict pix2, int s)
{
  for( int y = 0; y < 4; y++ )
  {
    for( int x = 0; x < 4; x++ )
      d[x + y*4] = pix1[x] - pix2[x];
    pix1 += 16*s;
    pix2 += 16*s;
  }
}
so the fix, if any, is to fix the bug that mentions this issue
and appropriately classify / vectorize the load.
Richard
> Thanks,
> Richard
>
> >
> >
> > [AArch64] Remove backend support for widen-sub
> >
> > This patch removes support for the widening subtract operation in the
> > aarch64 backend as it is causing a performance regression.
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md
> > (vec_widen_<su>subl_lo_<mode>): Removed.
> > (vec_widen_<su>subl_hi_<mode>): Removed.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/vect-widen-sub.c: Removed.
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)