On Thu, 21 Jan 2021, Richard Sandiford wrote:

> Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi all,
> >
> > This patch removes support for the widening subtract operation in the
> > aarch64 backend as it is causing a performance regression.
> >
> > In the following example:
> >
> > #include <stdint.h>
> > extern void wdiff( int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
> > {
> >   for( int y = 0; y < 4; y++ )
> >   {
> >     for( int x = 0; x < 4; x++ )
> >       d[x + y*4] = pix1[x] - pix2[x];
> >     pix1 += 16;
> >     pix2 += 16;
> >   }
> > }
> >
> > The widening minus pattern is recognized and substituted, but cannot be
> > used due to the input vector type chosen in SLP vectorization.  This
> > results in an attempt to do an 8 byte -> 8 short widening subtract
> > operation, which is not supported.
> >
> > The issue is documented in PR 98772.
>
> IMO removing just the sub patterns is too arbitrary.  Like you say, the
> PR affects all widening instructions; it just happens that the first
> regression encountered used subtraction.
>
> I think we should try to fix the PR instead.  The widening operations
> can only be used for SLP if the group_size is at least double the
> number of elements in the vectype, so one idea (not worked through)
> is to make the vect_build_slp_tree family of routines undo pattern
> recognition for widening operations if the group size is less than that.
>
> Richi would know better than me though.
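
For reference, the vec_widen_<su>subl_lo/hi patterns under discussion take
two full-width input vectors and widen one half of the elements at a time,
which on AArch64 maps to the usubl/usubl2 instructions.  A minimal
intrinsics sketch of that lo/hi split (an illustration only, using
arm_neon.h names; not code from the patch or the thread):

#include <arm_neon.h>

/* Sketch only: the lo variant widens the low 8 bytes of each v16qi
   input (usubl), the hi variant the high 8 bytes (usubl2), each
   producing a v8hi result.  */
uint16x8_t
widen_sub_lo (uint8x16_t a, uint8x16_t b)
{
  return vsubl_u8 (vget_low_u8 (a), vget_low_u8 (b));   /* usubl  */
}

uint16x8_t
widen_sub_hi (uint8x16_t a, uint8x16_t b)
{
  return vsubl_high_u8 (a, b);                           /* usubl2 */
}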
Why should the widen ops be only usable with such constraints?  As I read
md.texi they for example do v8qi -> v8hi operations (where the v8qi is
either the lo or hi part of a v16qi vector).

The dumps show we use a VF of 4, which makes us have two v8hi vectors and
one v16qi, which vectorizable_conversion should be able to handle by
emitting hi/lo widened subtracts.  Of course the dumps also show we fail
vector construction because of the permute.

If, as I say, you force strided code-gen we get optimal

wdiff:
.LFB0:
        .cfi_startproc
        lsl     w3, w3, 4
        ldr     s0, [x1]
        ldr     s1, [x2]
        sxtw    x4, w3
        ldr     s3, [x1, w3, sxtw]
        add     x5, x2, x4
        ldr     s2, [x2, w3, sxtw]
        add     x1, x1, x4
        add     x2, x1, x4
        add     x4, x5, x4
        ins     v0.s[1], v3.s[0]
        ldr     s4, [x5, w3, sxtw]
        ins     v1.s[1], v2.s[0]
        ldr     s5, [x1, w3, sxtw]
        ldr     s2, [x4, w3, sxtw]
        ldr     s3, [x2, w3, sxtw]
        ins     v0.s[2], v5.s[0]
        ins     v1.s[2], v4.s[0]
        ins     v0.s[3], v3.s[0]
        ins     v1.s[3], v2.s[0]
        usubl   v2.8h, v0.8b, v1.8b
        usubl2  v0.8h, v0.16b, v1.16b
        stp     q2, q0, [x0]
        ret

from

void wdiff( short d[16], unsigned char *restrict pix1, unsigned char *restrict pix2, int s)
{
  for( int y = 0; y < 4; y++ )
  {
    for( int x = 0; x < 4; x++ )
      d[x + y*4] = pix1[x] - pix2[x];
    pix1 += 16*s;
    pix2 += 16*s;
  }
}

so the fix, if any, is to fix the bug that mentions this issue and to
appropriately classify / vectorize the load.

Richard

> Thanks,
> Richard
>
> >
> > [AArch64] Remove backend support for widen-sub
> >
> > This patch removes support for the widening subtract operation in the
> > aarch64 backend as it is causing a performance regression.
> >
> > gcc/ChangeLog:
> >
> >         * config/aarch64/aarch64-simd.md
> >         (vec_widen_<su>subl_lo_<mode>): Removed.
> >         (vec_widen_<su>subl_hi_<mode>): Removed.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/vect-widen-sub.c: Removed.

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
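
For reference, an intrinsics-level rendering of the usubl/usubl2 tail of
the code-gen quoted above (a sketch only: contiguous vld1q loads stand in
for the strided ldr/ins element gathers in the dump, and the wdiff_sketch
name is illustrative):

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: one v16qi input pair, subtracted and widened one half at a
   time into two v8hi results, then stored as the 16 shorts of d (the
   stp of q2/q0 in the dump).  */
void
wdiff_sketch (int16_t d[16], const uint8_t *pix1, const uint8_t *pix2)
{
  uint8x16_t a = vld1q_u8 (pix1);
  uint8x16_t b = vld1q_u8 (pix2);
  uint16x8_t lo = vsubl_u8 (vget_low_u8 (a), vget_low_u8 (b)); /* usubl  */
  uint16x8_t hi = vsubl_high_u8 (a, b);                        /* usubl2 */
  vst1q_s16 (d, vreinterpretq_s16_u16 (lo));
  vst1q_s16 (d + 8, vreinterpretq_s16_u16 (hi));
}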