On Thu, 21 Jan 2021, Richard Sandiford wrote:

> Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Hi all, 
> >
> > This patch removes support for the widening subtract operation in the 
> > aarch64 backend as it is causing a performance regression.
> >
> > In the following example:
> >
> > #include <stdint.h>
> > extern void wdiff( int16_t d[16], uint8_t *restrict pix1,
> >                    uint8_t *restrict pix2)
> > {
> >   for( int y = 0; y < 4; y++ )
> >   {
> >     for( int x = 0; x < 4; x++ )
> >       d[x + y*4] = pix1[x] - pix2[x];
> >     pix1 += 16;
> >     pix2 += 16;
> >   }
> > }
> >
> > The widening minus pattern is recognized and substituted, but cannot be
> > used due to the input vector type chosen in SLP vectorization.  This
> > results in an attempt to do an 8-byte -> 8-short widening subtract
> > operation, which is not supported.
> >
> > The issue is documented in PR 98772.
> 
> IMO removing just the sub patterns is too arbitrary.  Like you say,
> the PR affects all widening instructions; it just happens that the
> first regression encountered used subtraction.
> 
> I think we should try to fix the PR instead.  The widening operations
> can only be used for SLP if the group_size is at least double the
> number of elements in the vectype, so one idea (not worked through)
> is to make the vect_build_slp_tree family of routines undo pattern
> recognition for widening operations if the group size is less than that.
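
  A minimal sketch of that idea (illustration only, not actual GCC code;
  widening_pattern_p is an assumed predicate, the other names are meant to
  mirror existing vectorizer interfaces):

    /* In the vect_build_slp_tree family: fall back to the original scalar
       statement when the SLP group is smaller than twice the number of
       elements in the chosen vectype, i.e. when the widening pattern
       cannot be fed from a full input vector.  */
    poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
    if (widening_pattern_p (stmt_info)
        && maybe_lt (group_size, nunits * 2))
      /* Undo pattern recognition and vectorize the original statement.  */
      stmt_info = STMT_VINFO_RELATED_STMT (stmt_info);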
> 
> Richi would know better than me though.

Why should the widen ops only be usable with such constraints?
As I read md.texi, they for example perform v8qi -> v8hi operations
(where the v8qi is either the lo or the hi part of a v16qi vector).
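
For illustration only (not from the original mail): here is that lo/hi
pairing at the ACLE intrinsics level; it corresponds to the usubl/usubl2
instructions in the dump further down.

#include <arm_neon.h>

/* lo: widen-subtract the low v8qi half of each v16qi into a v8hi (usubl).  */
uint16x8_t widen_sub_lo (uint8x16_t a, uint8x16_t b)
{
  return vsubl_u8 (vget_low_u8 (a), vget_low_u8 (b));
}

/* hi: widen-subtract the high v8qi half of each v16qi into a v8hi (usubl2).  */
uint16x8_t widen_sub_hi (uint8x16_t a, uint8x16_t b)
{
  return vsubl_high_u8 (a, b);
}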

The dumps show we use a VF of 4, which gives us two v8hi
vectors and one v16qi; vectorizable_conversion should be able
to handle that by emitting hi/lo widened subtracts.
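
(Restating the arithmetic behind those counts: the SLP instance covers
4 x 4 = 16 int16_t results, i.e. two v8hi vectors, while each uint8_t
input operand contributes 16 elements, i.e. one v16qi vector, which is
exactly what one lo/hi pair of widening subtracts consumes and produces.)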

Of course the dumps also show we fail vector construction
because of the permute.  If, as I say, you force strided code-gen
we get optimal

wdiff:
.LFB0:
        .cfi_startproc
        lsl     w3, w3, 4
        ldr     s0, [x1]
        ldr     s1, [x2]
        sxtw    x4, w3
        ldr     s3, [x1, w3, sxtw]
        add     x5, x2, x4
        ldr     s2, [x2, w3, sxtw]
        add     x1, x1, x4
        add     x2, x1, x4
        add     x4, x5, x4
        ins     v0.s[1], v3.s[0]
        ldr     s4, [x5, w3, sxtw]
        ins     v1.s[1], v2.s[0]
        ldr     s5, [x1, w3, sxtw]
        ldr     s2, [x4, w3, sxtw]
        ldr     s3, [x2, w3, sxtw]
        ins     v0.s[2], v5.s[0]
        ins     v1.s[2], v4.s[0]
        ins     v0.s[3], v3.s[0]
        ins     v1.s[3], v2.s[0]
        usubl   v2.8h, v0.8b, v1.8b
        usubl2  v0.8h, v0.16b, v1.16b
        stp     q2, q0, [x0]
        ret

from

void wdiff( short d[16], unsigned char *restrict pix1,
            unsigned char *restrict pix2, int s)
{
  for( int y = 0; y < 4; y++ )
  {
    for( int x = 0; x < 4; x++ )
      d[x + y*4] = pix1[x] - pix2[x];
    pix1 += 16*s;
    pix2 += 16*s;
  }
}

so the fix, if any, is to address the bug that mentions this issue
and appropriately classify / vectorize the load.

Richard

> Thanks,
> Richard
> 
> >
> >
> > [AArch64] Remove backend support for widen-sub
> >
> > This patch removes support for the widening subtract operation in the 
> > aarch64 backend as it is causing a performance regression.
> >
> > gcc/ChangeLog:
> >
> >         * config/aarch64/aarch64-simd.md    
> >         (vec_widen_<su>subl_lo_<mode>): Removed.
> >         (vec_widen_<su>subl_hi_<mode>): Removed.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/vect-widen-sub.c: Removed.
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
