[Bug tree-optimization/89007] New: Implement generic vector average expansion

ktkachov at gcc dot gnu.org Wed, 23 Jan 2019 00:53:16 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89007


            Bug ID: 89007
           Summary: Implement generic vector average expansion
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

GCC 9 has been able to recognise vector average operations since PR 85694,
and some targets provide optabs that implement them in a single instruction.

For targets that don't, we could still do better than the fallback sequence of
widening, arithmetic and narrowing. Maybe we could implement a generic
expansion for the case where there is no target optab.

For example:
#define N 1024
unsigned char dst[N];
unsigned char in1[N];
unsigned char in2[N];

void
foo ()
{
  for( int x = 0; x < N; x++ )
    dst[x] = (in1[x] + in2[x] + 1) >> 1;
}

For aarch64 -march=armv8-a+sve -O3 we generate:
.L2:
        ld1b    z0.b, p0/z, [x5, x0]
        ld1b    z2.b, p0/z, [x4, x0]
        uunpklo z1.h, z0.b
        uunpklo z3.h, z2.b
        uunpkhi z0.h, z0.b
        uunpkhi z2.h, z2.b
        add     z1.h, z1.h, z3.h
        add     z0.h, z0.h, z2.h
        add     z1.h, z1.h, #1
        add     z0.h, z0.h, #1
        lsr     z1.h, z1.h, #1
        lsr     z0.h, z0.h, #1
        uzp1    z0.b, z1.b, z0.b
        st1b    z0.b, p0, [x2, x0]
        incb    x0
        whilelo p0.b, x0, x3
        bne     .L2


But we could generate the more efficient sequence below (loop control omitted):
    ld1b    {z0.b}, p0/z, [x0, x4]
    ld1b    {z2.b}, p0/z, [x1, x4]
    orr     z4.d, z0.d, z2.d         // use and for floor rounding
    and     z4.b, z4.b, #1
    lsr     z0.b, z0.b, #1            // use asr for signed numbers
    lsr     z2.b, z2.b, #1            // likewise
    add     z0.b, z0.b, z2.b
    add     z0.b, z0.b, z4.b
    st1b    {z0.b}, p0, [x2, x4]

I think this doesn't require much fancy target support, just vector shifts,
adds and some masking operations.
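
The proposed sequence relies on an identity that lets the average be computed
in the narrow element type without overflow. The following is only a sketch to
illustrate that identity on scalars, not the proposed GCC implementation; the
function names are hypothetical and chosen for this example. It compares the
widened form used by the source loop against the narrow form for every pair of
unsigned char inputs, for both the rounding-up (orr) and rounding-down (and)
variants:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: operands are promoted to int, so the sum cannot overflow.
   This is what the source loop computes.  */
static uint8_t
avg_ceil_wide (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a + b + 1) >> 1);
}

/* Narrow form: halve each input first, then add back the rounding bit
   derived from the low bits.  The result never exceeds 255, so it fits
   in an 8-bit lane.  */
static uint8_t
avg_ceil_narrow (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a >> 1) + (b >> 1) + ((a | b) & 1));
}

/* Floor variants: no +1 in the reference, AND instead of OR in the
   narrow form.  */
static uint8_t
avg_floor_wide (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a + b) >> 1);
}

static uint8_t
avg_floor_narrow (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a >> 1) + (b >> 1) + ((a & b) & 1));
}

/* Exhaustively check both identities over all 256 * 256 input pairs.  */
static void
check_avg_identities (void)
{
  for (int a = 0; a < 256; a++)
    for (int b = 0; b < 256; b++)
      {
        assert (avg_ceil_narrow ((uint8_t) a, (uint8_t) b)
                == avg_ceil_wide ((uint8_t) a, (uint8_t) b));
        assert (avg_floor_narrow ((uint8_t) a, (uint8_t) b)
                == avg_floor_wide ((uint8_t) a, (uint8_t) b));
      }
}
```

Calling check_avg_identities () from a test harness should complete without
tripping an assertion. The signed case mentioned in the comments above would
additionally assume arithmetic right shifts, which this sketch does not cover.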
