https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89007
Bug ID: 89007
Summary: Implement generic vector average expansion
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
GCC 9 has been able to recognise vector average operations since PR 85694, and
some targets have optabs that implement them in a single instruction.
For targets that don't, we could still do better than the fallback widening
-> arithmetic -> narrowing sequence. Maybe we could implement a generic
expansion for the case when there is no target optab.
For example:
#define N 1024
unsigned char dst[N];
unsigned char in1[N];
unsigned char in2[N];
void
foo ()
{
  for (int x = 0; x < N; x++)
    dst[x] = (in1[x] + in2[x] + 1) >> 1;
}
For aarch64 -march=armv8-a+sve -O3 we generate:
.L2:
ld1b z0.b, p0/z, [x5, x0]
ld1b z2.b, p0/z, [x4, x0]
uunpklo z1.h, z0.b
uunpklo z3.h, z2.b
uunpkhi z0.h, z0.b
uunpkhi z2.h, z2.b
add z1.h, z1.h, z3.h
add z0.h, z0.h, z2.h
add z1.h, z1.h, #1
add z0.h, z0.h, #1
lsr z1.h, z1.h, #1
lsr z0.h, z0.h, #1
uzp1 z0.b, z1.b, z0.b
st1b z0.b, p0, [x2, x0]
incb x0
whilelo p0.b, x0, x3
bne .L2
But we could generate the more efficient:
ld1b {z0.b}, p0/z, [x0, x4]
ld1b {z2.b}, p0/z, [x1, x4]
orr z4.d, z0.d, z2.d // use and for floor rounding
and z4.b, z4.b, #1
lsr z0.b, z0.b, #1 // use asr for signed numbers
lsr z2.b, z2.b, #1 // likewise
add z0.b, z0.b, z2.b
add z0.b, z0.b, z4.b
st1b {z0.b}, p0, [x2, x4]
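The shorter sequence relies on the identity (a + b + 1) >> 1 == (a >> 1) + (b >> 1) + ((a | b) & 1), which never needs more than 8 bits for 8-bit inputs. A minimal scalar sketch of that identity follows; the function names are made up for illustration and are not GCC internals:

```c
#include <assert.h>
#include <stdint.h>

/* Reference form: widen, add, round up, shift, narrow.  */
static uint8_t avg_ceil_wide(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Narrow form: halve each input, then add back the rounding bit.
   The largest possible result is 127 + 127 + 1 = 255, so every
   intermediate value fits in 8 bits.  */
static uint8_t avg_ceil_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1) + ((a | b) & 1));
}

/* Floor variant, (a + b) >> 1: AND replaces OR for the low bit.  */
static uint8_t avg_floor_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1) + ((a & b) & 1));
}

/* Exhaustively check all 65536 input pairs against the wide forms.  */
static void verify_all_pairs(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++) {
            assert(avg_ceil_narrow((uint8_t)a, (uint8_t)b)
                   == avg_ceil_wide((uint8_t)a, (uint8_t)b));
            assert(avg_floor_narrow((uint8_t)a, (uint8_t)b)
                   == (uint8_t)((a + b) >> 1));
        }
}
```

Calling verify_all_pairs () from a test driver checks both variants over the full unsigned char range. The signed case (asr instead of lsr) is not covered by this sketch.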
I think this doesn't require much fancy target support, just some vector
masking operations.
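As a rough illustration of how little is needed, the same expansion can be written with GCC's generic vector extensions, using only elementwise OR, AND, shift and add. This is a sketch under assumptions: the type v16qu and the helpers vavg_ceil and avg_block16 are hypothetical names, and the 16-byte vector width is arbitrary.

```c
#include <assert.h>
#include <string.h>

/* 16 unsigned 8-bit lanes, using the GCC vector extension.  */
typedef unsigned char v16qu __attribute__ ((vector_size (16)));

/* Rounding average per lane, with no widening:
   (a >> 1) + (b >> 1) + ((a | b) & 1).
   The scalar operand 1 is broadcast to every lane.  */
static v16qu vavg_ceil (v16qu a, v16qu b)
{
  v16qu carry = (a | b) & 1;
  return (a >> 1) + (b >> 1) + carry;
}

/* Average one 16-byte block of x and y into dst.
   memcpy avoids any alignment assumptions on the pointers.  */
static void avg_block16 (unsigned char *dst, const unsigned char *x,
                         const unsigned char *y)
{
  v16qu a, b, r;
  memcpy (&a, x, sizeof a);
  memcpy (&b, y, sizeof b);
  r = vavg_ceil (a, b);
  memcpy (dst, &r, sizeof r);
}
```

Because the shifts operate on 8-bit lanes, no bits cross lane boundaries, so the per-lane result should match the scalar (in1[x] + in2[x] + 1) >> 1 from the testcase.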