https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89007
            Bug ID: 89007
           Summary: Implement generic vector average expansion
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

GCC 9 has been able to recognise vector average operations since PR 85694.
Some targets have optabs that do the operation in one instruction. For targets
that don't, we could still do better than the fallback
widening -> arithmetic -> narrowing sequence. Maybe we could implement a
generic expansion for the case where there is no target optab.

For example:

    #define N 1024
    unsigned char dst[N];
    unsigned char in1[N];
    unsigned char in2[N];

    void
    foo ()
    {
      for( int x = 0; x < N; x++ )
        dst[x] = (in1[x] + in2[x] + 1) >> 1;
    }

For aarch64 with -march=armv8-a+sve -O3 we generate:

    .L2:
            ld1b    z0.b, p0/z, [x5, x0]
            ld1b    z2.b, p0/z, [x4, x0]
            uunpklo z1.h, z0.b
            uunpklo z3.h, z2.b
            uunpkhi z0.h, z0.b
            uunpkhi z2.h, z2.b
            add     z1.h, z1.h, z3.h
            add     z0.h, z0.h, z2.h
            add     z1.h, z1.h, #1
            add     z0.h, z0.h, #1
            lsr     z1.h, z1.h, #1
            lsr     z0.h, z0.h, #1
            uzp1    z0.b, z1.b, z0.b
            st1b    z0.b, p0, [x2, x0]
            incb    x0
            whilelo p0.b, x0, x3
            bne     .L2

But we could generate the better sequence:

            ld1b    {z0.b}, p0/z, [x0, x4]
            ld1b    {z2.b}, p0/z, [x1, x4]
            orr     z4.d, z0.d, z2.d   // use and for floor rounding
            and     z4.b, z4.b, #1
            lsr     z0.b, z0.b, #1     // use asr for signed numbers
            lsr     z2.b, z2.b, #1     // likewise
            add     z0.b, z0.b, z2.b
            add     z0.b, z0.b, z4.b
            st1b    {z0.b}, p0, [x2, x4]

I think this doesn't require much fancy target support, just some vector
masking operations.
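
For reference, the proposed sequence rests on the identity
(a + b + 1) >> 1 == (a >> 1) + (b >> 1) + ((a | b) & 1), with & in place of |
for the floor variant. The sum never leaves the element width, so no widening
is needed. Below is a minimal scalar sketch in C for the unsigned 8-bit case.
The function names (avg_ceil_u8, avg_floor_u8, verify_exhaustive) are
hypothetical and only illustrate the arithmetic; they are not existing GCC
code.

```c
#include <assert.h>
#include <stdint.h>

/* Reference forms: the operands are promoted to int, so the sum cannot wrap. */
uint8_t avg_ceil_ref(uint8_t a, uint8_t b)  { return (uint8_t)((a + b + 1) >> 1); }
uint8_t avg_floor_ref(uint8_t a, uint8_t b) { return (uint8_t)((a + b) >> 1); }

/* Rounding-up average without widening (illustrative name).
   Halve each operand, then add 1 if either operand had its low bit set. */
uint8_t avg_ceil_u8(uint8_t a, uint8_t b)
{
    uint8_t low = (uint8_t)((a | b) & 1);
    return (uint8_t)((a >> 1) + (b >> 1) + low);
}

/* Rounding-down average without widening (illustrative name).
   Halve each operand, then add 1 only if both operands had the low bit set. */
uint8_t avg_floor_u8(uint8_t a, uint8_t b)
{
    uint8_t low = (uint8_t)((a & b) & 1);
    return (uint8_t)((a >> 1) + (b >> 1) + low);
}

/* Compare both narrow forms against the reference for every 8-bit pair. */
void verify_exhaustive(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            assert(avg_ceil_u8((uint8_t)a, (uint8_t)b)
                   == avg_ceil_ref((uint8_t)a, (uint8_t)b));
            assert(avg_floor_u8((uint8_t)a, (uint8_t)b)
                   == avg_floor_ref((uint8_t)a, (uint8_t)b));
        }
}
```

Calling verify_exhaustive() checks all 65536 input pairs. The largest
intermediate value is 127 + 127 + 1 = 255, which still fits in 8 bits, and
that is why each step can be done as a plain byte-wide vector operation. The
signed case mentioned in the asm comments would use an arithmetic shift
instead and is not covered by this sketch.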