https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113813
Bug ID: 113813
Summary: Reduction of xor/and/ior of 16 bytes can be improved
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
Take:
```
#define SIGN unsigned
#define TYPE char
#define SIZE 16

void sor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for (int i = 0; i < SIZE; i++)
    b |= a[i];
  *r = b;
}

void sxor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for (int i = 0; i < SIZE; i++)
    b ^= a[i];
  *r = b;
}

void sand(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = -1;
  for (int i = 0; i < SIZE; i++)
    b &= a[i];
  *r = b;
}
```
Currently for sor GCC (at `-O3 -march=armv9-a+sve2 -fno-vect-cost-model`)
produces:
```
ptrue p7.b, vl16
ptrue p6.b, all
ld1b z31.b, p7/z, [x0]
mov z30.b, #0
sel z30.b, p7, z31.b, z30.b
orv b30, p6, z30.b
str b30, [x1]
```
But this could be improved to just:
```
ptrue p7.b, vl16
ld1b z31.b, p7/z, [x0]
orv b30, p7, z31.b
str b30, [x1]
```
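To illustrate why the `mov`/`sel` pair and the second `ptrue` look redundant, here is a rough scalar model in C of the two sequences. It is only a sketch: `VL_BYTES`, `model_ld1b_vl16`, `model_orv` and the other names are hypothetical, the 256-bit vector length is an assumption, and the model relies on `orv` ignoring inactive lanes (equivalently, treating them as zero).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL_BYTES 32 /* assumed vector length in bytes (256-bit SVE) */
#define ACTIVE   16 /* lanes enabled by "ptrue p7.b, vl16" */

/* Models "ld1b z.b, p7/z, [a]": active lanes are loaded, inactive lanes are zeroed. */
static void model_ld1b_vl16(const uint8_t *a, uint8_t z[VL_BYTES])
{
    for (size_t i = 0; i < VL_BYTES; i++)
        z[i] = (i < ACTIVE) ? a[i] : 0;
}

/* Models "orv b, pg, z.b" where pg enables the first active_lanes lanes. */
static uint8_t model_orv(const uint8_t z[VL_BYTES], size_t active_lanes)
{
    assert(active_lanes <= VL_BYTES);
    uint8_t r = 0;
    for (size_t i = 0; i < active_lanes; i++)
        r |= z[i];
    return r;
}

/* Current sequence: select against zero, then reduce under the all-true p6. */
uint8_t sor_current(const uint8_t *a)
{
    uint8_t z31[VL_BYTES], z30[VL_BYTES];
    model_ld1b_vl16(a, z31);
    for (size_t i = 0; i < VL_BYTES; i++)
        z30[i] = (i < ACTIVE) ? z31[i] : 0; /* mov z30.b, #0 + sel */
    return model_orv(z30, VL_BYTES);
}

/* Proposed sequence: reduce the loaded vector directly under p7. */
uint8_t sor_proposed(const uint8_t *a)
{
    uint8_t z31[VL_BYTES];
    model_ld1b_vl16(a, z31);
    return model_orv(z31, ACTIVE);
}

/* Compares both sequences over a few deterministic inputs; returns the mismatch count. */
int count_mismatches(void)
{
    uint8_t a[ACTIVE];
    int mismatches = 0;
    for (unsigned seed = 0; seed < 256; seed++) {
        for (size_t i = 0; i < ACTIVE; i++)
            a[i] = (uint8_t)(seed * 31u + i * 17u);
        if (sor_current(a) != sor_proposed(a))
            mismatches++;
    }
    return mismatches;
}
```

In this model the two sequences agree for every input, since the lanes that `sel` zeroes are exactly the lanes that the `p7`-predicated reduction skips.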
The same applies to sxor and sand.
It also applies to short and int (SIZE of 8 and 4, respectively).
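For reference, the short/int reproducers might look like the sketch below. The `DEFINE_SOR` macro and the `sor_u16`/`sor_u32`/`check_variants` names are hypothetical and not part of the original testcase. It also assumes that 8/4 means the element count shrinks so that the total stays at 16 bytes.

```c
#include <stddef.h>

/* Hypothetical helper: defines an OR reduction over N elements of type T,
   mirroring the shape of sor() from the testcase. */
#define DEFINE_SOR(NAME, T, N)        \
    void NAME(T *a, T *r)             \
    {                                 \
        T b = 0;                      \
        for (int i = 0; i < (N); i++) \
            b |= a[i];                \
        *r = b;                       \
    }

/* Assumption: both variants still cover exactly 16 bytes. */
_Static_assert(sizeof(unsigned short) * 8 == 16, "expected 2-byte short");
_Static_assert(sizeof(unsigned int) * 4 == 16, "expected 4-byte int");

DEFINE_SOR(sor_u16, unsigned short, 8)
DEFINE_SOR(sor_u32, unsigned int, 4)

/* Small sanity check of the scalar semantics; returns 1 on success. */
int check_variants(void)
{
    unsigned short h[8] = {0x0001, 0x0100, 0, 0, 0, 0, 0, 0x8000};
    unsigned int   w[4] = {0x1u, 0x10000u, 0, 0x80000000u};
    unsigned short rh = 0;
    unsigned int   rw = 0;
    sor_u16(h, &rh);
    sor_u32(w, &rw);
    return rh == 0x8101 && rw == 0x80010001u;
}
```

Swapping `|=` for `^=` or `&=` (with `b` initialised to `-1` in the latter case) would give the corresponding sxor/sand variants.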
Note that without `-fno-vect-cost-model` the generated code is much worse (on
the trunk only).
Note that we should also be able to use the SVE reduction instruction when
preferring NEON auto-vectorization.