https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116186
Bug ID: 116186
Summary: the scalar cost for popcount is off for -mcpu=neoverse-n2 (and generic-armv9-a)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
Take:
```
void
f_v4si (unsigned int *__restrict b, unsigned int *__restrict d)
{
  d[0] = __builtin_popcountll (b[0]);
  d[1] = __builtin_popcountll (b[1]);
  d[2] = __builtin_popcountll (b[2]);
  d[3] = __builtin_popcountll (b[3]);
}
```
This should be SLP vectorized, but currently it is not with
`-O3 -mcpu=neoverse-n2` because of the cost model:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
Vector cost: 7
Scalar cost: 4
/app/example.cpp:5:8: missed: not vectorized: vectorization is not profitable.
```
But the cost of a scalar popcount here is roughly the same as the cost of
doing the popcount in V2SI, so the scalar cost of 4 for the four popcounts
looks too low.
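To illustrate why the scalar popcount is not cheap, here is a minimal portable C sketch. It assumes the target has no scalar popcount instruction (no CSSC), so the builtin ends up as a per-byte bit count followed by a sum across the bytes, which is the same kind of work the vector form does. The names `bytewise_count`, `sum_bytes` and `model_popcountll` are hypothetical, and this is only a model of the two steps, not the actual expansion in the aarch64 backend:

```c
#include <stdint.h>

/* Hypothetical model of a per-byte bit count: each byte of the result
   holds the number of set bits in the corresponding input byte.  */
static uint64_t
bytewise_count (uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
  return x;
}

/* Hypothetical model of the horizontal add across the 8 bytes.  */
static unsigned
sum_bytes (uint64_t x)
{
  unsigned sum = 0;
  for (int i = 0; i < 8; i++)
    sum += (x >> (8 * i)) & 0xff;
  return sum;
}

/* Scalar popcount modelled as "count per byte, then sum the bytes".  */
unsigned
model_popcountll (uint64_t x)
{
  return sum_bytes (bytewise_count (x));
}
```

Each scalar popcount pays for both steps on its own, whereas the vectorized form shares the per-byte count across all lanes.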
With generic-armv8 we get:
```
/app/example.cpp:5:8: note: Cost model analysis for part in loop 0:
Vector cost: 3
Scalar cost: 4
```
And it is vectorized.