https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122014
Bug ID: 122014
Summary: (AArch64) Optimize 8-bit and 16-bit popcount as special cases
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: Explorer09 at gmail dot com
Target Milestone: ---
GCC for AArch64 can use the CNT instruction for popcount. However, for 8-bit
and 16-bit integers, the code GCC emits is not the shortest possible.
This is a feature request: implement 8-bit and 16-bit popcount operations as
special cases.
Below is my implementation using intrinsics, alongside GCC's builtin for
comparison. In particular, there is no need to bitwise-AND the value when the
input is 8 or 16 bits wide.
The implementation using intrinsics should also work with ARMv7-A+NEON, but I'm
filing this report for AArch64 only, as it seems that GCC doesn't yet support
popcount using NEON on ARMv7-A.
```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
unsigned int popcount_8(uint8_t x) {
    // Set all lanes at once so that the compiler doesn't need to mask
    // out the upper bits.
    uint8x8_t v = vdup_n_u8(x);
    v = vcnt_u8(v);
    return vget_lane_u8(v, 0);
}
unsigned int popcount_16(uint16_t x) {
    uint16x4_t v_h = vdup_n_u16(x);
    uint8x8_t v_b = vcnt_u8(vreinterpret_u8_u16(v_h));
    v_h = vpaddl_u8(v_b);
    return vget_lane_u16(v_h, 0);
}
#endif
unsigned int popcount_8_b(uint8_t x) {
    return (unsigned int)__builtin_popcountg(x);
}
unsigned int popcount_16_b(uint16_t x) {
    return (unsigned int)__builtin_popcountg(x);
}
```
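To illustrate why the 16-bit intrinsic version needs no mask, here is a portable scalar sketch of what lane 0 holds after the DUP, CNT and UADDLP steps. It is only a model for reasoning, not GCC's implementation; the names `cnt_byte` and `popcount_16_model` are made up for this example.

```c
#include <stdint.h>

/* Population count of one byte, modelling what CNT computes for a
   single 8-bit lane. */
static unsigned int cnt_byte(uint8_t b) {
    unsigned int n = 0;
    while (b) {
        b &= (uint8_t)(b - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Scalar model of lane 0 in popcount_16:
   - DUP writes x into every 16-bit lane, so byte lanes 0 and 1 hold
     the low and high byte of x and nothing else;
   - CNT counts the bits of each byte lane independently;
   - UADDLP adds byte lanes 0 and 1 into 16-bit lane 0.
   No bits outside x ever reach lane 0, so no AND is required. */
unsigned int popcount_16_model(uint16_t x) {
    uint8_t lo = (uint8_t)(x & 0xFF);
    uint8_t hi = (uint8_t)(x >> 8);
    return cnt_byte(lo) + cnt_byte(hi);
}
```

The same reasoning covers the 8-bit case, where lane 0 is just the count of the single input byte.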
(Tested in Compiler Explorer)
ARM64 GCC 15.2.0 with `-Os` option:
```assembly
popcount_8:
    dup v31.8b, w0
    cnt v31.8b, v31.8b
    umov w0, v31.b[0]
    ret
popcount_16:
    dup v31.4h, w0
    cnt v31.8b, v31.8b
    uaddlp v31.4h, v31.8b
    umov w0, v31.h[0]
    ret
popcount_8_b:
    and w0, w0, 255
    fmov d31, x0
    cnt v31.8b, v31.8b
    smov w0, v31.b[0]
    ret
popcount_16_b:
    and x0, x0, 65535
    fmov d31, x0
    cnt v31.8b, v31.8b
    addv b31, v31.8b
    fmov w0, s31
    ret
```
armv8-a clang 21.1.0 with `-Os` option:
```assembly
popcount_8:
    fmov s0, w0
    cnt v0.8b, v0.8b
    umov w0, v0.b[0]
    ret
popcount_16:
    dup v0.4h, w0
    cnt v0.8b, v0.8b
    uaddlp v0.4h, v0.8b
    umov w0, v0.h[0]
    ret
```
(It looks like both the FMOV and DUP instructions work; pick whichever is
cheaper.)
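A rough scalar sketch of why either instruction is acceptable for the 8-bit case: only byte lane 0 is read back, and both FMOV and DUP place the low byte of w0 there. This is a simplified model under the assumption that the upper bits of w0 may hold arbitrary values on entry; the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Count the set bits of one byte lane, as CNT does per lane. */
static uint8_t cnt_lane(uint8_t b) {
    unsigned int n = 0;
    for (int i = 0; i < 8; i++)
        n += (b >> i) & 1u;
    return (uint8_t)n;
}

/* Model of "fmov s0, w0" then "cnt v0.8b, v0.8b": byte lanes 0..3
   receive the four bytes of w0 and the remaining lanes are zero.
   Returns byte lane 0, which is what "umov w0, v0.b[0]" reads. */
unsigned int lane0_after_fmov(uint32_t w0) {
    uint8_t lanes[8] = {0};
    for (int i = 0; i < 4; i++)
        lanes[i] = (uint8_t)(w0 >> (8 * i));
    for (int i = 0; i < 8; i++)
        lanes[i] = cnt_lane(lanes[i]);
    return lanes[0];
}

/* Model of "dup v0.8b, w0" then "cnt v0.8b, v0.8b": every byte lane
   receives the low byte of w0. Returns byte lane 0. */
unsigned int lane0_after_dup(uint32_t w0) {
    uint8_t lanes[8];
    for (int i = 0; i < 8; i++)
        lanes[i] = (uint8_t)w0;
    for (int i = 0; i < 8; i++)
        lanes[i] = cnt_lane(lanes[i]);
    return lanes[0];
}
```

In this model the two functions agree for every input, because any stray upper bits of w0 land in lanes that are never read.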
(I've also reported the issue in Clang:
https://github.com/llvm/llvm-project/issues/159552 )