https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113859
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
SI (and DI) can be optimized too.
LLVM produces for int:
        ldr     d0, [x0]
        cnt     v0.8b, v0.8b
        uaddlp  v0.4h, v0.8b
        uaddlp  v0.2s, v0.4h
        str     d0, [x1]
        ret

And for long:
```
        ldr     q0, [x0]
        cnt     v0.16b, v0.16b
        uaddlp  v0.8h, v0.16b
        uaddlp  v0.4s, v0.8h
        uaddlp  v0.2d, v0.4s
        str     q0, [x1]
        ret
```

That is for the SLP version:
```
void f(unsigned long * __restrict b, unsigned long * __restrict d)
{
  d[0] = __builtin_popcountll(b[0]);
  d[1] = __builtin_popcountll(b[1]);
}
```
s/long/int/ in the first case.

Note that using SVE is better than the above if it is available; that is part
of PR 113860, though.
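
For reference, a rough sketch of what the int case looks like after the s/long/int/ substitution, next to a portable C model of the cnt + uaddlp sequence quoted above. The names f_int and f_int_model are made up for illustration and do not come from the PR; the model assumes a 32-bit unsigned int and only mirrors the arithmetic (per-byte popcount, then two pairwise widening adds), not the actual code generation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* int variant of the SLP test case (s/long/int/, popcountll -> popcount).
   Hypothetical name; not taken from the PR. */
void f_int(unsigned int * __restrict b, unsigned int * __restrict d)
{
  d[0] = __builtin_popcount(b[0]);
  d[1] = __builtin_popcount(b[1]);
}

/* Scalar model of the 64-bit vector sequence, one loop per instruction.
   Assumes unsigned int is 32 bits wide. */
void f_int_model(const unsigned int *b, unsigned int *d)
{
  uint8_t  bytes[8];
  uint16_t halves[4];

  assert(sizeof(unsigned int) == 4);

  memcpy(bytes, b, sizeof bytes);            /* ldr    d0, [x0]      */

  for (int i = 0; i < 8; i++)                /* cnt    v0.8b, v0.8b  */
    bytes[i] = (uint8_t)__builtin_popcount(bytes[i]);

  for (int i = 0; i < 4; i++)                /* uaddlp v0.4h, v0.8b  */
    halves[i] = (uint16_t)(bytes[2 * i] + bytes[2 * i + 1]);

  for (int i = 0; i < 2; i++)                /* uaddlp v0.2s, v0.4h  */
    d[i] = (unsigned int)halves[2 * i] + halves[2 * i + 1];
                                             /* str    d0, [x1]      */
}
```

Both functions should agree for any input, since summing the per-byte counts of each 32-bit element does not depend on byte order within the element. The long case follows the same pattern with 16 bytes and one more pairwise add.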