http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
Gunther Piez <gpiez at web dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gpiez at web dot de

--- Comment #10 from Gunther Piez <gpiez at web dot de> 2012-10-26 15:51:24 UTC ---
Just noticed the exceptional slowness of the provided __builtin_popcountll(),
even on ARMv5.

I already use the above parallel bit count algorithm when a native bit count
instruction (like the SSE popcnt or NEON vcnt) is not present but native 64 bit
registers are available. On a 32 bit architecture like ARM, I figured it made
sense to just use __builtin_popcountll(), because the many 64 bit operations in
the algorithm may be very slow on a pure 32 bit architecture without NEON or
similar support.

But "optimizing" my code with some macro magic to make it use the library
popcount made the whole program 25% slower, although only a minor part of it
actually uses popcount.
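
For readers without the earlier comments at hand: the "parallel bit count algorithm" is not reproduced in this message. The sketch below assumes it is the usual SWAR (SIMD within a register) population count, which may differ in detail from the code posted earlier in this bug. The names popcount64_swar, popcount64_split, POPCOUNT64 and USE_BUILTIN_POPCOUNT are made up for illustration and are not taken from the bug report or from GCC.

```c
#include <stdint.h>

/* Parallel (SWAR) bit count on one 64-bit word.
 * Assumption: this is the common textbook variant, not necessarily
 * the exact code referred to as "the above algorithm". */
static inline unsigned popcount64_swar(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes (each byte holds 0..8). */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Add all bytes together; the total ends up in the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Same technique on a 32-bit word. */
static inline unsigned popcount32_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555U);
    x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
    x = (x + (x >> 4)) & 0x0f0f0f0fU;
    return (x * 0x01010101U) >> 24;
}

/* Variant for 32-bit targets: count each half separately so that
 * no 64-bit arithmetic is needed beyond splitting the word. */
static inline unsigned popcount64_split(uint64_t x)
{
    return popcount32_swar((uint32_t)x) + popcount32_swar((uint32_t)(x >> 32));
}

/* Hypothetical "macro magic": USE_BUILTIN_POPCOUNT is an invented
 * switch that selects the compiler builtin instead of the SWAR code. */
#if defined(USE_BUILTIN_POPCOUNT) && defined(__GNUC__)
#define POPCOUNT64(x) ((unsigned)__builtin_popcountll(x))
#else
#define POPCOUNT64(x) popcount64_swar(x)
#endif
```

Whether the 64-bit version or the split version is faster on a given 32-bit ARM core is something to measure rather than assume; the comment above only reports that the library-backed builtin was the slower choice in one program.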