http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041



Gunther Piez <gpiez at web dot de> changed:



           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gpiez at web dot de



--- Comment #10 from Gunther Piez <gpiez at web dot de> 2012-10-26 15:51:24 UTC ---

I just noticed that the provided __builtin_popcountll() is exceptionally slow
on ARMv5 as well.



I was already using the above parallel bit count algorithm for the case where
no native bit count instruction (like the SSE popcnt or NEON vcnt) is present,
but native 64-bit registers are available.
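
For readers who do not have the earlier comments at hand, here is a minimal
sketch of one common form of a parallel (SWAR) bit count on a 64-bit word. It
is an assumption that this matches the algorithm referred to as "above"; the
function name popcount64_swar is hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: parallel (SWAR) bit count of a 64-bit word.
 * Illustration only; not taken from the bug report. */
static inline unsigned popcount64_swar(uint64_t x)
{
    /* Each 2-bit field now holds the number of set bits it contained. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Add neighbouring 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Add neighbouring 4-bit counts into bytes (each byte is at most 8). */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply sums all eight bytes into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

Every step operates on the full 64-bit value, so on a target with native
64-bit registers it is a short run of shifts, masks, adds and one multiply.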



But on a 32-bit architecture like ARM, I figured it made sense to just use
__builtin_popcountll(), because the many 64-bit operations in the algorithm
may be very slow on a pure 32-bit architecture without NEON or similar support.



But "optimizing" my code with some macro magic so that it uses the library
popcount made the whole program 25% slower, although only a minor part of it
actually uses popcount.
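
The kind of macro selection described here might look roughly like the
following sketch. It assumes GCC or a compatible compiler. POPCOUNT64 and
popcount64_fallback are hypothetical names, and the fallback is one common
parallel bit count, not necessarily the one from the earlier comments. NEON
detection is left out. The last branch is the configuration reported above as
slower.

```c
#include <stdint.h>

/* Hypothetical portable fallback: a common parallel (SWAR) bit count. */
static inline unsigned popcount64_fallback(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

#if defined(__POPCNT__)
/* GCC defines __POPCNT__ when the x86 popcnt instruction is enabled
 * (for example with -mpopcnt), so the builtin maps to that instruction. */
#  define POPCOUNT64(x) ((unsigned)__builtin_popcountll(x))
#elif UINTPTR_MAX > 0xFFFFFFFFu
/* No native instruction, but pointers (and, by assumption, registers)
 * are 64 bits wide: use the parallel bit count. */
#  define POPCOUNT64(x) popcount64_fallback(x)
#else
/* Pure 32-bit target: defer to the builtin, which without a native
 * instruction typically ends up as a call into the compiler's runtime
 * library. */
#  define POPCOUNT64(x) ((unsigned)__builtin_popcountll(x))
#endif
```

Callers would write POPCOUNT64(value) everywhere. Changing the last branch to
use popcount64_fallback is then a one-line edit, which makes it easy to time
both variants on a given 32-bit target.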
