http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
--- Comment #20 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #18)
> I think it is a bad idea to introduce the IFUNC into libgcc_s, because then
> while you speed up the few users of this builtin, you slow down all users of
> libgcc_s (pretty much all C++ programs and lots of C programs), because they
> will need to resolve the ifunc.

I assume it is only those that use the builtin at least once, no? At least
LD_DEBUG seems to say so. I have no idea how heavy the ifunc resolution is, so
ok. We are back to only considering the non-table version...

(By the way, shouldn't these builtins act like C99 inline functions, so we can
sometimes inline them at -O3 (it could also enable vectorization)? Or maybe
they already do and it's just that I didn't test hard enough.)

(In reply to Cristian Rodríguez from comment #19)
> Hold on..Apparently I used ambiguous language in my comment.. adding ifuncs
> to libgcc* was not my real suggestion, but to EMIT such IFUNC s in the
> resulting final user code when the target environment allows it. One
> generic, one hardware/arch specific.

Not sure if that's much better. Ideally we'd clone the hot loop that uses it
and propagate the versioning to that, not just the instruction, but I don't
think we have any code for that. Although if gcc saw the full code:
if(__builtin_cpu_supports("popcnt"))_mm_popcnt_u64(x);else{call lib}
it might already manage to clone the loop.
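
For illustration only, here is a rough sketch of what versioning the whole
loop by hand (rather than the single instruction) could look like in user
code. It is not what gcc generates or is proposed to generate. The names
count_bits, count_bits_hw and count_bits_generic are hypothetical, and the
sketch assumes GCC (or a compatible compiler) targeting x86-64 without
-mpopcnt, in which case __builtin_popcountll typically ends up as the
libgcc call discussed in this bug.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* Hypothetical hardware clone: only this function is compiled with
   POPCNT enabled, so the whole loop can use the instruction. */
__attribute__((target("popcnt")))
static uint64_t count_bits_hw(const uint64_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)_mm_popcnt_u64(v[i]);
    return total;
}
#endif

/* Hypothetical generic clone: uses the builtin, which is lowered
   according to the default target flags (the "call lib" branch). */
static uint64_t count_bits_generic(const uint64_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(v[i]);
    return total;
}

/* The CPU check is done once per call, outside the hot loop. */
uint64_t count_bits(const uint64_t *v, size_t n)
{
#if defined(__x86_64__)
    if (__builtin_cpu_supports("popcnt"))
        return count_bits_hw(v, n);
#endif
    return count_bits_generic(v, n);
}
```

The point of the sketch is only that the check is hoisted out of the loop,
so its cost is paid once per call instead of once per element.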