On Sat, Nov 28, 2009 at 2:13 PM, Yang Zhao <[email protected]> wrote:
> The speed-up is definitely there, but __builtin_popcount() will still
> be drastically faster when architecture-specific optimizations are
> enabled:

I don't think this is the case (except with SSE4's popcnt instruction, which your CFLAGS seem to be enabling). Even when compiling with the Intel C compiler (icc), which can undoubtedly optimize code for Core 2 better than gcc, fast_bitcount is significantly faster.

$ icc -O3 -ipo -march=core2 bc.c -o bc
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icce1aegt.o
bc.c(61): (col. 5) remark: LOOP WAS VECTORIZED.
$ ./bc
1 billion of __builtin_popcount(), fast_bitcount(), and naive() (in that order)
__builtin_popcount(): 5.361 seconds
fast_bitcount(): 1.274 seconds
kr_bitcount(): 20.302 seconds
naive(): 34.547 seconds

Matt

_______________________________________________
Mesa3d-dev mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mesa3d-dev
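
For readers without bc.c at hand, here is a rough sketch of the kind of routines a benchmark like this compares. None of this is taken from bc.c: swar_bitcount() is a hypothetical stand-in for fast_bitcount() using the well-known branch-free parallel count, kr_bitcount() is assumed to be the clear-the-lowest-set-bit loop, and naive_bitcount() is assumed to test one bit per iteration. The names, the 32-bit operand width, and the check function are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed "naive" variant: test each of the 32 bits in turn. */
unsigned naive_bitcount(uint32_t x)
{
    unsigned count = 0;
    for (int i = 0; i < 32; i++)
        count += (x >> i) & 1u;
    return count;
}

/* Assumed "kr" variant: each iteration clears the lowest set bit,
 * so the loop runs once per set bit. */
unsigned kr_bitcount(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Hypothetical stand-in for fast_bitcount(): branch-free parallel count.
 * Sums bits in 2-bit, then 4-bit, then 8-bit fields, and finally adds
 * the four byte sums together with a multiply. */
unsigned swar_bitcount(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (uint32_t)(x * 0x01010101u) >> 24;
}

/* Sanity check: all three variants must agree on a few sample values. */
void check_bitcounts_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0xF0F0F0F0u, 0xDEADBEEFu, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        unsigned expected = naive_bitcount(samples[i]);
        assert(kr_bitcount(samples[i]) == expected);
        assert(swar_bitcount(samples[i]) == expected);
    }
}
```

The parallel version does a fixed, small number of arithmetic operations with no loop or branch, which is one plausible reason a routine of this shape could beat a generic __builtin_popcount() fallback when no popcnt instruction is available for the target.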
