On Fri, Nov 16, 2012 at 02:12:46AM -0800, Colin Percival wrote:
> On 11/15/12 16:09, Solar Designer wrote:
> > The 30% speedup on AMD Bulldozer is primarily due to the use of XOP bit
> > rotate intrinsics, indeed.  With only this one change and no other
> > changes, the speedup was about 25%.
>
> Sounds like those are worth having... at the expense of needing yet another
> compile-time option (or run-time detection, ick).

I think we can start by using #ifdef __XOP__, like I used in the proposed
patch.  The compiler defines this when it is permitted to generate XOP
instructions - e.g., gcc does it when run with -mxop, or when run with
-march=native and the host's CPU supports XOP (as well as for specific
arch names that imply XOP support).

Yes, we could also have --enable-xop, which would add -mxop, and/or we
could have runtime detection (best for users, but most complicated -
especially if we want the code to be almost as fast as each of the
compile-time choices).

> I'll put your legal name in, just in case...

OK.

Alexander
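
For illustration only, here is a minimal sketch of what compile-time selection via #ifdef __XOP__ can look like for a 32-bit rotate. It is not the code from the proposed patch. The names ROTL32_X4, rotl32, rotl7_x4 and rotl7_x4_matches_scalar are hypothetical, and the rotate count of 7 is just an example value. The XOP path assumes gcc or clang with <x86intrin.h>; without -mxop the same source falls back to an SSE2 shift-and-OR, or to plain C on non-x86 targets.

```c
#include <stdint.h>

#if defined(__XOP__)
#include <x86intrin.h>  /* XOP intrinsics; __XOP__ is set by e.g. gcc -mxop */
#define HAVE_VEC 1
/* One XOP instruction rotates all four 32-bit lanes. */
#define ROTL32_X4(x, n) _mm_roti_epi32((x), (n))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_VEC 1
/* SSE2 has no vector rotate: emulate with two shifts and an OR. */
#define ROTL32_X4(x, n) \
    _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), 32 - (n)))
#else
#define HAVE_VEC 0
#endif

/* Scalar reference rotate; n must be in 1..31. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* Hypothetical helper: rotate each of four 32-bit words left by 7 bits. */
static void rotl7_x4(const uint32_t in[4], uint32_t out[4])
{
#if HAVE_VEC
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    x = ROTL32_X4(x, 7);
    _mm_storeu_si128((__m128i *)out, x);
#else
    for (int i = 0; i < 4; i++)
        out[i] = rotl32(in[i], 7);
#endif
}

/* Returns 1 if the selected code path agrees with the scalar reference. */
static int rotl7_x4_matches_scalar(const uint32_t in[4])
{
    uint32_t out[4];
    rotl7_x4(in, out);
    for (int i = 0; i < 4; i++)
        if (out[i] != rotl32(in[i], 7))
            return 0;
    return 1;
}
```

Building the same file with and without -mxop selects the code path at compile time, so no runtime CPU detection is involved. A binary built with -mxop will only run on CPUs that support XOP.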
