------- Comment #2 from herumi at nifty dot com  2010-02-05 17:20 -------
> You should split your application into files that are compiled with either
> -msse2 or -msse4. Using -msse4, you will get what you asked for.
I see, but according to the Intel 64 and IA-32 Architectures Optimization
Reference Manual (http://www.intel.com/Assets/PDF/manual/248966.pdf), the
latency and throughput of pextrd and movd are as follows:

CPU1: 06_{1ah,1eh,1fh,2eh} family
CPU2: 06_{17,1d}

                          latency       throughput
                         CPU1  CPU2     CPU1  CPU2
pextrd reg, xmm1, imm      3     5       1     1      ; p.C-5
movd   r32, xmm            1     1       0.33  0.33   ; p.C-10

(see Table C-3 and Table C-6a in Appendix C)

movd is faster than pextrd on both CPU families, so I think gcc should use
movd.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42968
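
For reference, here is a minimal sketch of the two ways to read the low 32-bit element that the comparison above is about. It assumes an x86 target with SSE2. The function names low_dword_movd, low_dword_extract and check_low_dword are made up for illustration and do not come from the bug report. _mm_cvtsi128_si32 is the intrinsic that corresponds to movd r32, xmm, and _mm_extract_epi32 is the SSE4.1 intrinsic that corresponds to pextrd. Which instruction a given gcc version actually emits for each is not asserted here.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_set_epi32, _mm_cvtsi128_si32 */
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1: _mm_extract_epi32 */
#endif

/* Hypothetical helper: read element 0 with the SSE2 intrinsic (movd form). */
static int low_dword_movd(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}

#ifdef __SSE4_1__
/* Hypothetical helper: read element 0 with the SSE4.1 intrinsic (pextrd form).
   Only compiled when SSE4.1 is enabled, e.g. with -msse4. */
static int low_dword_extract(__m128i v)
{
    return _mm_extract_epi32(v, 0);
}
#endif

/* Hypothetical sanity check: both forms must return the same element. */
static void check_low_dword(void)
{
    /* _mm_set_epi32 takes elements from highest to lowest, so element 0 is 1. */
    __m128i v = _mm_set_epi32(4, 3, 2, 1);
    assert(low_dword_movd(v) == 1);
#ifdef __SSE4_1__
    assert(low_dword_extract(v) == low_dword_movd(v));
#endif
}
```

Compiling this with -O2 -msse4 -S and looking at the assembly for the two helpers is one way to see which instruction is chosen for index 0.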