Hi, I have been hand-optimising a loop that GCC 4.6 was not able to vectorise.
I have been keeping an eye on the assembly output of this loop and have noticed GCC inserting unnecessary MOVAPS instructions. It only happens when I use the same variable for both inputs to an SSE intrinsic.

For example:

    __m128 input = /* something */;
    _mm_shuffle_ps(input, input, _MM_SHUFFLE(2, 3, 0, 1));

will produce instructions like this:

    movaps %xmm0,%xmm1
    shufps $0xb1,%xmm0,%xmm1

where optimally it should be:

    shufps $0xb1,%xmm0,%xmm0

It appears that, because _mm_shuffle_ps is a function, GCC treats the two inputs separately rather than combining them when they refer to the same register. I have looked at Visual C++ 2010 and it does not add these extra MOVAPS instructions.

This probably does not have a large impact on performance, as the CPU should just rename the registers, but it does add more instructions to decode and more bytes to hold in the instruction cache.

The output of gcc -v:

    Using built-in specs.
    COLLECT_GCC=gcc-4.6
    COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.6.2/lto-wrapper
    Target: x86_64-unknown-linux-gnu
    Configured with: ../gcc-4.6.2/configure --program-suffix=-4.6 --enable-threads --enable-languages=c,c++ --enable-linker-build-id
    Thread model: posix
    gcc version 4.6.2 (GCC)

I am running 64-bit Ubuntu 10.10, so I compiled my own GCC 4.6 from source.

The flags I used to compile:

    -O3 -march=native -ffast-math -fbuiltin -g

on an Intel Core 2 Duo.

Thanks,
Leith Bade
le...@leithalweapon.geek.nz
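
P.S. For anyone who wants to try this, a reduced test case might look roughly like the sketch below. The names swap_pairs and check_swap_pairs are made up for illustration and are not taken from my actual loop. I have not confirmed that this reduced form shows the extra MOVAPS: here the argument and the return value both live in %xmm0, so the copy may only show up once the call is inlined into a larger loop.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_shuffle_ps */

/* Hypothetical reduced case: the same variable is passed as both
   operands of the shuffle. _MM_SHUFFLE(2, 3, 0, 1) encodes to 0xb1
   and swaps adjacent lanes: (a, b, c, d) -> (b, a, d, c). */
__m128 swap_pairs(__m128 input)
{
    return _mm_shuffle_ps(input, input, _MM_SHUFFLE(2, 3, 0, 1));
}

/* Sanity check that the shuffle does what the comment above says. */
void check_swap_pairs(void)
{
    float out[4];
    /* _mm_setr_ps fills lanes in memory order: lane 0 = 1.0f. */
    _mm_storeu_ps(out, swap_pairs(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)));
    assert(out[0] == 2.0f && out[1] == 1.0f);
    assert(out[2] == 4.0f && out[3] == 3.0f);
}
```

Compiling this with -O3 -S and reading the generated assembly for swap_pairs (or for a loop that calls it) should show whether the shuffle is emitted as a single SHUFPS or preceded by a MOVAPS.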