Hi,

I have been hand-optimising a loop that GCC 4.6 was not able to vectorise.

I have been keeping an eye on the assembly output of this loop and
have noticed GCC inserting unnecessary MOVAPS instructions.

It only happens when I use the same variable for both inputs to an SSE intrinsic.

E.g.:
__m128 input = /* something */;
_mm_shuffle_ps(input, input, _MM_SHUFFLE(2, 3, 0, 1));

Will produce instructions like this:
movaps %xmm0,%xmm1
shufps $0xb1,%xmm0,%xmm1

Where optimally it should be:
shufps $0xb1,%xmm0,%xmm0
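
A minimal, self-contained test case for this might look like the sketch below. The function names swap_pairs and swap_pairs_to_array are hypothetical and are not taken from the original loop. The sketch only isolates the pattern of passing the same variable as both inputs; whether the extra MOVAPS shows up will likely depend on the surrounding code and on whether the input is still needed afterwards.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_shuffle_ps, ... */

/* Hypothetical test function: swaps adjacent pairs of floats,
   turning {a, b, c, d} into {b, a, d, c}. The same variable is
   passed as both inputs, which is the case described above.
   _MM_SHUFFLE(2, 3, 0, 1) evaluates to 0xb1. */
__m128 swap_pairs(__m128 input)
{
    return _mm_shuffle_ps(input, input, _MM_SHUFFLE(2, 3, 0, 1));
}

/* Hypothetical helper so the result can be checked from plain C. */
void swap_pairs_to_array(float a, float b, float c, float d, float out[4])
{
    __m128 v = _mm_setr_ps(a, b, c, d);  /* element 0 = a, ..., element 3 = d */
    _mm_storeu_ps(out, swap_pairs(v));   /* unaligned store, no alignment needed */
}
```

Compiling this with -S in addition to the flags listed further down writes the generated assembly to a .s file, where the instructions emitted for swap_pairs can be inspected.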

It appears that, because _mm_shuffle_ps is a function, GCC treats the
two inputs separately rather than combining them when they refer to
the same register.

I have looked at Visual C++ 2010 and it does not add these extra
MOVAPS instructions.

This probably does not have a large impact on performance, as the CPU
should just rename the registers, but it does add more instructions to
decode and takes up more space in the instruction cache.

The output of gcc -v:
Using built-in specs.
COLLECT_GCC=gcc-4.6
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.6.2/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.6.2/configure --program-suffix=-4.6
--enable-threads --enable-languages=c,c++ --enable-linker-build-id
Thread model: posix
gcc version 4.6.2 (GCC)

I am running 64-bit Ubuntu 10.10, so I compiled my own GCC 4.6 from source.

The flags I used to compile:
-O3 -march=native -ffast-math -fbuiltin -g
on an Intel Core 2 Duo.

Thanks,
Leith Bade
le...@leithalweapon.geek.nz
