I've noticed that GCC (I'm using version 4.4.1) does not merge two consecutive SSE shuffles of the same register into a single shuffle, as seen in this example:

#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
	/* Initializer is an assumption, inferred from the .LC0 constants below */
	__m128 m = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);

	m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
	m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
	printv(m);

	return 0;
}

This code generates the following assembly:

	movaps	.LC1, %xmm1
	shufps	$201, %xmm1, %xmm1
	shufps	$45, %xmm1, %xmm1	; <-- Both should merge to 78
	movaps	%xmm1, %xmm0
	movaps	%xmm1, -24(%ebp)

.LC0:
	.long	1065353216	; 1.0f
	.long	1073741824	; 2.0f
	.long	1077936128	; 3.0f
	.long	1082130432	; 4.0f

It would be nice to see this merge implemented as an enhancement!

-- 
           Summary: SSE shuffle merge
           Product: gcc
           Version: 4.4.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: liranuna at gmail dot com
 GCC build triplet: x86_64-linux-gnu
  GCC host triplet: x86_64-linux-gnu
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147
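
For what it's worth, the merged immediate quoted in the report (0x4E, i.e. 78) can be checked mechanically. The following is a minimal sketch in C, assuming each shuffle uses the same register for both operands, as in the example. The names `compose_shuffle_imm` and `check_report_example` are hypothetical helpers for illustration, not part of GCC or the intrinsics headers.

```c
#include <assert.h>

/*
 * Hypothetical helper: given the immediates of two back-to-back
 * shuffles of the form m = _mm_shuffle_ps(m, m, imm), return the
 * single immediate that produces the same result.
 *
 * Each 2-bit field of an immediate selects the source lane for one
 * result lane, so lane i of the final vector comes from lane
 * first[second[i]] of the original vector.
 */
static unsigned compose_shuffle_imm(unsigned first, unsigned second)
{
    unsigned merged = 0;

    for (int lane = 0; lane < 4; ++lane) {
        /* Lane of the intermediate vector read by the second shuffle. */
        unsigned from_second = (second >> (2 * lane)) & 3u;
        /* Lane of the original vector that the first shuffle put there. */
        unsigned from_first = (first >> (2 * from_second)) & 3u;

        merged |= from_first << (2 * lane);
    }

    return merged;
}

/* Checks the pair of immediates used in the report. */
static void check_report_example(void)
{
    assert(compose_shuffle_imm(0xC9, 0x2D) == 0x4E);
}
```

This only covers the case where both source operands are the same register. Merging shuffles that mix two different registers would need extra bookkeeping and is not attempted here.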