http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174
             Bug #: 54174
           Summary: Missed optimization: Unnecessary vmovaps generated for
                    __builtin_ia32_vextractf128_ps256(v, 0)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: d...@nimrod.no


Pasting the following test code into test.c and compiling with

    gcc -Wall -O -mavx -S test.c

----
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

v4sf add(v8sf v) {
    v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
    v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
    return a + b;
}
----

makes gcc generate the following code:

    vmovaps %xmm0, %xmm1
    vextractf128 $0x1, %ymm0, %xmm0
    vaddps %xmm0, %xmm1, %xmm0

However, if the statements for a and b are swapped, i.e.

    v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
    v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);

then gcc is able to optimize away the vmovaps instruction:

    vextractf128 $0x1, %ymm0, %xmm1
    vaddps %xmm1, %xmm0, %xmm0

It thus seems that optimization rules are in place to make
__builtin_ia32_vextractf128_ps256(v, 0) a no-op; nevertheless, in most
cases a vmovaps is still generated (or rather, not optimized away).
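
Since the two statement orderings should compute the same value, any difference between them is purely a code generation matter. A minimal sketch of a reference that could be used to check this follows. It uses only the GCC vector extensions and memcpy, so it builds without -mavx. The names half_ref and add_ref are hypothetical and do not come from GCC or from the test case, and the sketch assumes the low 128 bits hold elements 0 to 3 of the vector, as on x86.

```c
#include <string.h>

typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical helper: copy one 128-bit half of *v.
   index 0 selects elements 0..3, index 1 selects elements 4..7
   (assumed to match the low/high halves on x86). */
static v4sf half_ref(const v8sf *v, int index)
{
    v4sf out;
    memcpy(&out, (const char *)v + index * sizeof(v4sf), sizeof out);
    return out;
}

/* Hypothetical reference for add(): sum of the low and high halves.
   Takes a pointer so that no 256-bit vector is passed by value
   when AVX is not enabled. */
static v4sf add_ref(const v8sf *v)
{
    v4sf a = half_ref(v, 0);
    v4sf b = half_ref(v, 1);
    return a + b;
}
```

Comparing the result of add() element by element against add_ref() for both orderings would confirm that only the generated instructions differ, not the computed value.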