http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174
Bug #: 54174
Summary: Missed optimization: Unnecessary vmovaps generated for
__builtin_ia32_vextractf128_ps256(v, 0)
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: [email protected]
ReportedBy: [email protected]
Pasting the following test code into test.c and compiling it with
gcc -Wall -O -mavx -S test.c
----
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));
v4sf add(v8sf v)
{
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  return a + b;
}
----
makes gcc generate the following code:
vmovaps %xmm0, %xmm1
vextractf128 $0x1, %ymm0, %xmm0
vaddps %xmm0, %xmm1, %xmm0
However, if the statements for a and b are swapped, i.e.
v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
then gcc is able to optimize away the vmovaps instruction:
vextractf128 $0x1, %ymm0, %xmm1
vaddps %xmm1, %xmm0, %xmm0
It thus seems that optimization rules are in place to make
__builtin_ia32_vextractf128_ps256(v, 0) a no-op. Regardless of this, a
vmovaps is still generated (or perhaps rather not optimized away) in most cases.
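
For reference, here is a minimal sketch of what add() is expected to compute,
written with plain GCC vector subscripting instead of the AVX builtin. The
helper name add_ref and its pointer-based signature are assumptions made for
illustration only and are not part of the test case above. The low half of v
is simply its first four lanes, which is why extracting lane 0 should need no
instruction once v is already in %ymm0.

```c
#include <stddef.h>

typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical reference helper (not from the test case): adds the low
   and high 128-bit halves of *v lane by lane and stores the sum in *out.
   Pointers are used instead of by-value vectors so that the file also
   compiles without -mavx. */
void add_ref(const v8sf *v, v4sf *out)
{
  size_t i;

  for (i = 0; i < 4; i++)
    (*out)[i] = (*v)[i] + (*v)[i + 4];  /* lanes 0..3 plus lanes 4..7 */
}
```

Both orderings of the a and b statements should agree with this helper. The
orderings are expected to differ only in the generated assembly, not in the
result.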