http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

             Bug #: 54174
           Summary: Missed optimization: Unnecessary vmovaps generated for
                    __builtin_ia32_vextractf128_ps256(v, 0)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: d...@nimrod.no


Pasting the following test code into test.c and compiling with gcc -Wall -O
-mavx -S test.c

----
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

v4sf add(v8sf v)
{
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  return a + b;
}
----

makes gcc generate the following code:

    vmovaps    %xmm0, %xmm1
    vextractf128    $0x1, %ymm0, %xmm0
    vaddps    %xmm0, %xmm1, %xmm0

However, if the statements for a and b are swapped, i.e.

  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);

then gcc is able to optimize away the vmovaps instruction:

    vextractf128    $0x1, %ymm0, %xmm1
    vaddps    %xmm1, %xmm0, %xmm0

It thus seems that optimization rules are in place to make
__builtin_ia32_vextractf128_ps256(v, 0) a no-op. Nevertheless, in most cases a
vmovaps is still generated (or, perhaps more accurately, not optimized away).
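
For comparison, here is a minimal sketch of the same computation written with
GCC's generic vector subscripting instead of the AVX builtins. The helper names
low_half, high_half and add_generic are hypothetical and not part of the test
case above. It can serve as a reference for the expected result of add(), and
it builds without -mavx (GCC may then print an ABI note about 32-byte vector
arguments).

```c
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical helper: elements 0..3, i.e. what
   __builtin_ia32_vextractf128_ps256(v, 0) selects. */
static v4sf low_half(v8sf v)
{
  return (v4sf){ v[0], v[1], v[2], v[3] };
}

/* Hypothetical helper: elements 4..7, i.e. what
   __builtin_ia32_vextractf128_ps256(v, 1) selects. */
static v4sf high_half(v8sf v)
{
  return (v4sf){ v[4], v[5], v[6], v[7] };
}

/* Same result as add() in the test case: element-wise sum of both halves. */
static v4sf add_generic(v8sf v)
{
  v4sf a = low_half(v);
  v4sf b = high_half(v);
  return a + b;
}
```

Compiling this with the same flags (gcc -Wall -O -mavx -S) and comparing the
output with the listings above might show whether the extra vmovaps is specific
to the builtin. What gcc 4.7.0 emits for this variant has not been checked here.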
