http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

             Bug #: 52572
           Summary: suboptimal assignment to avx element
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: [email protected]
        ReportedBy: [email protected]


For the following program:
#include <x86intrin.h>
__m256d f(__m256d x){
  x[0]=0;
  return x;
}

gcc -O3 generates:
    vmovlpd    .LC0(%rip), %xmm0, %xmm1
    vinsertf128    $0x0, %xmm1, %ymm0, %ymm0
or with -Os:
    vxorps    %xmm2, %xmm2, %xmm2
    vmovsd    %xmm2, %xmm0, %xmm1
    vinsertf128    $0x0, %xmm1, %ymm0, %ymm0

If I understand correctly, gcc first constructs {0,x[1],0,0} in a scratch
register (the VEX-encoded move zeroes the upper 128 bits) and then merges the
result back into x with vinsertf128. The legacy (non-VEX) movlpd instruction
leaves the upper 128 bits untouched, so with it the vinsertf128 wouldn't be
needed.
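
To make the sequence above easier to follow, here is a rough sketch of the
same steps written with intrinsics. The function name zero_elem0_insert and
the pointer-based interface are my own assumptions, chosen only so that the
sketch builds without -mavx; the compiler is free to pick different
instructions than the ones named in the comments.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the report): zero element 0 of the four
   doubles at p, following the extract/merge/insert steps described above. */
__attribute__((target("avx")))
void zero_elem0_insert(double *p)
{
    __m256d x  = _mm256_loadu_pd(p);
    __m128d lo = _mm256_castpd256_pd128(x);   /* low half: {x[0], x[1]} */
    lo = _mm_move_sd(lo, _mm_setzero_pd());   /* {0, x[1]} (vmovsd)     */
    x  = _mm256_insertf128_pd(x, lo, 0);      /* put it back as low half */
    _mm256_storeu_pd(p, x);
}
```

Calling it on {1, 2, 3, 4} should leave {0, 2, 3, 4}.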

Is there a policy not to generate the non-VEX instructions anymore, or is this
a missed optimization?

Setting x[1] is similar. For x[2] or x[3], we get an extract+mov+insert
sequence, but a single vblendpd against a register holding the new value might
be better.
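
As an illustration of the vblendpd idea, the sketch below zeroes one element
with a single blend against a zero vector. The function names and the
pointer-based interface are assumptions made for this example, and whether gcc
actually emits vblendpd for it depends on the version and flags.

```c
#include <immintrin.h>

/* Hypothetical helpers (not from the report). Each bit set in the immediate
   selects that element from the second operand (here, the zero vector). */
__attribute__((target("avx")))
void zero_elem0_blend(double *p)
{
    __m256d x = _mm256_loadu_pd(p);
    x = _mm256_blend_pd(x, _mm256_setzero_pd(), 0x1);  /* element 0 := 0 */
    _mm256_storeu_pd(p, x);
}

__attribute__((target("avx")))
void zero_elem2_blend(double *p)
{
    __m256d x = _mm256_loadu_pd(p);
    x = _mm256_blend_pd(x, _mm256_setzero_pd(), 0x4);  /* element 2 := 0 */
    _mm256_storeu_pd(p, x);
}
```

The same pattern covers all four elements by changing the immediate (0x1,
0x2, 0x4, 0x8), so the lower and upper halves need no special handling.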
