http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572
Bug #: 52572
Summary: suboptimal assignment to avx element
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: [email protected]
ReportedBy: [email protected]

For the following program:

#include <x86intrin.h>
__m256d f(__m256d x){
  x[0]=0;
  return x;
}

gcc -O3 generates:
vmovlpd .LC0(%rip), %xmm0, %xmm1
vinsertf128 $0x0, %xmm1, %ymm0, %ymm0
or with -Os:
vxorps %xmm2, %xmm2, %xmm2
vmovsd %xmm2, %xmm0, %xmm1
vinsertf128 $0x0, %xmm1, %ymm0, %ymm0
If I understand correctly, it first constructs {0,x[1],0,0} in ymm1 (the
VEX-encoded move zeroes the upper 128 bits of its destination) and then uses
vinsertf128 to merge it with the upper half of x. However, using the legacy
(non-VEX) movlpd instruction would leave the upper 128 bits of ymm0 untouched,
and thus the vinsertf128 wouldn't be needed.
Is there a policy not to generate the non-VEX instructions anymore, or is this
a missed optimization?
Setting x[1] is similar. For x[2] or x[3], we get extract+mov+insert, but it
might be better to do something with vblendpd.
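
To make the vblendpd idea concrete, here is a rough sketch using the _mm256_blend_pd intrinsic, which selects each element from one of two source vectors according to an immediate mask. The helper name zero_elem2 and the array-based interface are only for illustration and are not part of the testcase above; the target attribute is there so the snippet builds without -mavx, and it assumes the CPU actually supports AVX. This shows the instruction one might hope for, not what gcc currently emits.

```c
#include <immintrin.h>

/* Hypothetical helper (not from the testcase): zero element 2 of a
   4-double vector with a single blend against a zero vector, instead
   of an extract + mov + insert sequence. */
__attribute__((target("avx")))
void zero_elem2(double v[4])
{
    __m256d x = _mm256_loadu_pd(v);
    /* Bit i of the immediate picks element i from the second operand
       (here the zero vector); all other elements come from x. */
    x = _mm256_blend_pd(x, _mm256_setzero_pd(), 1 << 2);
    _mm256_storeu_pd(v, x);
}
```

The same pattern with an immediate of 1 << 0 or 1 << 1 would cover the x[0] and x[1] cases, provided a zero vector is available in a register.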