http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54422
Bug #: 54422
Summary: Merge adjacent stores of elements of a vector (or loads)
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: tree-optimization
AssignedTo: [email protected]
ReportedBy: [email protected]
Target: x86_64-linux-gnu
Hello,
#include <x86intrin.h>

/* Element-wise stores: this is the case I would like to see merged.  */
void f1(__m128d *dd, __m128d e){
  double *d = (double*)dd;
  d[0] = e[0];
  d[1] = e[1];
}
/* Explicit unaligned vector store.  */
void f2(__m128d *dd, __m128d e){
  _mm_storeu_pd((double*)dd, e);
}
/* memcpy of the whole vector.  */
void f3(__m128d *dd, __m128d e){
  __builtin_memcpy(dd, &e, 16);
}
For this code, gcc -O3 -mavx2 generates:
for f2:
vmovupd %xmm0, (%rdi)
(it could arguably have deduced that the destination is 16-byte aligned, since
dd is a __m128d*, and used an aligned store, but that is not my concern today)
for f1:
vmovlpd %xmm0, (%rdi)
vmovhpd %xmm0, 8(%rdi)
(this is my main issue: could it merge those two stores into a single vmovupd?)
for f3:
vmovdqa %xmm0, -40(%rsp)
movq -40(%rsp), %rax
vmovapd %xmm0, -24(%rsp)
movq %rax, (%rdi)
movq -16(%rsp), %rax
movq %rax, 8(%rdi)
(I hope the SSE memcpy patch at
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html will eventually help
with that)
At tree level, for f1, we have:
_3 = BIT_FIELD_REF <e_5(D), 64, 0>;
MEM[(double *)dd_1(D)] = _3;
_6 = BIT_FIELD_REF <e_5(D), 64, 64>;
MEM[(double *)dd_1(D) + 8B] = _6;
Merging those two stores looks like it might be possible at this level (though
I am not familiar with that part of the compiler; maybe only the backend can
handle it). Note that I am interested in both the aligned and the unaligned
case (i.e. when f1 takes a double* argument instead of a __m128d*), and in
loads as well as stores; a quick sketch of both variants follows below.
The most relevant other bugs I found are PR 41464, PR 23684, and PR 47059.