The following testcase
#include <emmintrin.h>

typedef union {
    __m128i v;
    int m[4];
} VectorUnion;

VectorUnion one()
{
    VectorUnion r = { _mm_set1_epi32(1) };
    return r;
}

int main()
{
    VectorUnion x = one();
    if (0xffff == _mm_movemask_epi8(_mm_cmpeq_epi32(x.v, x.v))) {
        return 0;
    }
    return 1;
}
compiles (-Wall -Wextra -O2 -mssse3) to
00000000004004d0 <main>:
  4004d0:  66 0f 6f 05 38 01 00 00   movdqa   0x138(%rip),%xmm0
  4004d8:  66 0f 7f 44 24 d8         movdqa   %xmm0,-0x28(%rsp)
  4004de:  48 8b 44 24 d8            mov      -0x28(%rsp),%rax
  4004e3:  48 89 44 24 e8            mov      %rax,-0x18(%rsp)
  4004e8:  48 8b 44 24 e0            mov      -0x20(%rsp),%rax
  4004ed:  48 89 44 24 f0            mov      %rax,-0x10(%rsp)
  4004f2:  66 0f 6f 44 24 e8         movdqa   -0x18(%rsp),%xmm0
  4004f8:  66 0f 76 c0               pcmpeqd  %xmm0,%xmm0
  4004fc:  66 0f d7 c0               pmovmskb %xmm0,%eax
As can be seen, the xmm0 register is stored to the stack, the stored value is
then copied to a second stack slot via two 64-bit moves, and finally it is
loaded back into xmm0 from that copy. The values on the stack are not used
later on.
I expected gcc to recognize those moves as redundant and produce code like
movdqa 0x138(%rip),%xmm0
pcmpeqd %xmm0,%xmm0
pmovmskb %xmm0,%eax
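
To help narrow down whether the union copy is what triggers the extra stack
traffic, the same computation could be compared against a variant that returns
the __m128i directly instead of wrapping it in the union. The following is only
a sketch for such a comparison: the names one_vec and all_lanes_equal are
hypothetical and not part of the testcase above, and I have not checked what
code any gcc version emits for it.

```c
#include <emmintrin.h>

/* Hypothetical helper: same constant as one() in the testcase,
   but returned as a plain __m128i instead of inside a union. */
static __m128i one_vec(void)
{
    return _mm_set1_epi32(1);
}

/* Hypothetical helper: performs the same comparison as main() in the
   testcase. Returns 1 if all 16 mask bits are set, 0 otherwise. */
int all_lanes_equal(void)
{
    __m128i x = one_vec();
    return 0xffff == _mm_movemask_epi8(_mm_cmpeq_epi32(x, x));
}
```

Compiling this with the same flags and inspecting the disassembly (for example
with objdump -d) next to the output above would show whether the stack
round-trip is tied to the union.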
--
Summary: missed optimization when using union of __m128i and int[4]
Product: gcc
Version: 4.3.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: kretz at kde dot org
GCC build triplet: x86_64-unknown-linux-gnu
GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40122