https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64749
Bug ID: 64749
Summary: "truncating" instructions generated instead of a load one using SSE & AVX2 intrinsics
Product: gcc
Version: 4.8.4
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: adrien at guinet dot me

Created attachment 34553
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34553&action=edit
test case

The attached code compiles and runs fine (that is, the program produces the
correct output) with GCC 4.9. When compiled with GCC 4.8, the output is
different and incorrect.

With GCC 4.8, a truncation of sorts is introduced at the beginning of the
loop (in f2). Here is the relevant assembly code (output of GCC 4.8):

    xor     eax, eax
    mov     rbp, rsp
    and     rsp, 0FFFFFFFFFFFFFFE0h
    vbroadcastss ymm3, xmm6
    add     rsp, 10h
    nop     dword ptr [rax]
loc_400970:
    vpmovzxwd ymm4, xmmword ptr [rdx+rax*4]
    vpmovzxwd ymm2, xmmword ptr [rcx+rax*4]
    vmovdqa [rsp-8+var_28], ymm4
    ; the truncation is done here
    vmovdqa xmm5, xmmword ptr [rsp-8+var_28]
    vpmulld ymm0, ymm4, ymm2
    ; xmm5 is used below, but it does not hold the right value.
    ; xmm5 and ymm4 should be set with something like this (as GCC 4.9 does):
    ;   vmovdqa xmm5, xmmword ptr [rdx+rax*4]
    ;   vpmovzxwd ymm4, xmm5
    vpmulhuw xmm1, xmm5, xmmword ptr [r8+rax*4]
    vpmovzxwd ymm1, xmm1
    vpmulld ymm1, ymm1, ymm3
    vpsubd  ymm0, ymm0, ymm1
    vmovdqa xmmword ptr [rsi+rax*4], xmm0
    add     rax, 8
    cmp     rdi, rax
    ja      short loc_400970

GCC 4.9 behaves correctly and generates this assembly code:

    vbroadcastss ymm3, dword ptr [rbp-14h]
    xor     eax, eax
    nop     dword ptr [rax+00h]
loc_4009A8:
    vmovdqa xmm0, xmmword ptr [rdx+rax*4]        ; 128-bit load
    vpmulhuw xmm2, xmm0, xmmword ptr [r8+rax*4]  ; correctly uses xmm0
    vpmovzxwd ymm2, xmm2                         ; 16->32 bit conversion here
    vpmulld ymm2, ymm2, ymm3
    vpmovzxwd ymm1, xmm0
    vpmovzxwd ymm0, xmmword ptr [rcx+rax*4]
    vpmulld ymm0, ymm1, ymm0
    vpsubd  ymm0, ymm0, ymm2
    vmovaps xmmword ptr [rsi+rax*4], xmm0
    add     rax, 8
    cmp     rdi, rax
    ja      short loc_4009A8

Thanks for any help with this!

P.S.: sorry, I did not manage to produce a shorter test case :/
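
To make the effect of the reload in the GCC 4.8 listing concrete, here is a minimal scalar sketch in C. It is not taken from the attached test case. It only models, under my reading of the two listings, what vpmovzxwd and vpmulhuw compute on one group of eight 16-bit words, and what vpmulhuw sees when its first operand is the low 128 bits of the already widened register. All function names (zero_extend_words, mulhi_u16, low_128_as_words, mulhi_expected, mulhi_gcc48) are hypothetical helpers introduced for illustration.

```c
#include <stdint.h>

/* Scalar model of vpmovzxwd: zero-extend eight 16-bit words
   to eight 32-bit dwords. */
static void zero_extend_words(const uint16_t in[8], uint32_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = in[i];
}

/* Scalar model of vpmulhuw on a 128-bit register: high 16 bits
   of each unsigned 16x16-bit product. */
static void mulhi_u16(const uint16_t a[8], const uint16_t b[8],
                      uint16_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = (uint16_t)(((uint32_t)a[i] * b[i]) >> 16);
}

/* Models storing the widened ymm register and reloading only its low
   128 bits as an xmm register: the first four dwords are reinterpreted
   as eight words in x86 little-endian order (low word first). */
static void low_128_as_words(const uint32_t widened[8], uint16_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[2 * i]     = (uint16_t)(widened[i] & 0xFFFFu);
        out[2 * i + 1] = (uint16_t)(widened[i] >> 16);
    }
}

/* GCC 4.9 pattern: vpmulhuw operates on the raw 128-bit load. */
void mulhi_expected(const uint16_t a[8], const uint16_t c[8],
                    uint16_t out[8])
{
    mulhi_u16(a, c, out);
}

/* GCC 4.8 pattern: vpmulhuw operates on the low half of the
   zero-extended data, so the words are w0,0,w1,0,w2,0,w3,0. */
void mulhi_gcc48(const uint16_t a[8], const uint16_t c[8],
                 uint16_t out[8])
{
    uint32_t widened[8];
    uint16_t reloaded[8];

    zero_extend_words(a, widened);
    low_128_as_words(widened, reloaded);
    mulhi_u16(reloaded, c, out);
}
```

In this model only element 0 is guaranteed to agree between the two variants. Every odd element of the GCC 4.8 variant is multiplied from a zero word, and words w4..w7 of the input never reach vpmulhuw at all, which would be consistent with the incorrect output described above.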