https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754
--- Comment #9 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Allan Jensen from comment #7)
> This is significantly worse with integer operands.
>
> _mm256_storeu_si256((__m256i *)&data[3],
>     _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
>                      _mm256_loadu_si256((const __m256i *)&data[1]))
> );

Please don't post isolated lines of code, always complete examples ready to be
copy-pasted and compiled. The declaration of data is relevant to the generated
code.

> compiles to:
>
>     vmovdqu 0x20(%rax),%xmm0
>     vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
>     vmovdqu (%rax),%xmm1
>     vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
>     vpaddd %ymm1,%ymm0,%ymm0
>     vmovups %xmm0,0x60(%rax)
>     vextracti128 $0x1,%ymm0,0x70(%rax)

With trunk and -march=skylake (or haswell), I can get

	vmovdqu	data(%rip), %ymm0
	vpaddd	data+32(%rip), %ymm0, %ymm0
	vmovdqu	%ymm0, data+96(%rip)

so this looks fixed?
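
For reference, a complete test case of the kind requested above might look like the following sketch. The declaration of data as a global array of four __m256i is an assumption: it is consistent with the data+32 and data+96 offsets in the trunk output, but the report in comment #7 did not show it. The function name add_and_store is made up for illustration.

```c
#include <immintrin.h>

/* Assumption: a global array of 256-bit integer vectors. The original
   snippet did not include this declaration. */
__m256i data[4];

/* The target attribute is only here so the file also builds without
   -mavx2 or -march=...; with -march=haswell it is redundant. */
__attribute__((target("avx2")))
void add_and_store(void)
{
    /* data[3] = data[0] + data[1], as eight packed 32-bit integers,
       using the unaligned load/store intrinsics from comment #7. */
    _mm256_storeu_si256((__m256i *)&data[3],
        _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
                         _mm256_loadu_si256((const __m256i *)&data[1])));
}
```

Compiling this with something like `gcc -O2 -march=haswell -S` should show the code generated for add_and_store, which can then be compared against the two listings above.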