https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #9 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Allan Jensen from comment #7)
> This is significantly worse with integer operands.
> 
> _mm256_storeu_si256((__m256i *)&data[3],
>     _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
>                      _mm256_loadu_si256((const __m256i *)&data[1]))
>     );

Please don't post isolated lines of code; always post complete examples that
can be copy-pasted and compiled. The declaration of data is relevant to the
generated code.

> compiles to:
> 
> vmovdqu 0x20(%rax),%xmm0
> vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
> vmovdqu (%rax),%xmm1
> vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
> vpaddd %ymm1,%ymm0,%ymm0
> vmovups %xmm0,0x60(%rax)
> vextracti128 $0x1,%ymm0,0x70(%rax)

With trunk and -march=skylake (or haswell), I can get

        vmovdqu data(%rip), %ymm0
        vpaddd  data+32(%rip), %ymm0, %ymm0
        vmovdqu %ymm0, data+96(%rip)

so this looks fixed?
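
For illustration, a complete test case for the statement quoted from comment #7
might look like the sketch below. The declaration of data is only a guess: the
original was never posted, and the 0x20/0x60 offsets in the quoted assembly
merely suggest 32-byte elements. The function name add_and_store is
hypothetical as well.

```c
#include <immintrin.h>

/* Assumed declaration: four rows of eight 32-bit ints, i.e. 32 bytes per
   row, consistent with the 0x20/0x60 offsets in the quoted assembly.
   The real declaration was not posted, so this is a guess. */
int data[4][8];

/* Hypothetical wrapper around the statement from comment #7.
   The target attribute enables AVX2 for this one function, so the file
   also compiles without -mavx2; it is redundant with -march=haswell. */
__attribute__((target("avx2")))
void add_and_store(void)
{
    /* data[3] = data[0] + data[1], lane by lane, via unaligned 256-bit
       loads and an unaligned 256-bit store. */
    _mm256_storeu_si256((__m256i *)&data[3],
        _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
                         _mm256_loadu_si256((const __m256i *)&data[1])));
}
```

Something like gcc -O2 -march=haswell -S on this file should then show the code
generated for add_and_store. Whether it matches either listing in this thread
depends on the compiler version and options used.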
