https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
There's __m128i _mm_loadl_epi64 (__m128i const* mem_addr)
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq&expand=5450,4247,3115&techs=SSE2),
which gcc makes available in 32-bit mode.

This does solve the correctness problem for 32-bit, but gcc still compiles it
to a separate vmovq followed by a vpmovzxbd %xmm,%ymm, rather than folding the
load into a memory operand.  (Using _mm_loadu_si128 still optimizes away to
vpmovzxbd (%eax), %ymm0.)
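
For reference, here is a minimal sketch of the two load variants being compared. The function names (zext8_loadl, zext8_loadu, zext8_store) are made up for illustration and are not taken from the godbolt link; the target attribute is only there so the snippet builds without -mavx2. The asm you get will depend on gcc version and options.

```c
#include <immintrin.h>
#include <stdint.h>

/* Variant 1: _mm_loadl_epi64 reads exactly 8 bytes (movq) and is
 * available in 32-bit mode.  Per the comment above, gcc emitted a
 * separate vmovq before the vpmovzxbd for this form. */
__attribute__((target("avx2")))
static __m256i zext8_loadl(const uint8_t *p)
{
    __m128i lo = _mm_loadl_epi64((const __m128i *)p);
    return _mm256_cvtepu8_epi32(lo);   /* vpmovzxbd: 8 x u8 -> 8 x u32 */
}

/* Variant 2: _mm_loadu_si128 nominally reads 16 bytes.  Per the comment
 * above, gcc folds it into the memory operand of vpmovzxbd, but if the
 * load is not folded (e.g. at -O0) all 16 bytes really are read. */
__attribute__((target("avx2")))
static __m256i zext8_loadu(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm256_cvtepu8_epi32(v);
}

/* Hypothetical helper: store the widened result so callers built
 * without AVX2 enabled only deal with plain pointers. */
__attribute__((target("avx2")))
static void zext8_store(const uint8_t *src, uint32_t out[8], int use_loadu)
{
    __m256i v = use_loadu ? zext8_loadu(src) : zext8_loadl(src);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Both variants should produce the same eight zero-extended values; the difference is only in how many bytes the source nominally touches and in what asm gcc generates. Callers should check for AVX2 support before calling zext8_store.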

https://godbolt.org/g/Zuf26P
