https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
There's __m128i _mm_loadl_epi64 (__m128i const* mem_addr)
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq&expand=5450,4247,3115&techs=SSE2),
which gcc makes available in 32-bit mode.

This does solve the correctness problem for 32-bit, but gcc still compiles it
to a separate vmovq before a vpmovzxbd %xmm,%ymm. (Using _mm_loadu_si128
still optimizes away to vpmovzxbd (%eax), %ymm0.)

https://godbolt.org/g/Zuf26P
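
For reference, a minimal sketch of the two source-level variants compared above. The function names zext8_loadl and zext8_loadu are made up for illustration and are not taken from the godbolt link. The target("avx2") attribute is only there so the sketch builds without -mavx2, and results are stored to an array instead of returned as __m256i to keep the example ABI-neutral. Which instructions you actually get depends on the gcc version and flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: explicit 64-bit load, then zero-extend 8 bytes
 * to 8 dwords. Only 8 bytes of src are read at the source level. */
__attribute__((target("avx2")))
void zext8_loadl(const uint8_t *src, uint32_t out[8])
{
    __m128i lo = _mm_loadl_epi64((const __m128i *)src); /* movq-style load */
    __m256i v  = _mm256_cvtepu8_epi32(lo);              /* vpmovzxbd */
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hypothetical helper: same conversion, but written with a full 16-byte
 * load. The source asks for 16 bytes, so the caller must guarantee that
 * src[0..15] is readable, even though only the low 8 bytes are used. */
__attribute__((target("avx2")))
void zext8_loadu(const uint8_t *src, uint32_t out[8])
{
    __m128i all = _mm_loadu_si128((const __m128i *)src);
    __m256i v   = _mm256_cvtepu8_epi32(all);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Both functions produce the same 8 dwords for the same first 8 input bytes. The difference this comment is about is purely in code generation: whether the load gets folded into the memory operand of vpmovzxbd.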