https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think it is not "splitting [...] to 256-bit loads" but rather splitting an
SSE4.1 extending load into a full-width load followed by extension of the
just-loaded vector (throwing away half of the vector).

The corresponding intrinsics and built-ins are completely misdesigned, as they
require writing code as if a full vector were loaded from memory:

#include <immintrin.h>

__m128i f(__m128i *x)
{
    return _mm_cvtepi16_epi32(*x);
}

This minimal testcase demonstrates the fundamental issue with -O2 -msse4.1.
LLVM manages to fold the load, producing pmovsxwd (but with better designed
intrinsics the effort on the compiler side would be smaller).
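
For comparison, here is a minimal sketch of how the narrow load can be requested explicitly with the existing intrinsics: load only the low 64 bits with _mm_loadl_epi64, then extend. The helper name widen4_i16_to_i32 and the scalar fallback for non-x86 targets are assumptions made for illustration and are not part of the report. Whether this form is folded into a single pmovsxwd with a memory operand depends on the compiler and its version.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper (not from the bug report): sign-extend the four
   int16_t values at p into four int32_t values in out. */
__attribute__((target("sse4.1")))
static void widen4_i16_to_i32(const int16_t *p, int32_t out[4])
{
    /* Load only the 8 bytes that the extension actually consumes;
       the upper half of the register is zeroed, not read from memory. */
    __m128i lo = _mm_loadl_epi64((const __m128i *)p);

    /* Sign-extend the four low 16-bit lanes to 32 bits (pmovsxwd). */
    __m128i wide = _mm_cvtepi16_epi32(lo);

    /* Unaligned store, so out needs no special alignment. */
    _mm_storeu_si128((__m128i *)out, wide);
}
#else
/* Scalar fallback with the same semantics for non-x86 targets. */
static void widen4_i16_to_i32(const int16_t *p, int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = p[i];
}
#endif
```

Unlike the testcase above, the source here reads exactly 8 bytes, so it matches the memory access that pmovsxwd performs and does not depend on 16 bytes being readable at p.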