[Bug target/119368] immintrin code running slower with gcc than clang

amonakov at gcc dot gnu.org via Gcc-bugs Wed, 19 Mar 2025 04:35:37 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368


Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
I think it is not "splitting [...] to 256-bit loads" but rather splitting an
SSE4.1 extending load into a full-width load followed by extension of the
just-loaded vector (throwing away half of the vector). The corresponding
intrinsics and built-ins are completely misdesigned, as they require writing
code as if a full vector were loaded from memory:

#include <immintrin.h>

__m128i f(__m128i *x)
{
    return _mm_cvtepi16_epi32(*x);
}

Compiled with -O2 -msse4.1, this minimal testcase demonstrates the fundamental
issue. LLVM manages to fold the load, producing pmovsxwd (but with
better-designed intrinsics the effort on the compiler side would be smaller).
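
For illustration only, here is a sketch of how the source can express that just
64 bits are needed, which is all that pmovsxwd reads from memory. The helper
name load4_s16_to_s32 is hypothetical, and the target attribute is only there
so the snippet builds without -msse4.1. Whether a given compiler folds the two
intrinsics into a single pmovsxwd with a memory operand is not guaranteed and
should be checked in the generated assembly.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not from the bug report): sign-extend the four
   int16_t values at p into four int32_t lanes. */
__attribute__((target("sse4.1")))
__m128i load4_s16_to_s32(const int16_t *p)
{
    /* Load only the low 8 bytes; the upper half of the register is zeroed. */
    __m128i lo = _mm_loadl_epi64((const __m128i *)p);
    /* Sign-extend the four 16-bit lanes in the low half to 32 bits. */
    return _mm_cvtepi16_epi32(lo);
}
```

Unlike f() above, this version never dereferences a full 16-byte __m128i, so
it also does not demand 16-byte alignment or 16 readable bytes at p.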
