https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483
--- Comment #2 from Thiago Macieira <thiago at kde dot org> ---
Hello Evan

I was about to report that _mm_loadu_epi16 is missing, but I'm glad you've got a more complete listing. FYI, here's a Godbolt link showing ICC and Clang with this intrinsic: https://gcc.godbolt.org/z/8nMcPE. I'll only have to report to Microsoft and will reference this bug report so they check their own implementation.

FYI, for anyone stumbling upon this report when their code failed: most of the missing intrinsics can be worked around by combining one or more existing intrinsics, which results in the same generated code.

(In reply to Evan Nemerson from comment #0)
> Here is the list:
>
> AVX _mm256_cvtsi256_si32
> AVX-512 _mm512_cvtsi512_si32

_mm256_extract_epi32 or _mm_cvtsi128_si32(_mm256_castsi256_si128(x))

Ditto for 512-bit.

> AVX2 _mm_broadcastsd_pd

If using AVX2 is acceptable, one can use _mm_broadcastq_epi64 with suitable casting between __m128i and __m128d.

> AVX2 _mm_broadcastsi128_si256

Looks like a typo; this one exists with the _mm256 prefix (_mm256_broadcastsi128_si256), which is how it should be spelled.

> AVX-512 _mm512_storeu_epi16
> AVX-512 _mm512_storeu_epi8
> AVX-512 _mm256_storeu_epi16
> AVX-512 _mm256_storeu_epi8
> AVX-512 _mm_storeu_epi16
> AVX-512 _mm_storeu_epi8
> AVX-512 _mm512_loadu_epi16
> AVX-512 _mm512_loadu_epi8
> AVX-512 _mm256_loadu_epi16
> AVX-512 _mm256_loadu_epi8
> AVX-512 _mm_loadu_epi16
> AVX-512 _mm_loadu_epi8
> AVX-512 _mm256_store_epi32
> AVX-512 _mm_store_epi32
> AVX-512 _mm256_loadu_epi64
> AVX-512 _mm256_loadu_epi32
> AVX-512 _mm_loadu_epi64
> AVX-512 _mm_loadu_epi32
> AVX-512 _mm256_load_epi64
> AVX-512 _mm256_load_epi32
> AVX-512 _mm_load_epi64
> AVX-512 _mm_load_epi32

All of these can be implemented as the mask (for storing) or maskz (for loading) equivalents with a mask of ~0 (UINT64_MAX for the epi8 ones). For example:

_mm256_loadu_epi16(ptr)

becomes

_mm256_maskz_loadu_epi16(~0, ptr)

> AVX-512 _mm_cvtsd_i32
> AVX-512 _mm_cvtsd_i64
> AVX-512 _mm_cvtss_i32
> AVX-512 _mm_cvtss_i64
> AVX-512 _mm_cvti32_sd
> AVX-512 _mm_cvti64_sd
> AVX-512 _mm_cvti32_ss
> AVX-512 _mm_cvti64_ss

Not sure why those are needed; they generate the same instruction as _mm_cvtsX_siYY. Clang's header even defines them as plain aliases:

#define _mm_cvtss_i32 _mm_cvtss_si32
#define _mm_cvtsd_i32 _mm_cvtsd_si32
#define _mm_cvti32_sd _mm_cvtsi32_sd
#define _mm_cvti32_ss _mm_cvtsi32_ss
#ifdef __x86_64__
#define _mm_cvtss_i64 _mm_cvtss_si64
#define _mm_cvtsd_i64 _mm_cvtsd_si64
#define _mm_cvti64_sd _mm_cvtsi64_sd
#define _mm_cvti64_ss _mm_cvtsi64_ss
#endif

ICC does the same.

> SSE _mm_storeu_si16
> SSE2 _mm_storeu_si32

With casting of the pointer:

*dest = _mm_cvtsi128_si16(mm)

If the casting is too scary or triggers aliasing warnings, then:

uintXX_t val = _mm_cvtsi128_siXX(mm);
memcpy(dest, &val, sizeof(val));

GCC optimises the memcpy and reg-reg MOVD into a single MOVD into memory.

> SSE _mm_loadu_si16
> SSE2 _mm_loadu_si32

Ditto for the _mm_cvtsiXX_si128.
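
For anyone who wants to copy and paste, here is a minimal sketch of the memcpy-based workaround for the _mm_storeu_si16/si32 and _mm_loadu_si16/si32 cases, assuming only SSE2. The my_* names are hypothetical and not part of any header. The 16-bit variants go through _mm_cvtsi128_si32 and _mm_cvtsi32_si128 with a truncation, since I am not assuming a 16-bit conversion intrinsic is available.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi128_si32, _mm_cvtsi32_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for _mm_storeu_si32: write the low 32 bits of v
 * to a possibly unaligned address, without pointer casts. */
static inline void my_storeu_si32(void *dest, __m128i v)
{
    int32_t val = _mm_cvtsi128_si32(v);
    memcpy(dest, &val, sizeof(val));
}

/* Hypothetical stand-in for _mm_loadu_si32: read 32 bits from a possibly
 * unaligned address into the low lane; the upper 96 bits are zeroed. */
static inline __m128i my_loadu_si32(const void *src)
{
    int32_t val;
    memcpy(&val, src, sizeof(val));
    return _mm_cvtsi32_si128(val);
}

/* Hypothetical stand-in for _mm_storeu_si16: take the low 32 bits and
 * truncate to 16 before copying. */
static inline void my_storeu_si16(void *dest, __m128i v)
{
    uint16_t val = (uint16_t)_mm_cvtsi128_si32(v);
    memcpy(dest, &val, sizeof(val));
}

/* Hypothetical stand-in for _mm_loadu_si16: the 16-bit value is
 * zero-extended, so everything above bit 15 ends up zero. */
static inline __m128i my_loadu_si16(const void *src)
{
    uint16_t val;
    memcpy(&val, src, sizeof(val));
    return _mm_cvtsi32_si128(val);
}
```

Whether the compiler folds each helper into a single memory MOVD depends on the compiler version and optimisation level, so check the generated assembly if that matters to you.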