https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95483

--- Comment #2 from Thiago Macieira <thiago at kde dot org> ---
Hello Evan

I was about to report that _mm_loadu_epi16 is missing, but I'm glad you've got
a more complete listing. FYI, here's a Godbolt link showing ICC and Clang with
this intrinsic: https://gcc.godbolt.org/z/8nMcPE. I'll only have to report to
Microsoft and will reference this bug report so they check their own
implementation.

FYI, for anyone stumbling upon this report because their code failed to
compile: most of the missing intrinsics can be worked around by combining one
or more existing intrinsics, and the workaround results in the same generated
code.

(In reply to Evan Nemerson from comment #0)
> Here is the list:
> 
>   AVX _mm256_cvtsi256_si32
>   AVX-512 _mm512_cvtsi512_si32

_mm256_extract_epi32
 or
_mm_cvtsi128_si32(_mm256_castsi256_si128(x))

Ditto for 512-bit.
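
A minimal sketch of the cast-based form, in case it helps. The my_ prefixed
name and the check functions are made up for this example, and the target
attribute plus __builtin_cpu_supports are GCC/Clang extensions I am assuming
are available so that it builds without -mavx:

```c
#include <immintrin.h>

/* Hypothetical stand-in for the missing _mm256_cvtsi256_si32. */
__attribute__((target("avx")))
static inline int my_mm256_cvtsi256_si32(__m256i x)
{
    /* The cast is free; the low 32 bits are then moved to a GPR. */
    return _mm_cvtsi128_si32(_mm256_castsi256_si128(x));
}

__attribute__((target("avx")))
static int low_lane_avx(void)
{
    /* _mm256_set_epi32 takes the highest lane first, so lane 0 holds 1. */
    __m256i v = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
    return my_mm256_cvtsi256_si32(v);
}

/* Returns the low lane (1), or 1 without running AVX code if the CPU
   does not support AVX. */
int low_lane_check(void)
{
    if (!__builtin_cpu_supports("avx"))
        return 1;
    return low_lane_avx();
}
```

The 512-bit variant would presumably follow the same pattern with
_mm512_castsi512_si128.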

>   AVX2 _mm_broadcastsd_pd

If using AVX2 is acceptable, one can use _mm_broadcastq_epi64 with suitable
casting between __m128i and __m128d.
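
As a rough sketch of that casting, assuming GCC or Clang (for the target
attribute and __builtin_cpu_supports) and using a made-up my_ prefixed name
rather than the real intrinsic:

```c
#include <immintrin.h>

/* Hypothetical stand-in for the missing _mm_broadcastsd_pd. */
__attribute__((target("avx2")))
static inline __m128d my_mm_broadcastsd_pd(__m128d a)
{
    /* The casts only reinterpret the register; they emit no instructions. */
    return _mm_castsi128_pd(_mm_broadcastq_epi64(_mm_castpd_si128(a)));
}

__attribute__((target("avx2")))
static int broadcast_avx2(void)
{
    double out[2];
    /* _mm_set_pd takes the high element first: high = 2.0, low = 1.5. */
    __m128d v = my_mm_broadcastsd_pd(_mm_set_pd(2.0, 1.5));
    _mm_storeu_pd(out, v);
    return out[0] == 1.5 && out[1] == 1.5;
}

/* Returns 1 if the low element was broadcast to both lanes, 0 if not,
   and -1 if the CPU has no AVX2 and nothing was run. */
int broadcast_check(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return broadcast_avx2();
}
```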

>   AVX2 _mm_broadcastsi128_si256

Looks like a typo in the list: this one exists with the _mm256 prefix, as
_mm256_broadcastsi128_si256, which is the name it should have.

>   AVX-512 _mm512_storeu_epi16
>   AVX-512 _mm512_storeu_epi8
>   AVX-512 _mm256_storeu_epi16
>   AVX-512 _mm256_storeu_epi8
>   AVX-512 _mm_storeu_epi16
>   AVX-512 _mm_storeu_epi8
>   AVX-512 _mm512_loadu_epi16
>   AVX-512 _mm512_loadu_epi8
>   AVX-512 _mm256_loadu_epi16
>   AVX-512 _mm256_loadu_epi8
>   AVX-512 _mm_loadu_epi16
>   AVX-512 _mm_loadu_epi8
>   AVX-512 _mm256_store_epi32
>   AVX-512 _mm_store_epi32
>   AVX-512 _mm256_loadu_epi64
>   AVX-512 _mm256_loadu_epi32
>   AVX-512 _mm_loadu_epi64
>   AVX-512 _mm_loadu_epi32
>   AVX-512 _mm256_load_epi64
>   AVX-512 _mm256_load_epi32
>   AVX-512 _mm_load_epi64
>   AVX-512 _mm_load_epi32

All of these can be implemented as the mask (for storing) or maskz (for
loading) equivalents with a mask of ~0 (UINT64_MAX for the epi8 ones). For
example
  _mm256_loadu_epi16(ptr)
becomes
  _mm256_maskz_loadu_epi16(~0, ptr)
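
Here is a sketch of both directions for the 256-bit epi16 case. The my_
prefixed wrappers and the check functions are names invented for this example;
the target attribute and __builtin_cpu_supports are GCC/Clang extensions that I
am assuming here so the file builds without -mavx512bw -mavx512vl:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the missing _mm256_loadu_epi16. */
__attribute__((target("avx512bw,avx512vl")))
static inline __m256i my_mm256_loadu_epi16(const void *ptr)
{
    /* All 16 mask bits set, so every 16-bit lane is loaded. */
    return _mm256_maskz_loadu_epi16((__mmask16)~0u, ptr);
}

/* Hypothetical stand-in for the missing _mm256_storeu_epi16. */
__attribute__((target("avx512bw,avx512vl")))
static inline void my_mm256_storeu_epi16(void *ptr, __m256i v)
{
    _mm256_mask_storeu_epi16(ptr, (__mmask16)~0u, v);
}

__attribute__((target("avx512bw,avx512vl")))
static int roundtrip_avx512(void)
{
    uint16_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint16_t)(i * 1000 + 7);
    memset(out, 0, sizeof out);
    my_mm256_storeu_epi16(out, my_mm256_loadu_epi16(in));
    return memcmp(in, out, sizeof in) == 0;
}

/* Returns 1 if the round trip preserved the data, 0 if not, and -1 if
   the CPU lacks AVX512BW or AVX512VL and nothing was run. */
int roundtrip_check(void)
{
    if (!__builtin_cpu_supports("avx512bw") ||
        !__builtin_cpu_supports("avx512vl"))
        return -1;
    return roundtrip_avx512();
}
```

The other widths and element sizes should differ only in the mask type
(__mmask8 up to __mmask64) and the intrinsic name.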

>   AVX-512 _mm_cvtsd_i32
>   AVX-512 _mm_cvtsd_i64
>   AVX-512 _mm_cvtss_i32
>   AVX-512 _mm_cvtss_i64
>   AVX-512 _mm_cvti32_sd
>   AVX-512 _mm_cvti64_sd
>   AVX-512 _mm_cvti32_ss
>   AVX-512 _mm_cvti64_ss

Not sure why those are needed; they generate the same instructions as the
corresponding _mm_cvtsX_siYY intrinsics. Clang's header even defines them as
plain aliases:

#define _mm_cvtss_i32 _mm_cvtss_si32
#define _mm_cvtsd_i32 _mm_cvtsd_si32
#define _mm_cvti32_sd _mm_cvtsi32_sd
#define _mm_cvti32_ss _mm_cvtsi32_ss
#ifdef __x86_64__
#define _mm_cvtss_i64 _mm_cvtss_si64
#define _mm_cvtsd_i64 _mm_cvtsd_si64
#define _mm_cvti64_sd _mm_cvtsi64_sd
#define _mm_cvti64_ss _mm_cvtsi64_ss
#endif

ICC does the same.

>   SSE _mm_storeu_si16
>   SSE2 _mm_storeu_si32

With casting of the pointer:
*dest = _mm_cvtsi128_si16(mm)

If the casting is too scary or triggers aliasing warnings, then:

  uintXX_t val = _mm_cvtsi128_siXX(mm);
  memcpy(dest, &val, sizeof(val));

GCC optimises the memcpy and reg-reg MOVD into a single MOVD into memory.
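
Spelled out as a small sketch using only SSE2. The my_ prefixed names and
store_demo are invented for this example, and for the 16-bit case I am
assuming that truncating the result of _mm_cvtsi128_si32 is an acceptable way
to get the low 16 bits:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the missing _mm_storeu_si32. */
static inline void my_storeu_si32(void *dest, __m128i mm)
{
    uint32_t val = (uint32_t)_mm_cvtsi128_si32(mm);
    memcpy(dest, &val, sizeof(val));
}

/* Hypothetical stand-in for the missing _mm_storeu_si16. */
static inline void my_storeu_si16(void *dest, __m128i mm)
{
    /* Take the low 32 bits and keep only the low 16 of them. */
    uint16_t val = (uint16_t)_mm_cvtsi128_si32(mm);
    memcpy(dest, &val, sizeof(val));
}

/* Stores to unaligned addresses and checks that only the intended bytes
   changed; returns 1 on success. */
int store_demo(void)
{
    unsigned char buf[8];
    uint32_t got32;
    uint16_t got16;
    __m128i v = _mm_set_epi32(0, 0, 0, 0x11223344);

    memset(buf, 0xAA, sizeof buf);
    my_storeu_si32(buf + 1, v);   /* bytes 1..4 */
    my_storeu_si16(buf + 5, v);   /* bytes 5..6 */
    memcpy(&got32, buf + 1, sizeof got32);
    memcpy(&got16, buf + 5, sizeof got16);

    return got32 == 0x11223344u && got16 == 0x3344u &&
           buf[0] == 0xAA && buf[7] == 0xAA;
}
```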

>   SSE _mm_loadu_si16
>   SSE2 _mm_loadu_si32

Ditto, going the other way with _mm_cvtsiXX_si128.
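
A sketch of the load direction, again SSE2 only. The my_ prefixed names and
load_demo are invented for this example; the 16-bit version relies on
zero-extending the value before _mm_cvtsi32_si128, which I am assuming matches
the upper-bits-zeroed behaviour of the real intrinsic:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the missing _mm_loadu_si32. */
static inline __m128i my_loadu_si32(const void *src)
{
    uint32_t val;
    memcpy(&val, src, sizeof(val));
    return _mm_cvtsi32_si128((int)val);   /* upper 96 bits are zeroed */
}

/* Hypothetical stand-in for the missing _mm_loadu_si16. */
static inline __m128i my_loadu_si16(const void *src)
{
    uint16_t val;
    memcpy(&val, src, sizeof(val));
    return _mm_cvtsi32_si128(val);        /* zero-extended to 32 bits first */
}

/* Returns 1 if a and b are bitwise identical. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Loads from an unaligned address and compares against the expected
   vectors; returns 1 on success. */
int load_demo(void)
{
    unsigned char buf[8];
    const uint32_t word = 0x11223344u;

    memset(buf, 0xAA, sizeof buf);
    memcpy(buf + 1, &word, sizeof word);   /* unaligned source */

    return same_bits(my_loadu_si32(buf + 1),
                     _mm_set_epi32(0, 0, 0, 0x11223344)) &&
           same_bits(my_loadu_si16(buf + 1),
                     _mm_set_epi32(0, 0, 0, 0x3344));
}
```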
