https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
BTW, if we *are* using vpmovwb, it supports a memory operand.  It doesn't save
any front-end uops on Skylake-AVX512, just code size.  Unless it leads to less
efficient packing in the uop cache (since all uops from one instruction have to
go in the same line), it should be better to fold the stores than to use
separate store instructions.

        vpmovwb %zmm0,    (%rcx)
        vpmovwb %zmm1, 32(%rcx)

is 6 fused-domain uops (2 p5 shuffle uops per vpmovwb, plus 2 micro-fused
stores), according to IACA.

It's possible to coax gcc into emitting the memory-destination form with
intrinsics, but only with a -1 mask:

// https://godbolt.org/g/SBZX1W
// needs #include <immintrin.h>
void vpmovwb(__m512i a, char *p) {
  _mm256_storeu_si256((__m256i *)p, _mm512_cvtepi16_epi8(a));
}
        vpmovwb %zmm0, %ymm0
        vmovdqu64       %ymm0, (%rdi)
        ret

void vpmovwb_store(__m512i a, char *p) {
  _mm512_mask_cvtepi16_storeu_epi8(p, -1, a);
}
        vpmovwb %zmm0, (%rdi)
        ret

clang is the same here, not using a memory destination unless you hand-hold it
with a -1 mask.


Also note the lack of vzeroupper here, and in the auto-vectorized function,
even with an explicit -mvzeroupper.
