https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
BTW, if we *are* using vpmovwb, it supports a memory operand.  It doesn't save
any front-end uops on Skylake-avx512, just code-size.  Unless it means less
efficient packing in the uop cache (since all uops from one instruction have to
go in the same line), it should be better to fold the stores than to use
separate store instructions.

    vpmovwb  %zmm0, (%rcx)
    vpmovwb  %zmm1, 32(%rcx)

is 6 fused-domain uops (2 * 2 p5 shuffle uops + 2 micro-fused stores),
according to IACA.

It's possible to coax gcc into emitting vpmovwb with a memory destination from
intrinsics, but only with a -1 mask:

// https://godbolt.org/g/SBZX1W
void vpmovwb(__m512i a, char *p) {
    _mm256_storeu_si256((__m256i*)p, _mm512_cvtepi16_epi8(a));
}
        vpmovwb  %zmm0, %ymm0
        vmovdqu64  %ymm0, (%rdi)
        ret

void vpmovwb_store(__m512i a, char *p) {
    _mm512_mask_cvtepi16_storeu_epi8(p, -1, a);
}
        vpmovwb  %zmm0, (%rdi)
        ret

clang is the same here: it won't use a memory destination unless you hand-hold
it with a -1 mask.

Also note the lack of vzeroupper here, and in the auto-vectorized function,
even with an explicit -mvzeroupper.