https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
Peter Cordes <peter at cordes dot ca> changed:
What |Removed |Added
----------------------------------------------------------------------------
See Also| |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
Summary|AVX512F instruction costs: |AVX512BW instruction costs:
|vmovdqu8 stores may be an |vpmovwb is 2 uops on
|extra uop, and vpmovwb is 2 |Skylake and not always
|uops on Skylake and not |worth using vs. vpack +
|always worth using |vpermq lane-crossing fixup
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop: IACA was wrong,
and Agner Fog's results were *only* for the masked case. The only remaining
downside is the code-size cost of using EVEX load/store encodings instead of
AVX2 VEX. That's bug 89346.
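To illustrate the size cost (a sketch; byte counts assume a simple [rsi]
addressing mode, since the EVEX prefix is always 4 bytes vs. 2 or 3 for VEX):

  vmovdqu  ymm0, [rsi]    # VEX:  c5 fe 6f 06        (4 bytes)
  vmovdqu8 ymm0, [rsi]    # EVEX: 62 f1 7f 28 6f 06  (6 bytes)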
https://www.uops.info/table.html confirms that on SKX, non-masked vmovdqu8
loads and stores are both a single uop (or the usual micro-fused store-address
+ store-data pair for the store).
https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html
And between registers it can even be mov-eliminated if there's no masking.
But *with* masking, a vmovdqu8 load is a micro-fused load+ALU uop, and a
masked xmm or ymm store is still just a normal store uop. A masked zmm store,
however, is 5 uops (micro-fused to 4 front-end uops)! (Unlike vmovdqu16 or
vmovdqu32 masked stores, which are efficient even for zmm.)
https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html
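Summarizing the measured cases (uop counts for SKX from uops.info; k1 stands
for any mask register):

  vmovdqu8  zmm0{k1}, [rsi]     # 1 micro-fused load+ALU uop
  vmovdqu8  [rdi]{k1}, ymm0     # normal store: xmm/ymm masked stores are fine
  vmovdqu8  [rdi]{k1}, zmm0     # 5 uops (4 front-end): the slow case
  vmovdqu16 [rdi]{k1}, zmm0     # normal masked store, efficient even for zmm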
uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an
*unmasked* ZMM store: IACA thinks that's also 5 uops.
Retitling this bug report since that part was based on Intel's bogus data, not
real testing.
vpmovwb is still 2 uops, and current trunk gcc still uses 2x vpmovwb +
vinserti64x4 for ZMM auto-vec. -mprefer-vector-width=512 is not the default,
but people may enable it in code that heavily uses 512-bit vectors.
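A sketch of the two strategies for zmm word->byte truncation (uop counts for
SKX from uops.info; assumes each word's high byte is already zeroed, e.g.
after a right shift, and zmm7 is a hypothetical constant holding the qword
indices 0,2,4,6,1,3,5,7):

  # what trunk gcc emits:
  vpmovwb      ymm0, zmm0           # 2 uops (p5)
  vpmovwb      ymm1, zmm1           # 2 uops (p5)
  vinserti64x4 zmm0, zmm0, ymm1, 1  # 1 uop (p5)
  # vs. in-lane pack + one lane-crossing fixup:
  vpackuswb    zmm0, zmm0, zmm1     # 1 uop (p5)
  vpermq       zmm0, zmm7, zmm0     # 1 uop (p5): un-interleave the qwords

That's 5 shuffle uops vs. 2 for a 64-byte result (plus a one-time load of the
vpermq control vector).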
YMM auto-vec is unchanged since previous comments: we do get vpackuswb +
vpermq, but an indexed addressing mode defeats micro-fusion. And we have a
redundant VPAND after shifting.
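Something like this shape (a sketch, not exact trunk output; the shift count
is illustrative and ymm4 = a hypothetical set1_epi16(0x00FF) mask constant):

  vpsrlw    ymm0, ymm0, 8       # each word's high byte is now zero
  vpsrlw    ymm1, ymm1, 8
  vpand     ymm0, ymm0, ymm4    # redundant: the shift already zero-extended
  vpand     ymm1, ymm1, ymm4    # redundant
  vpackuswb ymm0, ymm0, ymm1    # in-lane pack to bytes
  vpermq    ymm0, ymm0, 0xd8    # lane-crossing fixup: qword order 0,2,1,3
  vmovdqu   [rdi+rax], ymm0     # indexed addressing mode defeats micro-fusion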
---
For icelake-client/server (AVX512VBMI) GCC is using vpermt2b, but it doesn't
fold the shifts into the 2-source byte shuffle. (vpermt2b has 5c latency and
2c throughput on ICL, so probably its uop count is the same as uops.info
measured for CannonLake: 1*p05 + 2*p5. Possibly 2x 1-uop vpermb, with
merge-masking to blend the 2nd shuffle's result into the first, would work
better.)
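A sketch of that idea (zmm6/zmm7 are hypothetical control vectors whose byte
indices pick the wanted byte of each source word directly, folding the shift
into the shuffle; k1 marks the destination bytes that come from the 2nd
source):

  vpermb zmm0,     zmm6, zmm2   # 1 uop: gather bytes from the 1st source
  vpermb zmm0{k1}, zmm7, zmm3   # 1 uop: merge-mask in bytes from the 2nd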
IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so
probably still 2 uops.