https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long.  This
problem only applies to char and short.  Possibly because AVX2 includes vpermd
ymm.

----

Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to
testing on real hardware by https://uops.info/.  Their automated test methods
are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we
should be using vpermb any time we might have used vpermw with a
compile-time-constant control vector.
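
The substitution is always possible for a constant control vector because
every word permutation is also a byte permutation: word index k maps to byte
indices 2k and 2k+1.  A minimal sketch in plain C of that expansion follows.
It is not GCC's actual implementation; the function name
word_ctrl_to_byte_ctrl is hypothetical, and the code assumes little-endian
byte order within each word and a power-of-two element count (8, 16 or 32
words for xmm, ymm or zmm).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand a vpermw-style control vector (one source-word index per
   destination word) into the equivalent vpermb-style control vector
   (one source-byte index per destination byte).
   Hypothetical helper for illustration only. */
void word_ctrl_to_byte_ctrl(const uint16_t *word_idx, uint8_t *byte_idx,
                            size_t nwords)
{
    /* Assumption: nwords is a power of two, as for real vector widths. */
    assert(nwords != 0 && (nwords & (nwords - 1)) == 0);

    for (size_t i = 0; i < nwords; i++) {
        /* vpermw only looks at the low log2(nwords) bits of each index. */
        unsigned src = (unsigned)(word_idx[i] & (nwords - 1));

        byte_idx[2 * i]     = (uint8_t)(2 * src);      /* low byte of the word */
        byte_idx[2 * i + 1] = (uint8_t)(2 * src + 1);  /* high byte of the word */
    }
}
```

For a 512-bit vector the largest byte index produced is 63, which fits the 6
index bits vpermb uses per element, so no information is lost in the
conversion.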


(vpermw requires AVX512BW, e.g. SKX and Cascade Lake.  vpermb requires
AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

InstLat provides some confirmation:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt
 shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a
chain of 2 uops, 1c and 3c being the standard latencies that exist in recent
Intel CPUs).  InstLat doesn't document which input the dep chain goes through,
so it's not 100% confirmation of only 1 uop.  But it's likely that ICL has 1
uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes
letting us figure out that e.g. one of the inputs isn't needed until the 2nd
uop.  Seems to be the case for CannonLake vpermw: latency from one of the
inputs is only 3 cycles, the other is 4. 
https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html
