https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246
--- Comment #1 from Peter Cordes <peter at cordes dot ca> ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm.

----

Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real testing on real hardware by https://uops.info/. Their automated test methods are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we should be using vpermb any time we might have used vpermw with a compile-time-constant control vector.

(vpermw requires AVX512BW, e.g. SKX and Cascade Lake. vpermb requires AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

InstLat provides some confirmation: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a chain of 2 uops, 1c and 3c being the standard latencies that exist in recent Intel CPUs).

InstLat doesn't document which input the dep chain goes through, so it's not 100% confirmation of only 1 uop. But it's likely that ICL has 1 uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes letting us figure out that e.g. one of the inputs isn't needed until the 2nd uop. That seems to be the case for CannonLake vpermw: latency from one of the inputs is only 3 cycles, from the other it is 4.
https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html
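
For anyone who wants to check the generated shuffles themselves, here is a minimal sketch of the kind of element-reversal loop this is about. The function name reverse_in_place and the early-return guard are assumptions made for illustration, and this is not necessarily the exact test case from the report; swapt is the element typedef referred to above. Building it with something like gcc -O3 -march=icelake-client -S and changing the typedef between char, short, int and long should show which permute instruction gets picked for each element width.

```c
#include <stddef.h>

/* Element type under test; change to short, int, or long to compare codegen. */
typedef char swapt;

/* Hypothetical helper (name is an assumption): reverse len elements in place.
   Once vectorized, the reversal shuffle has a compile-time-constant control. */
void reverse_in_place(swapt *head, size_t len)
{
    if (len < 2)
        return;                     /* nothing to swap; avoids forming head - 1 */

    swapt *tail = head + len - 1;   /* last element */
    for (; head < tail; ++head, --tail) {
        swapt tmp = *head;          /* swap the outermost remaining pair */
        *head = *tail;
        *tail = tmp;
    }
}
```

The loop is written with two converging pointers so that the compiler sees a load, a lane reversal, and a store on each end, which is the pattern where the choice between a one-input and a two-input permute matters.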