https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069
--- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Haochen Jiang from comment #10)
> A patch like Comment 8 could definitely solve the problem. But I need to
> test more benchmarks to see if there is surprise.
>
> But, yes, as Uros said in Comment 9, maybe there is a chance we could do it
> better.

Could you add "arch=skylake-avx512" to target_clones and try disabling the
whole ix86_expand_vecop_qihi2 to see if there's any performance improvement?
For x86, cross-lane permutation (truncation) is not very efficient (3-4 cycles
for both vpermq and vpmovwb).
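
For reference, a minimal sketch of what adding "arch=skylake-avx512" to target_clones could look like. The kernel name mul_u8, the BYTE_MUL_CLONES macro, and the plain byte-multiply loop are hypothetical stand-ins, not the testcase from this PR. The guard assumes GCC on x86-64 with glibc (target_clones relies on ifunc); elsewhere the attribute is simply dropped. Disabling ix86_expand_vecop_qihi2 is a change to the compiler itself and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: function multiversioning is only requested for GCC on
   x86-64 with glibc (ifunc support); other setups get a plain function. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) \
    && defined(__GLIBC__)
#define BYTE_MUL_CLONES \
  __attribute__((target_clones("default", "arch=skylake-avx512")))
#else
#define BYTE_MUL_CLONES
#endif

/* Hypothetical kernel: element-wise 8-bit multiply.  When vectorized,
   this kind of QImode vector operation is typically widened to 16-bit
   lanes and truncated back, which is the code path under discussion. */
BYTE_MUL_CLONES
void mul_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = (uint8_t)(a[i] * b[i]);
}
```

Building this with -O2 or -O3 and comparing the generated code of the default and skylake-avx512 clones (for example with objdump -d) should show whether the truncation is done with vpmovwb or with a pack/permute sequence.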