https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101846
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
For bar, the problem is that while vpmovdw is AVX512F, we actually recognize it only at combine time as vpermw (with a selected exact permutation) combined with low part extraction. And vpermw is only AVX512BW.

In order to optimize it, we'd need to implement what LLVM actually has support for, namely the "I don't care" possibilities for the permutations. So, instead of what we emit right now in GIMPLE:

  _1 = VEC_PERM_EXPR <x_2(D), x_2(D), { 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }>;
  _3 = BIT_FIELD_REF <_1, 256, 0>;

we'd need to emit:

  _1 = VEC_PERM_EXPR <x_2(D), x_2(D), { 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY }>;

(we'd need a special VEC_PERM_EXPR variant for that which would only accept VECTOR_CSTs and reserve all ones for the "ANY" case in there).

And, the hard part, adjust the target const vec perm code to handle those efficiently, as a wildcard for whatever other element of the vector or constant 0. One part is the code which verifies the d->perm[?] values: it would treat the wildcards as anything, but for a successful match we'd actually need to compute what value is best based on the non-wildcard values in the permutation. Another is the many cases where we construct RTL and try to recog it: we'd need some new RTL which would stand for CONST_INT_WILDCARD and compare equal to any int, but we'd also need some way for the pattern, if matched, to tell us back which number it wants to use.

With that support, we could recognize the { 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY, ANY } V32HI permutation as matching the vpmovdw instruction, which puts 0s in the upper half of the vector.
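To make the wildcard idea concrete, here is a minimal standalone sketch in C++. It is not GCC code: the `ANY` sentinel, `matches_with_wildcards` and `is_truncate_with_free_upper` are hypothetical names invented for illustration, and the selector is modelled as a plain `std::vector<unsigned>` rather than a VECTOR_CST or `d->perm[]`. It only shows the matching rule described above, where a wildcard selector is compatible with any element (or with zero).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "don't care" selector. The comment above suggests reserving
// all ones in the constant selector for this purpose.
constexpr unsigned ANY = ~0u;

// True if every non-wildcard selector in `perm` agrees with `candidate`,
// i.e. `candidate` is one concrete permutation that `perm` permits.
bool matches_with_wildcards(const std::vector<unsigned>& perm,
                            const std::vector<unsigned>& candidate)
{
    if (perm.size() != candidate.size())
        return false;
    for (std::size_t i = 0; i < perm.size(); ++i)
        if (perm[i] != ANY && perm[i] != candidate[i])
            return false;
    return true;
}

// True if the low half of `perm` is compatible with selecting the even
// elements 0, 2, 4, ... and the high half consists only of wildcards, so
// an instruction that zeroes the upper half would be an acceptable match.
bool is_truncate_with_free_upper(const std::vector<unsigned>& perm)
{
    assert(perm.size() % 2 == 0);
    const std::size_t half = perm.size() / 2;
    for (std::size_t i = 0; i < half; ++i)
        if (perm[i] != ANY && perm[i] != 2 * i)
            return false;
    for (std::size_t i = half; i < perm.size(); ++i)
        if (perm[i] != ANY)
            return false;
    return true;
}
```

With the 32-element selector whose upper half is all `ANY`, `is_truncate_with_free_upper` returns true; with the selector currently emitted (upper half 16..31) it returns false, because those selectors demand specific elements rather than leaving them free.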
The foo case is doable even without this, I think. The question is whether we should try to split an arbitrary permutation of 64-byte vectors into permutations of the two halves, which are then merged together, if the permutation allows that (the first half of the elements comes from the first halves of the inputs and the second half of the elements comes from the second halves of the inputs).
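The splitting condition in parentheses can be written down as a small check. The following is only an illustrative C++ sketch under stated assumptions: `splittable_into_halves` is a hypothetical name, and the selector is assumed to follow the usual two-input convention where, for N result elements, values 0..N-1 index the first input and N..2N-1 index the second.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if a two-input permutation with N result elements can be done as two
// independent half-width permutations: result elements in the low half read
// only from the low halves of the inputs, and result elements in the high
// half read only from the high halves of the inputs.
bool splittable_into_halves(const std::vector<unsigned>& sel)
{
    const std::size_t n = sel.size();
    assert(n % 2 == 0);
    const std::size_t half = n / 2;
    for (std::size_t i = 0; i < n; ++i) {
        if (sel[i] >= 2 * n)
            return false;               // selector out of range
        const bool src_low = (sel[i] % n) < half;  // position within its input
        const bool dst_low = i < half;
        if (src_low != dst_low)
            return false;               // element crosses between halves
    }
    return true;
}
```

For example, with N = 8 an interleave-within-halves selector such as { 0, 8, 1, 9, 4, 12, 5, 13 } passes the check, while the even-element selector { 0, 2, 4, 6, 8, 10, 12, 14 } does not, since result element 2 reads from the high half of the first input.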