A similar patch is under discussion in another mail thread:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00025.html

The main point is that the expand functions are tried in order of the
number of insns they expand to, starting from 1.  expand_vec_perm_palignr
is expected to produce no more than 2 insns.  For the cases handled in
expand_vec_perm_palignr, palignr for both SSE and AVX2 can reduce a
2-operand permutation to a 1-operand one.  Any 1-operand permutation on
SSE takes at most 1 insn, but up to 4 in the AVX2 case.
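To make that reduction concrete, here is a minimal standalone sketch (plain
C, not GCC code; NELT, simulate_palignr_reduction and the variable names are
made up for illustration, only the vec_perm index convention is assumed):
when every index of a 2-operand byte permutation falls into a window
[min, min + NELT) of the concatenation op1:op0, a palignr by min bytes turns
it into a 1-operand shuffle with indices perm[i] - min, which a single
pshufb can then implement on SSE.

#include <stdio.h>

#define NELT 16   /* one 128-bit vector of bytes */

static void
simulate_palignr_reduction (const unsigned char *op0, const unsigned char *op1,
                            const unsigned *perm, unsigned char *out)
{
  unsigned char concat[2 * NELT], aligned[NELT];
  unsigned min = 2 * NELT, i;

  /* vec_perm convention: indices 0..NELT-1 pick from op0,
     NELT..2*NELT-1 pick from op1.  */
  for (i = 0; i < NELT; ++i)
    {
      concat[i] = op0[i];
      concat[NELT + i] = op1[i];
      if (perm[i] < min)
        min = perm[i];
    }

  /* palignr by min bytes: NELT consecutive bytes starting at min.  */
  for (i = 0; i < NELT; ++i)
    aligned[i] = concat[min + i];

  /* What is left is a 1-operand shuffle with indices perm[i] - min;
     on SSE a single pshufb covers this worst case.  */
  for (i = 0; i < NELT; ++i)
    out[i] = aligned[perm[i] - min];
}

int
main (void)
{
  unsigned char a[NELT], b[NELT], r[NELT];
  unsigned perm[NELT], i;

  /* Example: indices 5..20, a contiguous window spanning both operands.
     Here min == 5 and perm[i] - min == i, so the palignr alone already
     produces the result.  */
  for (i = 0; i < NELT; ++i)
    {
      a[i] = i;
      b[i] = 0x10 + i;
      perm[i] = 5 + i;
    }
  simulate_palignr_reduction (a, b, perm, r);
  for (i = 0; i < NELT; ++i)
    printf ("%02x ", r[i]);
  printf ("\n");
  return 0;
}

In this example the adjusted indices are already in order, so no trailing
shuffle is needed (the degenerate case the existing code tests for).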
The other point is that a rotate permutation always becomes 1-operand after
"canonicalize_perm", and it is trickier to catch it in
expand_vec_perm_palignr.  That is why I'm trying to catch the special AVX2
rotate case via "select" (similar to the already implemented SSE rotate).

On Wed, Oct 1, 2014 at 11:32 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 08:26:27AM +0200, Uros Bizjak wrote:
>> On Wed, Oct 1, 2014 at 12:13 AM, Evgeny Stupachenko <evstu...@gmail.com>
>> wrote:
>> > expand_vselect for some reason ignores the expander.
>> > Does it work with expanders?
>> > The comment talks about insn only:
>> > /* Construct (set target (vec_select op0 (parallel perm))) and
>> >    return true if that's a valid instruction in the active ISA.  */
>>
>> It looks to me that the whole approach is wrong from the beginning.
>>
>> There is already a function that generates perm/palignr sequence,
>> conveniently named expand_vec_perm_palignr. This function should be
>> extended to handle AVX2 sequence. You don't have to add any new
>> patterns, existing avx2_permv2ti and avx2_palignr2ti should do the
>> trick.
>
> Yeah, I'd expect something like (so far not even compile tested):
>
> --- gcc/config/i386/i386.c  2014-09-26 10:33:18.789645102 +0200
> +++ gcc/config/i386/i386.c  2014-10-01 09:27:41.435813522 +0200
> @@ -43297,44 +43297,79 @@ expand_vec_perm_palignr (struct expand_v
>    rtx shift, target;
>    struct expand_vec_perm_d dcopy;
>
> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +  /* palignr doesn't help for one_operand_p case.  */
> +  if (d->one_operand_p)
>      return false;
>
> -  min = nelt, max = 0;
> +  /* Even with AVX, palignr only operates on 128-bit vectors,
> +     in AVX2 palignr operates on both 128-bit lanes.  */
> +  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))
> +    return false;
> +
> +  min = 2 * nelt, max = 0;
>    for (i = 0; i < nelt; ++i)
>      {
>        unsigned e = d->perm[i];
> +      if (GET_MODE_SIZE (d->vmode) == 32)
> +        e = (e & ((nelt / 2) - 1)) | ((e & nelt) >> 1);
>        if (e < min)
>          min = e;
>        if (e > max)
>          max = e;
>      }
> -  if (min == 0 || max - min >= nelt)
> +  if (min == 0
> +      || max - min >= (GET_MODE_SIZE (d->vmode) == 32 ? nelt / 2 : nelt))
>      return false;
>
>    /* Given that we have SSSE3, we know we'll be able to implement the
> -     single operand permutation after the palignr with pshufb.  */
> -  if (d->testing_p)
> +     single operand permutation after the palignr with pshufb for
> +     128-bit vectors.  */
> +  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16)
>      return true;
>
>    dcopy = *d;
> -  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> -  target = gen_reg_rtx (TImode);
> -  emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
> -                                  gen_lowpart (TImode, d->op0), shift));
> -
> -  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
> -  dcopy.one_operand_p = true;
>
>    in_order = true;
>    for (i = 0; i < nelt; ++i)
>      {
> -      unsigned e = dcopy.perm[i] - min;
> +      unsigned e = dcopy.perm[i];
> +      if (GET_MODE_SIZE (d->vmode) == 32
> +          && e >= nelt
> +          && (e & (nelt / 2 - 1)) < min)
> +        e = e - min - (nelt / 2);
> +      else
> +        e = e - min;
>        if (e != i)
>          in_order = false;
>        dcopy.perm[i] = e;
>      }
> +  dcopy.one_operand_p = true;
> +
> +  /* For AVX2, test whether we can permute the result in one instruction.  */
> +  if (d->testing_p)
> +    {
> +      if (in_order)
> +        return true;
> +      dcopy.op1 = dcopy.op0;
> +      return expand_vec_perm_1 (&dcopy);
> +    }
> +
> +  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> +  if (GET_MODE_SIZE (d->vmode) == 16)
> +    {
> +      target = gen_reg_rtx (TImode);
> +      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
> +                                      gen_lowpart (TImode, d->op0), shift));
> +    }
> +  else
> +    {
> +      target = gen_reg_rtx (V2TImode);
> +      emit_insn (gen_avx2_palignr2vti (target, gen_lowpart (V2TImode, d->op1),
> +                                       gen_lowpart (V2TImode, d->op0), shift));
> +    }
> +
> +  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
>
>    /* Test for the degenerate case where the alignment by itself
>       produces the desired permutation.  */
>
>> And, as said by H.J., please add the testcase from the PR that will
>> exercise the code path. Without the testcase included, the patch is
>> unreviewable, and this is the reason why no maintainer (including me)
>> wants to approve it in its current form.
>
> I'd hope my gcc.dg/torture/vshuf* tests catch it, if not in normal
> mode, at least in the expensive one.  OT, note, those tests haven't been
> extended for AVX-512 modes (for most modes it is trivial, just
> V64QI will need new vshuf-64.inc).
>
>       Jakub
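For reference, below is a standalone model of the per-lane behaviour of
256-bit vpalignr that the AVX2 path in the patch above relies on.  This is
only my illustrative sketch, not GCC code or part of the patch, and
avx2_palignr_model, hi_src, lo_src are made-up names.  Because the
instruction concatenates only the corresponding 128-bit lanes of the two
sources, the patch folds every index into a per-lane position via
(e & ((nelt / 2) - 1)) | ((e & nelt) >> 1) and gives up unless
max - min < nelt / 2 for the 32-byte modes.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Model of 256-bit vpalignr: the 128-bit palignr is applied to each
   128-bit lane independently, concatenating the corresponding lanes of
   the two sources (hi_src high, lo_src low, as with d->op1/d->op0 in
   the patch).  Bytes shifted in past the 32-byte lane concat are zero.  */
static void
avx2_palignr_model (const uint8_t hi_src[32], const uint8_t lo_src[32],
                    unsigned shift_bytes, uint8_t out[32])
{
  unsigned lane, i;

  for (lane = 0; lane < 2; ++lane)
    {
      uint8_t concat[32];

      /* Per lane: low 16 bytes from lo_src, high 16 bytes from hi_src.  */
      memcpy (concat, lo_src + 16 * lane, 16);
      memcpy (concat + 16, hi_src + 16 * lane, 16);

      /* Shift right by shift_bytes within the lane concat.  */
      for (i = 0; i < 16; ++i)
        out[16 * lane + i] = (shift_bytes + i < 32) ? concat[shift_bytes + i] : 0;
    }
}

int
main (void)
{
  uint8_t op0[32], op1[32], r[32];
  unsigned i;

  for (i = 0; i < 32; ++i)
    {
      op0[i] = i;          /* plays the role of d->op0 */
      op1[i] = 0x40 + i;   /* plays the role of d->op1 */
    }
  /* Shift by 5 bytes: lane 0 yields op0[5..15], op1[0..4]; lane 1 yields
     op0[21..31], op1[16..20].  Bytes never cross the lane boundary, which
     is why the index window must fit within nelt / 2.  */
  avx2_palignr_model (op1, op0, 5, r);
  for (i = 0; i < 32; ++i)
    printf ("%02x%s", r[i], i == 15 ? "\n" : " ");
  printf ("\n");
  return 0;
}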