A similar patch is under discussion in another mail thread:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg00025.html

The main point is that the expand functions are tried in order of the
number of insns they expand to, starting from 1.  expand_vec_perm_palignr
is expected to produce no more than 2 insns.  For the cases handled in
expand_vec_perm_palignr, palignr for both SSE and AVX2 can reduce a
2-operand permutation to a 1-operand one.  Any 1-operand permutation on
SSE takes at most 1 insn, but up to 4 in the AVX2 case.
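To make that reduction concrete, here is a minimal standalone sketch (plain
C, not GCC code; NELT, simulate_palignr_reduction and the variable names are
made up for illustration, only the vec_perm index convention is assumed):
when every index of a 2-operand byte permutation falls into a window
[min, min + NELT) of the concatenation op1:op0, a palignr by min bytes turns
it into a 1-operand shuffle with indices perm[i] - min, which a single
pshufb can then implement on SSE.

#include <stdio.h>

#define NELT 16   /* one 128-bit vector of bytes */

static void
simulate_palignr_reduction (const unsigned char *op0, const unsigned char *op1,
                            const unsigned *perm, unsigned char *out)
{
  unsigned char concat[2 * NELT], aligned[NELT];
  unsigned min = 2 * NELT, i;

  /* vec_perm convention: indices 0..NELT-1 pick from op0,
     NELT..2*NELT-1 pick from op1.  */
  for (i = 0; i < NELT; ++i)
    {
      concat[i] = op0[i];
      concat[NELT + i] = op1[i];
      if (perm[i] < min)
        min = perm[i];
    }

  /* palignr by min bytes: NELT consecutive bytes starting at min.  */
  for (i = 0; i < NELT; ++i)
    aligned[i] = concat[min + i];

  /* What is left is a 1-operand shuffle with indices perm[i] - min;
     on SSE a single pshufb covers this worst case.  */
  for (i = 0; i < NELT; ++i)
    out[i] = aligned[perm[i] - min];
}

int
main (void)
{
  unsigned char a[NELT], b[NELT], r[NELT];
  unsigned perm[NELT], i;

  /* Example: indices 5..20, a contiguous window spanning both operands.
     Here min == 5 and perm[i] - min == i, so the palignr alone already
     produces the result.  */
  for (i = 0; i < NELT; ++i)
    {
      a[i] = i;
      b[i] = 0x10 + i;
      perm[i] = 5 + i;
    }
  simulate_palignr_reduction (a, b, perm, r);
  for (i = 0; i < NELT; ++i)
    printf ("%02x ", r[i]);
  printf ("\n");
  return 0;
}

In this example the adjusted indices are already in order, so no trailing
shuffle is needed (the degenerate case the existing code tests for).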
The other point is that a rotate permutation always becomes 1-operand after
"canonicalize_perm", and it is trickier to catch it in
expand_vec_perm_palignr.  That is why I'm trying to catch the special AVX2
rotate case via "select" (similar to the already implemented SSE rotate).

On Wed, Oct 1, 2014 at 11:32 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Oct 01, 2014 at 08:26:27AM +0200, Uros Bizjak wrote:
>> On Wed, Oct 1, 2014 at 12:13 AM, Evgeny Stupachenko <evstu...@gmail.com>
>> wrote:
>> > expand_vselect for some reason ignores the expander.
>> > Does it work with expanders?
>> > The comment talks about insn only:
>> > /* Construct (set target (vec_select op0 (parallel perm))) and
>> >    return true if that's a valid instruction in the active ISA.  */
>>
>> It looks to me that the whole approach is wrong from the beginning.
>>
>> There is already a function that generates perm/palignr sequence,
>> conveniently named expand_vec_perm_palignr. This function should be
>> extended to handle AVX2 sequence. You don't have to add any new
>> patterns, existing avx2_permv2ti and avx2_palignr2ti should do the
>> trick.
>
> Yeah, I'd expect something like (so far not even compile tested):
>
> --- gcc/config/i386/i386.c  2014-09-26 10:33:18.789645102 +0200
> +++ gcc/config/i386/i386.c  2014-10-01 09:27:41.435813522 +0200
> @@ -43297,44 +43297,79 @@ expand_vec_perm_palignr (struct expand_v
>    rtx shift, target;
>    struct expand_vec_perm_d dcopy;
>
> -  /* Even with AVX, palignr only operates on 128-bit vectors.  */
> -  if (!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +  /* palignr doesn't help for one_operand_p case.  */
> +  if (d->one_operand_p)
>      return false;
>
> -  min = nelt, max = 0;
> +  /* Even with AVX, palignr only operates on 128-bit vectors,
> +     in AVX2 palignr operates on both 128-bit lanes.  */
> +  if ((!TARGET_SSSE3 || GET_MODE_SIZE (d->vmode) != 16)
> +      && (!TARGET_AVX2 || GET_MODE_SIZE (d->vmode) != 32))
> +    return false;
> +
> +  min = 2 * nelt, max = 0;
>    for (i = 0; i < nelt; ++i)
>      {
>        unsigned e = d->perm[i];
> +      if (GET_MODE_SIZE (d->vmode) == 32)
> +        e = (e & ((nelt / 2) - 1)) | ((e & nelt) >> 1);
>        if (e < min)
>          min = e;
>        if (e > max)
>          max = e;
>      }
> -  if (min == 0 || max - min >= nelt)
> +  if (min == 0
> +      || max - min >= (GET_MODE_SIZE (d->vmode) == 32 ? nelt / 2 : nelt))
>      return false;
>
>    /* Given that we have SSSE3, we know we'll be able to implement the
> -     single operand permutation after the palignr with pshufb.  */
> -  if (d->testing_p)
> +     single operand permutation after the palignr with pshufb for
> +     128-bit vectors.  */
> +  if (d->testing_p && GET_MODE_SIZE (d->vmode) == 16)
>      return true;
>
>    dcopy = *d;
> -  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> -  target = gen_reg_rtx (TImode);
> -  emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
> -                                  gen_lowpart (TImode, d->op0), shift));
> -
> -  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
> -  dcopy.one_operand_p = true;
>
>    in_order = true;
>    for (i = 0; i < nelt; ++i)
>      {
> -      unsigned e = dcopy.perm[i] - min;
> +      unsigned e = dcopy.perm[i];
> +      if (GET_MODE_SIZE (d->vmode) == 32
> +          && e >= nelt
> +          && (e & (nelt / 2 - 1)) < min)
> +        e = e - min - (nelt / 2);
> +      else
> +        e = e - min;
>        if (e != i)
>          in_order = false;
>        dcopy.perm[i] = e;
>      }
> +  dcopy.one_operand_p = true;
> +
> +  /* For AVX2, test whether we can permute the result in one instruction.  */
> +  if (d->testing_p)
> +    {
> +      if (in_order)
> +        return true;
> +      dcopy.op1 = dcopy.op0;
> +      return expand_vec_perm_1 (&dcopy);
> +    }
> +
> +  shift = GEN_INT (min * GET_MODE_BITSIZE (GET_MODE_INNER (d->vmode)));
> +  if (GET_MODE_SIZE (d->vmode) == 16)
> +    {
> +      target = gen_reg_rtx (TImode);
> +      emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, d->op1),
> +                                      gen_lowpart (TImode, d->op0), shift));
> +    }
> +  else
> +    {
> +      target = gen_reg_rtx (V2TImode);
> +      emit_insn (gen_avx2_palignr2vti (target, gen_lowpart (V2TImode, d->op1),
> +                                       gen_lowpart (V2TImode, d->op0), shift));
> +    }
> +
> +  dcopy.op0 = dcopy.op1 = gen_lowpart (d->vmode, target);
>
>    /* Test for the degenerate case where the alignment by itself
>       produces the desired permutation.  */
>
>> And, as said by H.J., please add the testcase from the PR that will
>> exercise the code path. Without the testcase included, the patch is
>> unreviewable, and this is the reason why no maintainer (including me)
>> wants to approve it in its current form.
>
> I'd hope my gcc.dg/torture/vshuf* tests catch it, if not in normal
> mode, at least in the expensive one.  OT, note, those tests haven't been
> extended for AVX-512 modes (for most modes it is trivial, just
> V64QI will need new vshuf-64.inc).
>
>       Jakub
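For reference, below is a standalone model of the per-lane behaviour of
256-bit vpalignr that the AVX2 path in the patch above relies on.  This is
only my illustrative sketch, not GCC code or part of the patch, and
avx2_palignr_model, hi_src, lo_src are made-up names.  Because the
instruction concatenates only the corresponding 128-bit lanes of the two
sources, the patch folds every index into a per-lane position via
(e & ((nelt / 2) - 1)) | ((e & nelt) >> 1) and gives up unless
max - min < nelt / 2 for the 32-byte modes.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Model of 256-bit vpalignr: the 128-bit palignr is applied to each
   128-bit lane independently, concatenating the corresponding lanes of
   the two sources (hi_src high, lo_src low, as with d->op1/d->op0 in
   the patch).  Bytes shifted in past the 32-byte lane concat are zero.  */
static void
avx2_palignr_model (const uint8_t hi_src[32], const uint8_t lo_src[32],
                    unsigned shift_bytes, uint8_t out[32])
{
  unsigned lane, i;

  for (lane = 0; lane < 2; ++lane)
    {
      uint8_t concat[32];

      /* Per lane: low 16 bytes from lo_src, high 16 bytes from hi_src.  */
      memcpy (concat, lo_src + 16 * lane, 16);
      memcpy (concat + 16, hi_src + 16 * lane, 16);

      /* Shift right by shift_bytes within the lane concat.  */
      for (i = 0; i < 16; ++i)
        out[16 * lane + i] = (shift_bytes + i < 32) ? concat[shift_bytes + i] : 0;
    }
}

int
main (void)
{
  uint8_t op0[32], op1[32], r[32];
  unsigned i;

  for (i = 0; i < 32; ++i)
    {
      op0[i] = i;          /* plays the role of d->op0 */
      op1[i] = 0x40 + i;   /* plays the role of d->op1 */
    }
  /* Shift by 5 bytes: lane 0 yields op0[5..15], op1[0..4]; lane 1 yields
     op0[21..31], op1[16..20].  Bytes never cross the lane boundary, which
     is why the index window must fit within nelt / 2.  */
  avx2_palignr_model (op1, op0, 5, r);
  for (i = 0; i < 32; ++i)
    printf ("%02x%s", r[i], i == 15 ? "\n" : " ");
  printf ("\n");
  return 0;
}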