On Wed, Mar 3, 2021 at 12:29 AM Jakub Jelinek <ja...@redhat.com> wrote:
>
> Hi!
>
> As the testcase shows, the
> (define_peephole2
>   [(set (match_operand 0 "sse_reg_operand")
>         (match_operand 1 "sse_reg_operand"))
>    (set (match_dup 0)
>         (match_operator 3 "commutative_operator"
>           [(match_dup 0)
>            (match_operand 2 "memory_operand")]))]
> peephole2 can for AVX512VL without AVX512BW (I guess it is a hyphothetical
> CPU, but unfortunately they are separate CPUID bits and we have separate
> options for them) turn something that is valid without that peephole2
> into something that is invalid (and in this case ICEs).
> The problem is that the vpadd[bw], vpmullw, vpmin[su][bw] and vpmax[su][bw]
> instructions require both AVX512BW and AVX512VL when they have
> 16-byte or 32-byte operands.  If operands[0] is %[xy]mm0 .. %[xy]mm15
> but operands[1] is %[xy]mm16 .. %[xy]mm31, then before we have
> a vector move which uses vmovdqa{32,64} and doesn't need AVX512BW,
> AVX512VL is I think implied from HARD_REGNO_MODE_OK only supporting
> V{16Q,32Q,8H,16H}imode in EXT_REX_SSE_REGNO_P regs with AVX512VL, and then
> we have a commutative operation with that %[xy]mm0 .. %[xy]mm15 destination
> and one source and a memory operand, so VEX encoded operation.
> And, the peephole2 wants to replace it with a load into the destination
> register from memory (ok) and then the commutative arith instruction.
> But that needs EVEX encoding because of the high register and so requires
> AVX512BW which might not be enabled.
> The exception is and/ior/xor, because the hw doesn't have
> vp{and,or,xor}{b,w} instructions at all, it uses vp{and,or,xor}d instead
> and that of course doesn't need AVX512BW.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> BTW, there are other bugs I need to look at, while the vp{min,max}ub with
> 16-byte operands instruction properly requires avx512bw for v constraints
> and otherwise uses x, e.g. the vpadd[bw] etc. instructions don't.
> I'll try to handle that incrementally later this week.
>
> 2021-03-02  Jakub Jelinek  <ja...@redhat.com>
>
>         PR target/99321
>         * config/i386/i386.md (mov + mem using comm arith peephole2):
>         Punt if operands[1] is EXT_REX_SSE_REGNO_P, AVX512BW is not enabled
>         and the inner mode is [QH]Imode.
>
>         * gcc.target/i386/pr99321.c: New test.

OK, a comment inline.

Thanks,
Uros.

> --- gcc/config/i386/i386.md.jj  2021-02-18 09:16:21.026841364 +0100
> +++ gcc/config/i386/i386.md     2021-03-02 17:57:55.878158963 +0100
> @@ -19843,7 +19843,18 @@ (define_peephole2
>         (match_operator 3 "commutative_operator"
>           [(match_dup 0)
>            (match_operand 2 "memory_operand")]))]
> -  "REGNO (operands[0]) != REGNO (operands[1])"
> +  "REGNO (operands[0]) != REGNO (operands[1])
> +   /* Punt if operands[1] is %[xy]mm16+ and AVX512BW is not enabled,
> +      as EVEX encoded vpadd[bw], vpmullw, vpmin[su][bw] and vpmax[su][bw]
> +      instructions require AVX512BW and AVX512VL, but with the original
> +      instructions it might require just AVX512VL.
> +      AVX512VL is implied from TARGET_HARD_REGNO_MODE_OK.  */
> +   && (!EXT_REX_SSE_REGNO_P (REGNO (operands[1]))
> +       || TARGET_AVX512BW
> +       || GET_MODE_SIZE (GET_MODE_INNER (GET_MODE (operands[0]))) > 2
> +       || GET_CODE (operands[3]) == AND
> +       || GET_CODE (operands[3]) == IOR
> +       || GET_CODE (operands[3]) == XOR)"

Can you please introduce the "logic_operator" predicate, similar to
plusminuslogic_operator and use it here?

>    [(set (match_dup 0) (match_dup 2))
>     (set (match_dup 0)
>         (match_op_dup 3 [(match_dup 0) (match_dup 1)]))])
> --- gcc/testsuite/gcc.target/i386/pr99321.c.jj  2021-03-02 18:05:02.561528747 
> +0100
> +++ gcc/testsuite/gcc.target/i386/pr99321.c     2021-03-02 18:05:51.121001799 
> +0100
> @@ -0,0 +1,41 @@
> +/* PR target/99321 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=btver2 -fno-tree-dce -mavx512vl -mno-avx512bw" } 
> */
> +
> +typedef unsigned __attribute__((__vector_size__ (8))) A;
> +typedef unsigned int __attribute__((__vector_size__ (8))) B;
> +typedef unsigned char __attribute__((__vector_size__ (16))) C;
> +typedef unsigned __attribute__((__vector_size__ (16))) D;
> +typedef unsigned int __attribute__((__vector_size__ (16))) E;
> +typedef unsigned __attribute__((__vector_size__ (16))) F;
> +typedef unsigned __attribute__((__vector_size__ (32))) G;
> +typedef int __attribute__((__vector_size__ (32))) H;
> +typedef unsigned int __attribute__((__vector_size__ (32))) I;
> +typedef char __attribute__((__vector_size__ (64))) J;
> +typedef unsigned int __attribute__((__vector_size__ (64))) K;
> +typedef unsigned long long __attribute__((__vector_size__ (64))) L;
> +unsigned char a;
> +unsigned b, c;
> +H d;
> +E e, f;
> +D g;
> +L h;
> +
> +A
> +foo0 (A i, C j, G k, B l, K m, B n, I o)
> +{
> +  J p, q = a != p;
> +  F r = b << f;
> +  int s = a * 15;
> +  C t = (1 << ((C) ((C) { 80 } >=j) & sizeof (0)) | (j ^ (C) { 5 }) << (j & 
> sizeof (0))) != 0;
> +  L u = h;
> +  H v = d - 40;
> +  u ^= -(long long) n;
> +  D w = (char) s > g;
> +  o ^= c / o;
> +  J x = p + q + (J) m + (J) u + (J) u;
> +  G y = ((union { J a; G b;}) x).b + ((union { J a; G b[2];}) x).b[1] + k + 
> v + o;
> +  C z = ((union { G a; C b;}) y).b + ((union { G a; C b;}) y).b + j + t + 
> (C) g + (C) w + (C) e + (C) f + (C) r;
> +  A zz = ((union { C a; A b;}) z).b + i + l + n;
> +  return zz;
> +}
>
>         Jakub
>

Reply via email to