On Wed, Mar 3, 2021 at 12:29 AM Jakub Jelinek <ja...@redhat.com> wrote: > > Hi! > > As the testcase shows, the > (define_peephole2 > [(set (match_operand 0 "sse_reg_operand") > (match_operand 1 "sse_reg_operand")) > (set (match_dup 0) > (match_operator 3 "commutative_operator" > [(match_dup 0) > (match_operand 2 "memory_operand")]))] > peephole2 can for AVX512VL without AVX512BW (I guess it is a hyphothetical > CPU, but unfortunately they are separate CPUID bits and we have separate > options for them) turn something that is valid without that peephole2 > into something that is invalid (and in this case ICEs). > The problem is that the vpadd[bw], vpmullw, vpmin[su][bw] and vpmax[su][bw] > instructions require both AVX512BW and AVX512VL when they have > 16-byte or 32-byte operands. If operands[0] is %[xy]mm0 .. %[xy]mm15 > but operands[1] is %[xy]mm16 .. %[xy]mm31, then before we have > a vector move which uses vmovdqa{32,64} and doesn't need AVX512BW, > AVX512VL is I think implied from HARD_REGNO_MODE_OK only supporting > V{16Q,32Q,8H,16H}imode in EXT_REX_SSE_REGNO_P regs with AVX512VL, and then > we have a commutative operation with that %[xy]mm0 .. %[xy]mm15 destination > and one source and a memory operand, so VEX encoded operation. > And, the peephole2 wants to replace it with a load into the destination > register from memory (ok) and then the commutative arith instruction. > But that needs EVEX encoding because of the high register and so requires > AVX512BW which might not be enabled. > The exception is and/ior/xor, because the hw doesn't have > vp{and,or,xor}{b,w} instructions at all, it uses vp{and,or,xor}d instead > and that of course doesn't need AVX512BW. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > BTW, there are other bugs I need to look at, while the vp{min,max}ub with > 16-byte operands instruction properly requires avx512bw for v constraints > and otherwise uses x, e.g. the vpadd[bw] etc. instructions don't. > I'll try to handle that incrementally later this week. > > 2021-03-02 Jakub Jelinek <ja...@redhat.com> > > PR target/99321 > * config/i386/i386.md (mov + mem using comm arith peephole2): > Punt if operands[1] is EXT_REX_SSE_REGNO_P, AVX512BW is not enabled > and the inner mode is [QH]Imode. > > * gcc.target/i386/pr99321.c: New test.
OK, a comment inline. Thanks, Uros. > --- gcc/config/i386/i386.md.jj 2021-02-18 09:16:21.026841364 +0100 > +++ gcc/config/i386/i386.md 2021-03-02 17:57:55.878158963 +0100 > @@ -19843,7 +19843,18 @@ (define_peephole2 > (match_operator 3 "commutative_operator" > [(match_dup 0) > (match_operand 2 "memory_operand")]))] > - "REGNO (operands[0]) != REGNO (operands[1])" > + "REGNO (operands[0]) != REGNO (operands[1]) > + /* Punt if operands[1] is %[xy]mm16+ and AVX512BW is not enabled, > + as EVEX encoded vpadd[bw], vpmullw, vpmin[su][bw] and vpmax[su][bw] > + instructions require AVX512BW and AVX512VL, but with the original > + instructions it might require just AVX512VL. > + AVX512VL is implied from TARGET_HARD_REGNO_MODE_OK. */ > + && (!EXT_REX_SSE_REGNO_P (REGNO (operands[1])) > + || TARGET_AVX512BW > + || GET_MODE_SIZE (GET_MODE_INNER (GET_MODE (operands[0]))) > 2 > + || GET_CODE (operands[3]) == AND > + || GET_CODE (operands[3]) == IOR > + || GET_CODE (operands[3]) == XOR)" Can you please introduce the "logic_operator" predicate, similar to plusminuslogic_operator and use it here? > [(set (match_dup 0) (match_dup 2)) > (set (match_dup 0) > (match_op_dup 3 [(match_dup 0) (match_dup 1)]))]) > --- gcc/testsuite/gcc.target/i386/pr99321.c.jj 2021-03-02 18:05:02.561528747 > +0100 > +++ gcc/testsuite/gcc.target/i386/pr99321.c 2021-03-02 18:05:51.121001799 > +0100 > @@ -0,0 +1,41 @@ > +/* PR target/99321 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mtune=btver2 -fno-tree-dce -mavx512vl -mno-avx512bw" } > */ > + > +typedef unsigned __attribute__((__vector_size__ (8))) A; > +typedef unsigned int __attribute__((__vector_size__ (8))) B; > +typedef unsigned char __attribute__((__vector_size__ (16))) C; > +typedef unsigned __attribute__((__vector_size__ (16))) D; > +typedef unsigned int __attribute__((__vector_size__ (16))) E; > +typedef unsigned __attribute__((__vector_size__ (16))) F; > +typedef unsigned __attribute__((__vector_size__ (32))) G; > +typedef int __attribute__((__vector_size__ (32))) H; > +typedef unsigned int __attribute__((__vector_size__ (32))) I; > +typedef char __attribute__((__vector_size__ (64))) J; > +typedef unsigned int __attribute__((__vector_size__ (64))) K; > +typedef unsigned long long __attribute__((__vector_size__ (64))) L; > +unsigned char a; > +unsigned b, c; > +H d; > +E e, f; > +D g; > +L h; > + > +A > +foo0 (A i, C j, G k, B l, K m, B n, I o) > +{ > + J p, q = a != p; > + F r = b << f; > + int s = a * 15; > + C t = (1 << ((C) ((C) { 80 } >=j) & sizeof (0)) | (j ^ (C) { 5 }) << (j & > sizeof (0))) != 0; > + L u = h; > + H v = d - 40; > + u ^= -(long long) n; > + D w = (char) s > g; > + o ^= c / o; > + J x = p + q + (J) m + (J) u + (J) u; > + G y = ((union { J a; G b;}) x).b + ((union { J a; G b[2];}) x).b[1] + k + > v + o; > + C z = ((union { G a; C b;}) y).b + ((union { G a; C b;}) y).b + j + t + > (C) g + (C) w + (C) e + (C) f + (C) r; > + A zz = ((union { C a; A b;}) z).b + i + l + n; > + return zz; > +} > > Jakub >