[Bug bootstrap/98338] [10/11 Regression] profiledbootstrap failure on x86_64-linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98338

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #25 from Jan Hubicka ---
Fixed. Sorry for the delay - next time I should not commit into a private
branch :(
[Bug tree-optimization/99101] optimization bug with -ffinite-loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99101

--- Comment #24 from Jan Hubicka ---
I do not think there is a problem with pdom for cyclic WRT acyclic paths.
Both notions are equivalent here. If you have an instruction I in, say, the
header of a loop and you determine it live, then the condition controlling the
loopback is in control dependent blocks and that will bring it live, which
transitively brings everything else.

The thing is that post dominance assumes that every path must progress to the
exit node (as promised by -ffinite-loops).

volatile int xx;
int main()
{
  int jobs_ = 1;
  int at_eof_ = 0;
  while (1)
    {
      for (int i = 0; i < jobs_; i++)
        {
          if (at_eof_)
            continue;
          at_eof_ = 1;
          __builtin_printf ("1\n");
          if (xx)
            return 1;
        }
      jobs_ = 0;
    }
  return 0;
}

has an infinite loop that is sort of equivalent to

volatile int xx;
int main()
{
  int jobs_ = 1;
  int at_eof_ = 0;
  while (1)
    {
      if (at_eof_)
        continue;
      at_eof_ = 1;
      __builtin_printf ("1\n");
      if (xx)
        return 1;
      jobs_ = 0;
      while (jobs_ == 0);
    }
  return 0;
}

and we manage to "shortcut" "while (jobs_ == 0);" rather than forcing the
original loop to be finite. Since the difference is not visible across any
path that must progress to the exit node, both are valid in this sense.

With -fno-finite-loops pdoms still do not consider infinite paths, but since we
make sure that every BB has a path to exit, every infinite path can be
approximated by a sequence of finite paths. Since we keep all the finite paths
consistent, the only problem may be that we will optimize out the condition
deciding on the back edge, but we don't do that because we mark them
necessary...
[Bug middle-end/99394] New: s254 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394 Bug ID: 99394 Summary: s254 benchmark of TSVC is vectorized by clang and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Clang is vectorizing s254 loop with -mtune=archive on znver2 leading to about 758% speedup. Loop is: real_t s254(struct args_t * func_args) { //scalar and array expansion //carry around variable initialise_arrays(__func__); gettimeofday(&func_args->t1, NULL); real_t x; for (int nl = 0; nl < 4*iterations; nl++) { x = b[LEN_1D-1]; for (int i = 0; i < LEN_1D; i++) { a[i] = (b[i] + x) * (real_t).5; x = b[i]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } gettimeofday(&func_args->t2, NULL); return calc_checksum(__func__); } and clang produces: 00407d30 : 407d30: 41 56 push %r14 407d32: 53 push %rbx 407d33: 48 83 ec 28 sub$0x28,%rsp 407d37: 49 89 femov%rdi,%r14 407d3a: bf 6b e2 42 00 mov$0x42e26b,%edi 407d3f: e8 cc f8 00 00 call 417610 407d44: 31 db xor%ebx,%ebx 407d46: 4c 89 f7mov%r14,%rdi 407d49: 31 f6 xor%esi,%esi 407d4b: e8 10 93 ff ff call 401060 407d50: c4 62 7d 18 05 af 62vbroadcastss 0x262af(%rip),%ymm8 # 42e008 <_IO_stdin_used+0x8> 407d57: 02 00 407d59: c5 7c 11 04 24 vmovups %ymm8,(%rsp) 407d5e: 66 90 xchg %ax,%ax 407d60: 48 c7 c0 00 0c fe ffmov$0xfffe0c00,%rax 407d67: c4 e2 7d 18 05 8c a7vbroadcastss 0x4a78c(%rip),%ymm0 # 4524fc 407d6e: 04 00 407d70: c5 fc 28 88 00 25 45vmovaps 0x452500(%rax),%ymm1 407d77: 00 407d78: c5 fc 28 90 20 25 45vmovaps 0x452520(%rax),%ymm2 407d7f: 00 407d80: c5 fc 28 98 40 25 45vmovaps 0x452540(%rax),%ymm3 407d87: 00 407d88: c4 e3 7d 06 c1 21 vperm2f128 $0x21,%ymm1,%ymm0,%ymm0 407d8e: c5 fc 28 a0 60 25 45vmovaps 0x452560(%rax),%ymm4 407d95: 00 407d96: c5 fc c6 c1 03 vshufps $0x3,%ymm1,%ymm0,%ymm0 407d9b: c5 fc c6 c1 98 vshufps $0x98,%ymm1,%ymm0,%ymm0 407da0: c4 e3 75 06 ea 21 vperm2f128 $0x21,%ymm2,%ymm1,%ymm5 
407da6: c5 d4 c6 ea 03 vshufps $0x3,%ymm2,%ymm5,%ymm5 407dab: c5 d4 c6 ea 98 vshufps $0x98,%ymm2,%ymm5,%ymm5 407db0: c4 e3 6d 06 f3 21 vperm2f128 $0x21,%ymm3,%ymm2,%ymm6 407db6: c5 cc c6 f3 03 vshufps $0x3,%ymm3,%ymm6,%ymm6 407dbb: c5 cc c6 f3 98 vshufps $0x98,%ymm3,%ymm6,%ymm6 407dc0: c4 e3 65 06 fc 21 vperm2f128 $0x21,%ymm4,%ymm3,%ymm7 407dc6: c5 c4 c6 fc 03 vshufps $0x3,%ymm4,%ymm7,%ymm7 407dcb: c5 c4 c6 fc 98 vshufps $0x98,%ymm4,%ymm7,%ymm7 407dd0: c5 f4 58 c0 vaddps %ymm0,%ymm1,%ymm0 407dd4: c5 ec 58 cd vaddps %ymm5,%ymm2,%ymm1 407dd8: c5 e4 58 d6 vaddps %ymm6,%ymm3,%ymm2 407ddc: c5 dc 58 df vaddps %ymm7,%ymm4,%ymm3 407de0: c5 bc 59 c0 vmulps %ymm0,%ymm8,%ymm0 407de4: c5 bc 59 c9 vmulps %ymm1,%ymm8,%ymm1 407de8: c5 bc 59 d2 vmulps %ymm2,%ymm8,%ymm2 407dec: c5 bc 59 db vmulps %ymm3,%ymm8,%ymm3 407df0: c5 fc 29 80 00 19 47vmovaps %ymm0,0x471900(%rax) 407df7: 00 407df8: c5 fc 29 88 20 19 47vmovaps %ymm1,0x471920(%rax) 407dff: 00 407e00: c5 fc 29 90 40 19 47vmovaps %ymm2,0x471940(%rax) 407e07: 00 407e08: c5 fc 29 98 60 19 47vmovaps %ymm3,0x471960(%rax) 407e0f: 00 407e10: c5 fc 28 c4 vmovaps %ymm4,%ymm0 407e14: 48 83 e8 80 sub$0xff80,%rax 407e18: 0f 85 52 ff ff ff jne407d70 407e1e: bf 00 25 45 00 mov$0x452500,%edi 407e23: be 00 31 43 00 mov$0x433100,%esi 407e28: ba 00 19 47 00 mov$0x471900,%edx 407e2d: b9 00 0d 49 00 mov$0x490d00,%ecx 407e32: 41 b8 00 01 4b 00 mov$0x4b0100,%r8d 407e38: 41 b9 00 f5 4c 00 mov$0x4cf500,%r9d 407e3e: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0 407e42: 68 00 f5 54 00 push $0x54f500 407e47: 68 00 f5 50 00 push $0x50f500 407e4c: c5 f8 77v
[Bug middle-end/99395] New: s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 Bug ID: 99395 Summary: s116 benchmark of TSVC is vectorized by clang and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- s116 loop is: real_t s116(struct args_t * func_args) { //linear dependence testing initialise_arrays(__func__); gettimeofday(&func_args->t1, NULL); for (int nl = 0; nl < iterations*10; nl++) { for (int i = 0; i < LEN_1D - 5; i += 5) { a[i] = a[i + 1] * a[i]; a[i + 1] = a[i + 2] * a[i + 1]; a[i + 2] = a[i + 3] * a[i + 2]; a[i + 3] = a[i + 4] * a[i + 3]; a[i + 4] = a[i + 5] * a[i + 4]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } gettimeofday(&func_args->t2, NULL); return calc_checksum(__func__); } and vectorized code produced by clang11 is about 2 times faster on zen3 machine 00401d00 : 401d00: 41 56 push %r14 401d02: 53 push %rbx 401d03: 50 push %rax 401d04: 49 89 femov%rdi,%r14 401d07: bf 66 e1 42 00 mov$0x42e166,%edi 401d0c: e8 ff 58 01 00 call 417610 401d11: 31 db xor%ebx,%ebx 401d13: 4c 89 f7mov%r14,%rdi 401d16: 31 f6 xor%esi,%esi 401d18: e8 43 f3 ff ff call 401060 401d1d: eb 47 jmp401d66 401d1f: 90 nop 401d20: bf 00 25 45 00 mov$0x452500,%edi 401d25: be 00 31 43 00 mov$0x433100,%esi 401d2a: ba 00 19 47 00 mov$0x471900,%edx 401d2f: b9 00 0d 49 00 mov$0x490d00,%ecx 401d34: 41 b8 00 01 4b 00 mov$0x4b0100,%r8d 401d3a: 41 b9 00 f5 4c 00 mov$0x4cf500,%r9d 401d40: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0 401d44: 68 00 f5 54 00 push $0x54f500 401d49: 68 00 f5 50 00 push $0x50f500 401d4e: e8 6d 3c 01 00 call 4159c0 401d53: 48 83 c4 10 add$0x10,%rsp 401d57: 83 c3 01add$0x1,%ebx 401d5a: 81 fb 40 42 0f 00 cmp$0xf4240,%ebx 401d60: 0f 84 9a 00 00 00 je 401e00 401d66: c5 fa 10 05 92 07 05vmovss 0x50792(%rip),%xmm0# 452500 401d6d: 00 401d6e: 31 c0 xor%eax,%eax 401d70: c5 fa 10 0c 85 04 25vmovss 0x452504(,%rax,4),%xmm1 401d77: 45 00 401d79: c5 fa 59 c1 
vmulss %xmm1,%xmm0,%xmm0 401d7d: c5 fa 11 04 85 00 25vmovss %xmm0,0x452500(,%rax,4) 401d84: 45 00 401d86: c5 f8 10 04 85 08 25vmovups 0x452508(,%rax,4),%xmm0 401d8d: 45 00 401d8f: c5 f0 c6 c8 00 vshufps $0x0,%xmm0,%xmm1,%xmm1 401d94: c5 f0 c6 c8 98 vshufps $0x98,%xmm0,%xmm1,%xmm1 401d99: c5 f8 59 c9 vmulps %xmm1,%xmm0,%xmm1 401d9d: c5 f8 11 0c 85 04 25vmovups %xmm1,0x452504(,%rax,4) 401da4: 45 00 401da6: 48 3d f5 7c 00 00 cmp$0x7cf5,%rax 401dac: 0f 87 6e ff ff ff ja 401d20 401db2: c4 e3 79 04 c0 e7 vpermilps $0xe7,%xmm0,%xmm0 401db8: c5 fa 10 0c 85 18 25vmovss 0x452518(,%rax,4),%xmm1 401dbf: 45 00 401dc1: c5 fa 59 c1 vmulss %xmm1,%xmm0,%xmm0 401dc5: c5 fa 11 04 85 14 25vmovss %xmm0,0x452514(,%rax,4) 401dcc: 45 00 401dce: c5 f8 10 04 85 1c 25vmovups 0x45251c(,%rax,4),%xmm0 401dd5: 45 00 401dd7: c5 f0 c6 c8 00 vshufps $0x0,%xmm0,%xmm1,%xmm1 401ddc: c5 f0 c6 c8 98 vshufps $0x98,%xmm0,%xmm1,%xmm1 401de1: c5 f8 59 c9 vmulps %xmm1,%xmm0,%xmm1 401de5: c5 fa 10 04 85 28 25vmovss 0x452528(,%rax,4),%xmm0 401dec: 45 00 401dee: c5 f8 11 0c 85 18 25vmovups %xmm1,0x452518(,%rax,4) 401df5: 45 00 401df7: 48 83 c0 0a add$0xa,%rax 401dfb: e9 70 ff ff ff jmp401d70 401e00: 49 83 c6 10 add$0x10,%r14 401e04: 4c 89 f7mov%r14,%rdi 401e07: 31 f6 xor%esi,%esi 401e09: e8 52 f2 ff ff call 401060 401e0e: bf 66 e1 42 00 mov$0x42e166,%edi 401e13: 48 83 c4 08 add$0x8,%rsp 401e17: 5b pop%rbx 401e18: 41 5e pop%r14 401e1a: e9 e1 51
[Bug middle-end/99397] New: s152 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99397 Bug ID: 99397 Summary: s152 benchmark of TSVC is vectorized by clang and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- s152 is: void s152s(real_t a[LEN_1D], real_t b[LEN_1D], real_t c[LEN_1D], int i) { a[i] += b[i] * c[i]; } real_t s152(struct args_t * func_args) { //interprocedural data flow analysis //collecting information from a subroutine initialise_arrays(__func__); gettimeofday(&func_args->t1, NULL); for (int nl = 0; nl < iterations; nl++) { for (int i = 0; i < LEN_1D; i++) { b[i] = d[i] * e[i]; s152s(a, b, c, i); } dummy(a, b, c, d, e, aa, bb, cc, 0.); } gettimeofday(&func_args->t2, NULL); return calc_checksum(__func__); } and clang11 vectorizes it as: 004048b0 : 4048b0: 41 56 push %r14 4048b2: 53 push %rbx 4048b3: 50 push %rax 4048b4: 49 89 femov%rdi,%r14 4048b7: bf b7 e1 42 00 mov$0x42e1b7,%edi 4048bc: e8 4f 2d 01 00 call 417610 4048c1: 31 db xor%ebx,%ebx 4048c3: 4c 89 f7mov%r14,%rdi 4048c6: 31 f6 xor%esi,%esi 4048c8: e8 93 c7 ff ff call 401060 4048cd: 0f 1f 00nopl (%rax) 4048d0: 31 c0 xor%eax,%eax 4048d2: 66 2e 0f 1f 84 00 00cs nopw 0x0(%rax,%rax,1) 4048d9: 00 00 00 4048dc: 0f 1f 40 00 nopl 0x0(%rax) 4048e0: c5 fc 28 80 00 01 4bvmovaps 0x4b0100(%rax),%ymm0 4048e7: 00 4048e8: c5 fc 28 88 20 01 4bvmovaps 0x4b0120(%rax),%ymm1 4048ef: 00 4048f0: c5 fc 59 80 00 0d 49vmulps 0x490d00(%rax),%ymm0,%ymm0 4048f7: 00 4048f8: c5 f4 59 88 20 0d 49vmulps 0x490d20(%rax),%ymm1,%ymm1 4048ff: 00 404900: c5 fc 29 80 00 31 43vmovaps %ymm0,0x433100(%rax) 404907: 00 404908: c5 fc 29 88 20 31 43vmovaps %ymm1,0x433120(%rax) 40490f: 00 404910: c5 fc 28 90 00 19 47vmovaps 0x471900(%rax),%ymm2 404917: 00 404918: c5 fc 28 98 20 19 47vmovaps 0x471920(%rax),%ymm3 40491f: 00 404920: c4 e2 7d a8 90 00 25vfmadd213ps 0x452500(%rax),%ymm0,%ymm2 404927: 45 00 404929: c4 e2 75 a8 
98 20 25vfmadd213ps 0x452520(%rax),%ymm1,%ymm3 404930: 45 00 404932: c5 fc 29 90 00 25 45vmovaps %ymm2,0x452500(%rax) 404939: 00 40493a: c5 fc 29 98 20 25 45vmovaps %ymm3,0x452520(%rax) 404941: 00 404942: 48 83 c0 40 add$0x40,%rax 404946: 48 3d 00 f4 01 00 cmp$0x1f400,%rax 40494c: 75 92 jne4048e0 40494e: bf 00 25 45 00 mov$0x452500,%edi 404953: be 00 31 43 00 mov$0x433100,%esi 404958: ba 00 19 47 00 mov$0x471900,%edx 40495d: b9 00 0d 49 00 mov$0x490d00,%ecx 404962: 41 b8 00 01 4b 00 mov$0x4b0100,%r8d 404968: 41 b9 00 f5 4c 00 mov$0x4cf500,%r9d 40496e: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0 404972: 68 00 f5 54 00 push $0x54f500 404977: 68 00 f5 50 00 push $0x50f500 40497c: c5 f8 77vzeroupper 40497f: e8 3c 10 01 00 call 4159c0 404984: 48 83 c4 10 add$0x10,%rsp 404988: 83 c3 01add$0x1,%ebx 40498b: 81 fb a0 86 01 00 cmp$0x186a0,%ebx 404991: 0f 85 39 ff ff ff jne4048d0 404997: 49 83 c6 10 add$0x10,%r14 40499b: 4c 89 f7mov%r14,%rdi 40499e: 31 f6 xor%esi,%esi 4049a0: e8 bb c6 ff ff call 401060 4049a5: bf b7 e1 42 00 mov$0x42e1b7,%edi 4049aa: 48 83 c4 08 add$0x8,%rsp 4049ae: 5b pop%rbx 4049af: 41 5e pop%r14 4049b1: e9 4a 26 02 00 jmp427000 4049b6: 66 2e 0f 1f 84 00 00cs nopw 0x0(%rax,%rax,1) 4049bd: 00 00 00 We get: real_t s152 (struct args_t * func_args) { int i; int nl; static const char __func__[5] = "s152"; struct timeval * _1; float _2; float _3; float _4; struct timeval * _5; real_t _16; long unsigned int _21; long unsigned int _22; real_t * _23; float _24; real_t * _25; float
[Bug middle-end/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #1 from Jan Hubicka ---
Loop is:

real_t s116 (struct args_t * func_args)
{
  int i;
  int nl;
  static const char __func__[5] = "s116";
  struct timeval * _1;
  int _2;
  float _3;
  float _4;
  float _5;
  int _6;
  float _7;
  float _8;
  float _9;
  int _10;
  float _11;
  float _12;
  float _13;
  int _14;
  float _15;
  float _16;
  float _17;
  int _18;
  float _19;
  float _20;
  float _21;
  struct timeval * _22;
  real_t _33;
  unsigned int ivtmp_43;
  unsigned int ivtmp_44;
  unsigned int ivtmp_45;
  unsigned int ivtmp_46;

  [local count: 108459]:
  initialise_arrays (&__func__);
  _1 = &func_args_29(D)->t1;
  gettimeofday (_1, 0B);
  goto ; [100.00%]

  [local count: 1052266996]:

  [local count: 1063004409]:
  # i_48 = PHI <_18(8), 0(5)>
  # ivtmp_46 = PHI
  _2 = i_48 + 1;
  _3 = a[_2];
  _4 = a[i_48];
  _5 = _3 * _4;
  a[i_48] = _5;
  _6 = i_48 + 2;
  _7 = a[_6];
  _8 = a[_2];
  _9 = _7 * _8;
  a[_2] = _9;
  _10 = i_48 + 3;
  _11 = a[_10];
  _12 = a[_6];
  _13 = _11 * _12;
  a[_6] = _13;
  _14 = i_48 + 4;
  _15 = a[_14];
  _16 = a[_10];
  _17 = _15 * _16;
  a[_10] = _17;
  _18 = i_48 + 5;
  _19 = a[_18];
  _20 = a[_14];
  _21 = _19 * _20;
  a[_14] = _21;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

tsvc.c:275:18: missed: not vectorized, possible dependence between data-refs a[i_48] and a[_18]
tsvc.c:274:27: missed: bad data dependence.

_18 = i_48 + 5 and stride is 5...
[Bug middle-end/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #1 from Jan Hubicka ---
Here we fail with:

tsvc.c:1526:27: note: vect_is_simple_use: operand x_30 = PHI <_2(8), x_18(3)>, type of def: unknown
tsvc.c:1526:27: missed: Unsupported pattern.
tsvc.c:1527:26: missed: not vectorized: unsupported use in stmt.
tsvc.c:1526:27: missed: unexpected pattern.

{
  int i;
  int nl;
  real_t x;
  static const char __func__[5] = "s254";
  struct timeval * _1;
  float _2;
  float _3;
  float _4;
  struct timeval * _5;
  real_t _17;
  unsigned int ivtmp_27;
  unsigned int ivtmp_28;
  unsigned int ivtmp_29;
  unsigned int ivtmp_35;

  [local count: 108459]:
  initialise_arrays (&__func__);
  _1 = &func_args_13(D)->t1;
  gettimeofday (_1, 0B);

  [local count: 10737416]:
  # nl_31 = PHI
  # ivtmp_28 = PHI
  x_18 = b[31999];

  [local count: 1063004409]:
  # x_30 = PHI <_2(8), x_18(3)>
  # i_32 = PHI
  # ivtmp_35 = PHI
  _2 = b[i_32];
  _3 = _2 + x_30;
  _4 = _3 * 5.0e-1;
  a[i_32] = _4;
  i_22 = i_32 + 1;
  ivtmp_29 = ivtmp_35 - 1;
  if (ivtmp_29 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [local count: 1052266996]:
  goto ; [100.00%]
[Bug tree-optimization/99394] s254 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99394

--- Comment #3 from Jan Hubicka ---
testcase is:

typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256
// array definitions

real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;
// %2.5

real_t s254(void)
{

//    scalar and array expansion
//    carry around variable

    real_t x;
    for (int nl = 0; nl < 4*iterations; nl++) {
        x = b[LEN_1D-1];
        for (int i = 0; i < LEN_1D; i++) {
            a[i] = (b[i] + x) * (real_t).5;
            x = b[i];
        }
    }
}
[Bug middle-end/99407] New: s243 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99407

            Bug ID: 99407
           Summary: s243 benchmark of TSVC is vectorized by clang and not
                    by gcc
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This testcase (from TSVC) is about 4 times faster on zen3 when built with
clang.

typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256
// array definitions

real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;

real_t s243(void)
{

//    node splitting
//    false dependence cycle breaking

    for (int nl = 0; nl < iterations; nl++) {
        for (int i = 0; i < LEN_1D-1; i++) {
            a[i] = b[i] + c[i  ] * d[i];
            b[i] = a[i] + d[i  ] * e[i];
            a[i] = b[i] + a[i+1] * d[i];
        }
    }
}

internal loop from clang is:

.LBB0_2:                                # Parent Loop BB0_1 Depth=1
                                        # => This Inner Loop Header: Depth=2
        vmovups c(%rcx), %ymm12
        vmovups c+32(%rcx), %ymm14
        vmovups d(%rcx), %ymm0
        vmovups d+32(%rcx), %ymm7
        vfmadd213ps     b(%rcx), %ymm0, %ymm12  # ymm12 = (ymm0 * ymm12) + mem
        vfmadd213ps     b+32(%rcx), %ymm7, %ymm14 # ymm14 = (ymm7 * ymm14) + mem
        vfmadd231ps     e(%rcx), %ymm0, %ymm12  # ymm12 = (ymm0 * mem) + ymm12
        vfmadd231ps     e+32(%rcx), %ymm7, %ymm14 # ymm14 = (ymm7 * mem) + ymm14
        vmovups %ymm12, b(%rcx)
        vmovups %ymm14, b+32(%rcx)
        vfmadd231ps     a+4(%rcx), %ymm0, %ymm12 # ymm12 = (ymm0 * mem) + ymm12
        vfmadd231ps     a+36(%rcx), %ymm7, %ymm14 # ymm14 = (ymm7 * mem) + ymm14
        vmovups %ymm12, a(%rcx)
        vmovups %ymm14, a+32(%rcx)
        addq    $64, %rcx
        cmpq    $127936, %rcx                   # imm = 0x1F3C0
        jne     .LBB0_2
[Bug middle-end/99407] s243 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99407

--- Comment #1 from Jan Hubicka ---
Here we get:

s243.c:27:18: missed: not vectorized, possible dependence between data-refs a[i_29] and a[_9]
s243.c:26:27: missed: bad data dependence.
s243.c:26:27: note: * Analysis failed with vector mode V8QI

  [local count: 1052266997]:

  [local count: 1063004410]:
  # i_29 = PHI <_9(6), 0(4)>
  # ivtmp_43 = PHI
  _1 = b[i_29];
  _2 = c[i_29];
  _3 = d[i_29];
  _4 = _2 * _3;
  _5 = _1 + _4;
  a[i_29] = _5;
  _6 = e[i_29];
  _7 = _3 * _6;
  _8 = _5 + _7;
  b[i_29] = _8;
  _9 = i_29 + 1;
  _10 = a[_9];
  _11 = _3 * _10;
  _12 = _8 + _11;
  a[i_29] = _12;
  ivtmp_42 = ivtmp_43 - 1;
  if (ivtmp_42 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]
[Bug middle-end/99408] New: s3251 benchmark of TSVC vectorized by clang runs about 7 times faster compared to gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99408 Bug ID: 99408 Summary: s3251 benchmark of TSVC vectorized by clang runs about 7 times faster compared to gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D]; void main(void) { for (int nl = 0; nl < iterations; nl++) { for (int i = 0; i < LEN_1D-1; i++){ a[i+1] = b[i]+c[i]; b[i] = c[i]*e[i]; d[i] = a[i]*e[i]; } } } Built with -march=znver2 -Ofast I get: main: .LFB0: .cfi_startproc vmovaps c+127968(%rip), %xmm5 vmovaps e+127968(%rip), %xmm4 movl$10, %edx vmovq c+127984(%rip), %xmm9 vmovq e+127984(%rip), %xmm10 vmovss c+127992(%rip), %xmm7 vmovss e+127992(%rip), %xmm3 vmovss c+127984(%rip), %xmm13 vmulps %xmm4, %xmm5, %xmm6 vmulps %xmm9, %xmm10, %xmm12 vmulss %xmm3, %xmm7, %xmm11 .p2align 4 .p2align 3 .L2: xorl%eax, %eax .p2align 4 .p2align 3 .L4: vmovaps c(%rax), %ymm2 addq$32, %rax vaddps b-32(%rax), %ymm2, %ymm0 vmovups %ymm0, a-28(%rax) vmulps e-32(%rax), %ymm2, %ymm0 vmovaps e-32(%rax), %ymm2 vmovaps %ymm0, b-32(%rax) vmulps a-32(%rax), %ymm2, %ymm0 vmovaps %ymm0, d-32(%rax) cmpq$127968, %rax jne .L4 vaddps b+127968(%rip), %xmm5, %xmm1 vaddss b+127984(%rip), %xmm13, %xmm2 decl%edx vmovaps %xmm6, b+127968(%rip) vmovq b+127984(%rip), %xmm0 vmovlps %xmm12, b+127984(%rip) vaddps %xmm0, %xmm9, %xmm0 vmovups %xmm1, a+127972(%rip) vshufps $255, %xmm1, %xmm1, %xmm1 vmulps a+127968(%rip), %xmm4, %xmm8 vunpcklps %xmm2, %xmm1, %xmm1 vaddss b+127992(%rip), %xmm7, %xmm2 vmovss %xmm11, b+127992(%rip) vmulps %xmm10, %xmm1, %xmm1 vmovlps %xmm0, a+127988(%rip) vmovshdup %xmm0, %xmm0 vmulss %xmm3, %xmm0, %xmm0 vmovss %xmm2, a+127996(%rip) jne .L2 vmovaps %xmm8, d+127968(%rip) vmovlps %xmm1, d+127984(%rip) vmovss %xmm0, d+127992(%rip) vzeroupper 
ret Clang does: main: # @main .cfi_startproc # %bb.0: vbroadcastssa(%rip), %ymm0 vmovss e+127968(%rip), %xmm1 # xmm1 = mem[0],zero,zero,zero vmovss e+127980(%rip), %xmm2 # xmm2 = mem[0],zero,zero,zero vmovss c+127984(%rip), %xmm4 # xmm4 = mem[0],zero,zero,zero vmovss e+127984(%rip), %xmm5 # xmm5 = mem[0],zero,zero,zero vmovss c+127988(%rip), %xmm8 # xmm8 = mem[0],zero,zero,zero vmovss e+127988(%rip), %xmm9 # xmm9 = mem[0],zero,zero,zero vmovss c+127992(%rip), %xmm11 # xmm11 = mem[0],zero,zero,zero vmovss e+127992(%rip), %xmm12 # xmm12 = mem[0],zero,zero,zero xorl%eax, %eax vmovups %ymm0, -56(%rsp)# 32-byte Spill vmovss c+127968(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero vmovss %xmm1, -64(%rsp)# 4-byte Spill vmulss %xmm4, %xmm5, %xmm3 vmulss %xmm8, %xmm9, %xmm10 vmulss %xmm11, %xmm12, %xmm13 vmovss %xmm0, -60(%rsp)# 4-byte Spill vmulss %xmm0, %xmm1, %xmm0 vmovss e+127972(%rip), %xmm1 # xmm1 = mem[0],zero,zero,zero vmovss %xmm0, -68(%rsp)# 4-byte Spill vmovss c+127972(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero vmovss %xmm1, -76(%rsp)# 4-byte Spill vmovss %xmm0, -72(%rsp)# 4-byte Spill vmulss %xmm0, %xmm1, %xmm0 vmovss e+127976(%rip), %xmm1 # xmm1 = mem[0],zero,zero,zero vmovss %xmm0, -80(%rsp)# 4-byte Spill vmovss c+127976(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero vmovss %xmm1, -88(%rsp)# 4-byte Spill vmovss %xmm0, -84(%rsp)# 4-byte Spill vmulss %xmm0, %xmm1, %xmm0 vmovss c+127980(%rip), %xmm1 # xmm1 = mem[0],zero,zero,zero vmovss %xmm0, -92(%rsp)# 4-byte Spill vmulss %xmm1, %xmm2, %xmm0 vmovss %xmm0, -96(%rsp)# 4-byte Spill .p2align4, 0x90 .LBB0_1:# =>This Loop Header:
[Bug middle-end/99409] New: s252 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409 Bug ID: 99409 Summary: s252 benchmark of TSVC is vectorized by clang and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D]; void main() { //scalar and array expansion //loop with ambiguous scalar temporary real_t t, s; for (int nl = 0; nl < iterations; nl++) { t = (real_t) 0.; for (int i = 0; i < LEN_1D; i++) { s = b[i] * c[i]; a[i] = s + t; t = s; } } } clang does: main: # @main .cfi_startproc # %bb.0: xorl%eax, %eax .p2align4, 0x90 .LBB0_1:# =>This Loop Header: Depth=1 # Child Loop BB0_2 Depth 2 vxorps %xmm0, %xmm0, %xmm0 movq$-128000, %rcx # imm = 0xFFFE0C00 .p2align4, 0x90 .LBB0_2:# Parent Loop BB0_1 Depth=1 # => This Inner Loop Header: Depth=2 vmovups c+128000(%rcx), %ymm1 vmovups c+128032(%rcx), %ymm2 vmovups c+128064(%rcx), %ymm3 vmovups c+128096(%rcx), %ymm4 vmulps b+128000(%rcx), %ymm1, %ymm1 vmulps b+128032(%rcx), %ymm2, %ymm2 vmulps b+128064(%rcx), %ymm3, %ymm3 vmulps b+128096(%rcx), %ymm4, %ymm4 vperm2f128 $33, %ymm1, %ymm0, %ymm0 # ymm0 = ymm0[2,3],ymm1[0,1] vperm2f128 $33, %ymm2, %ymm1, %ymm5 # ymm5 = ymm1[2,3],ymm2[0,1] vperm2f128 $33, %ymm3, %ymm2, %ymm6 # ymm6 = ymm2[2,3],ymm3[0,1] vperm2f128 $33, %ymm4, %ymm3, %ymm7 # ymm7 = ymm3[2,3],ymm4[0,1] vshufps $3, %ymm1, %ymm0, %ymm0 # ymm0 = ymm0[3,0],ymm1[0,0],ymm0[7,4],ymm1[4,4] vshufps $3, %ymm2, %ymm5, %ymm5 # ymm5 = ymm5[3,0],ymm2[0,0],ymm5[7,4],ymm2[4,4] vshufps $3, %ymm3, %ymm6, %ymm6 # ymm6 = ymm6[3,0],ymm3[0,0],ymm6[7,4],ymm3[4,4] vshufps $3, %ymm4, %ymm7, %ymm7 # ymm7 = ymm7[3,0],ymm4[0,0],ymm7[7,4],ymm4[4,4] vshufps $152, %ymm1, %ymm0, %ymm0 # ymm0 = ymm0[0,2],ymm1[1,2],ymm0[4,6],ymm1[5,6] vshufps $152, %ymm2, %ymm5, %ymm5 # ymm5 = 
ymm5[0,2],ymm2[1,2],ymm5[4,6],ymm2[5,6] vshufps $152, %ymm3, %ymm6, %ymm6 # ymm6 = ymm6[0,2],ymm3[1,2],ymm6[4,6],ymm3[5,6] vshufps $152, %ymm4, %ymm7, %ymm7 # ymm7 = ymm7[0,2],ymm4[1,2],ymm7[4,6],ymm4[5,6] vaddps %ymm0, %ymm1, %ymm0 vaddps %ymm5, %ymm2, %ymm1 vaddps %ymm6, %ymm3, %ymm2 vaddps %ymm7, %ymm4, %ymm3 vmovups %ymm0, a+128000(%rcx) vmovups %ymm1, a+128032(%rcx) vmovups %ymm2, a+128064(%rcx) vmovups %ymm3, a+128096(%rcx) subq$-128, %rcx vmovaps %ymm4, %ymm0 jne .LBB0_2 # %bb.3:# in Loop: Header=BB0_1 Depth=1 incl%eax cmpl$10, %eax # imm = 0x186A0 jne .LBB0_1 # %bb.4: vzeroupper retq s252.c:18:27: note: worklist: examine stmt: _3 = s_11 + t_21; s252.c:18:27: note: vect_is_simple_use: operand _1 * _2, type of def: internal s252.c:18:27: note: mark relevant 5, live 0: s_11 = _1 * _2; s252.c:18:27: note: vect_is_simple_use: operand t_21 = PHI , type of def: unknown s252.c:18:27: missed: Unsupported pattern. s252.c:20:22: missed: not vectorized: unsupported use in stmt. s252.c:18:27: missed: unexpected pattern. [local count: 1052266996]: [local count: 1063004409]: # t_21 = PHI # i_23 = PHI # ivtmp_20 = PHI _1 = b[i_23]; _2 = c[i_23]; s_11 = _1 * _2; _3 = s_11 + t_21; a[i_23] = _3; i_13 = i_23 + 1; ivtmp_19 = ivtmp_20 - 1; if (ivtmp_19 != 0) goto ; [98.99%] else goto ; [1.01%]
[Bug middle-end/99411] New: s311 benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411 Bug ID: 99411 Summary: s311 benchmark of TSVC is vectorized by clang better than by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D]; int main() { //reductions //sum reduction real_t sum; for (int nl = 0; nl < iterations*10; nl++) { sum = (real_t)0.; for (int i = 0; i < LEN_1D; i++) { sum += a[i]; } } return sum > 4; } We produce with -O2 -march=znver2 .L2: movl$a, %eax vxorps %xmm0, %xmm0, %xmm0 .p2align 4 .p2align 3 .L3: vaddps (%rax), %ymm0, %ymm0 addq$32, %rax cmpq$a+128000, %rax jne .L3 vextractf128$0x1, %ymm0, %xmm1 decl%edx vaddps %xmm0, %xmm1, %xmm1 vmovhlps%xmm1, %xmm1, %xmm0 vaddps %xmm1, %xmm0, %xmm0 vshufps $85, %xmm0, %xmm0, %xmm1 vaddps %xmm0, %xmm1, %xmm0 jne .L2 xorl%eax, %eax vcomiss .LC0(%rip), %xmm0 seta%al vzeroupper ret .cfi_endproc clang does: main: # @main .cfi_startproc # %bb.0: xorl%eax, %eax .p2align4, 0x90 .LBB0_1:# =>This Loop Header: Depth=1 # Child Loop BB0_2 Depth 2 vxorps %xmm0, %xmm0, %xmm0 movq$-128000, %rcx # imm = 0xFFFE0C00 vxorps %xmm1, %xmm1, %xmm1 vxorps %xmm2, %xmm2, %xmm2 vxorps %xmm3, %xmm3, %xmm3 .p2align4, 0x90 .LBB0_2:# Parent Loop BB0_1 Depth=1 # => This Inner Loop Header: Depth=2 vaddps a+128000(%rcx), %ymm0, %ymm0 vaddps a+128032(%rcx), %ymm1, %ymm1 vaddps a+128064(%rcx), %ymm2, %ymm2 vaddps a+128096(%rcx), %ymm3, %ymm3 subq$-128, %rcx jne .LBB0_2 # %bb.3:# in Loop: Header=BB0_1 Depth=1 incl%eax cmpl$100, %eax # imm = 0xF4240 jne .LBB0_1 # %bb.4: vaddps %ymm0, %ymm1, %ymm0 xorl%eax, %eax vaddps %ymm0, %ymm2, %ymm0 vaddps %ymm0, %ymm3, %ymm0 vextractf128$1, %ymm0, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vpermilpd $1, %xmm0, %xmm1# xmm1 = xmm0[1,0] vaddps %xmm1, %xmm0, %xmm0 vmovshdup %xmm0, %xmm1# xmm1 = xmm0[1,1,3,3] vaddss 
%xmm1, %xmm0, %xmm0 vucomiss.LCPI0_0(%rip), %xmm0 seta%al vzeroupper retq On zen3 hardware gcc version runs 2.4s, while clang's 0.8s
[Bug middle-end/99411] s311 and s31111 benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|s311 benchmark of TSVC is   |s311 and s3 benchmark
                   |vectorized by clang better  |of TSVC is vectorized by
                   |than by gcc                 |clang better than by gcc

--- Comment #1 from Jan Hubicka ---
I think this is same case

typedef float real_t;

#define iterations 100
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D];

real_t test(real_t* A){
    real_t s = (real_t)0.0;
    for (int i = 0; i < 4; i++)
        s += A[i];
    return s;
}

int main()
{

//    reductions
//    sum reduction

    real_t sum;
    for (int nl = 0; nl < 2000*iterations; nl++) {
        sum = (real_t)0.;
        sum += test(a);
        sum += test(&a[4]);
        sum += test(&a[8]);
        sum += test(&a[12]);
        sum += test(&a[16]);
        sum += test(&a[20]);
        sum += test(&a[24]);
        sum += test(&a[28]);
    }

    return sum>4;
}
[Bug middle-end/99411] s311, s312 and s31111 benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|s311 and s3 benchmark       |s311, s312 and s3
                   |of TSVC is vectorized by    |benchmark of TSVC is
                   |clang better than by gcc    |vectorized by clang better
                   |                            |than by gcc

--- Comment #2 from Jan Hubicka ---
another one:

// %3.1
typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D];

int main ()
{

//    reductions
//    product reduction

    real_t prod;
    for (int nl = 0; nl < 10*iterations; nl++) {
        prod = (real_t)1.;
        for (int i = 0; i < LEN_1D; i++) {
            prod *= a[i];
        }
    }

    return prod > 0;
}
[Bug middle-end/99411] s311, s312, s31111 and s31111 benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|s311, s312 and s3           |s311, s312, s3 and
                   |benchmark of TSVC is        |s3 benchmark of TSVC is
                   |vectorized by clang better  |vectorized by clang better
                   |than by gcc                 |than by gcc

--- Comment #3 from Jan Hubicka ---
and yet another one

typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D];

int main()
{

//    reductions
//    conditional sum reduction

    real_t sum;
    for (int nl = 0; nl < iterations/2; nl++) {
        sum = 0.;
        for (int i = 0; i < LEN_1D; i++) {
            if (a[i] > (real_t)0.) {
                sum += a[i];
            }
        }
    }

    return sum > 4;
}
[Bug middle-end/99411] s311, s312, s31111 and s31111, s3110 benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|s311, s312, s3 and          |s311, s312, s3 and
                   |s3 benchmark of TSVC is     |s3, s3110 benchmark of
                   |vectorized by clang better  |TSVC is vectorized by clang
                   |than by gcc                 |better than by gcc

--- Comment #4 from Jan Hubicka ---
typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D];
real_t aa[LEN_2D][LEN_2D];

int main()
{

//    reductions
//    if to max with index reductio 2 dimensions
//    similar to S315

    int xindex, yindex;
    real_t max, chksum;
    for (int nl = 0; nl < 100*(iterations/(LEN_2D)); nl++) {
        max = aa[(0)][0];
        xindex = 0;
        yindex = 0;
        for (int i = 0; i < LEN_2D; i++) {
            for (int j = 0; j < LEN_2D; j++) {
                if (aa[i][j] > max) {
                    max = aa[i][j];
                    xindex = i;
                    yindex = j;
                }
            }
        }
        chksum = max + (real_t) xindex + (real_t) yindex;
    }

    return max + xindex+1 + yindex+1;
}
[Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412 Bug ID: 99412 Summary: s352 benchmark of TSVC is vectorized by clang and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D]; int main () { //loop rerolling //unrolled dot product real_t dot; for (int nl = 0; nl < 8*iterations; nl++) { dot = (real_t)0.; for (int i = 0; i < LEN_1D; i += 5) { dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4]; } } return dot; } clang does: main: # @main .cfi_startproc # %bb.0: xorl%eax, %eax .p2align4, 0x90 .LBB0_1:# =>This Loop Header: Depth=1 # Child Loop BB0_2 Depth 2 vxorps %xmm0, %xmm0, %xmm0 movq$-5, %rcx .p2align4, 0x90 .LBB0_2:# Parent Loop BB0_1 Depth=1 # => This Inner Loop Header: Depth=2 vmovups b+20(,%rcx,4), %xmm1 vmovss b+36(,%rcx,4), %xmm2# xmm2 = mem[0],zero,zero,zero vmulps a+20(,%rcx,4), %xmm1, %xmm1 vpermilpd $1, %xmm1, %xmm3# xmm3 = xmm1[1,0] vaddps %xmm3, %xmm1, %xmm1 vmovshdup %xmm1, %xmm3# xmm3 = xmm1[1,1,3,3] vaddss %xmm3, %xmm1, %xmm1 vfmadd231ss a+36(,%rcx,4), %xmm2, %xmm1 # xmm1 = (xmm2 * mem) + xmm1 addq$5, %rcx vaddss %xmm0, %xmm1, %xmm0 cmpq$31995, %rcx# imm = 0x7CFB jb .LBB0_2 # %bb.3:# in Loop: Header=BB0_1 Depth=1 incl%eax cmpl$80, %eax # imm = 0xC3500 jne .LBB0_1 # %bb.4: vcvttss2si %xmm0, %eax retq
[Bug middle-end/99411] s311, s312, s31111, s31111, s3110, vsumr benchmark of TSVC is vectorized by clang better than by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99411 Jan Hubicka changed: What|Removed |Added Summary|s311, s312, s3 and |s311, s312, s3, s3, |s3, s3110 benchmark of |s3110, vsumr benchmark of |TSVC is vectorized by clang |TSVC is vectorized by clang |better than by gcc |better than by gcc --- Comment #5 from Jan Hubicka --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D]; int main() { //control loops //vector sum reduction real_t sum; for (int nl = 0; nl < iterations*10; nl++) { sum = 0.; for (int i = 0; i < LEN_1D; i++) { sum += a[i]; } } return sum; }
[Bug middle-end/99414] New: s235 benchmark of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414 Bug ID: 99414 Summary: s235 benchmark of TSVC is vectorized better by icc than gcc (loop interchange) Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D], aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; // %2.3 real_t main(struct args_t * func_args) { //loop interchanging //imperfectly nested loops for (int nl = 0; nl < 200*(iterations/LEN_2D); nl++) { for (int i = 0; i < LEN_2D; i++) { a[i] += b[i] * c[i]; for (int j = 1; j < LEN_2D; j++) { aa[j][i] = aa[j-1][i] + bb[j][i] * a[i]; } } } } runs about 10 times faster on zen3 built by icc -O3 -ip -Ofast -g -march=core-avx2 -mtune=core-avx2 -vec s235.c main: # parameter 1: %rdi ..B1.1: # Preds ..B1.0 # Execution count [1.77e+00] .cfi_startproc ..___tag_value_main.1: ..L2: #9.1 pushq %rbp #9.1 .cfi_def_cfa_offset 16 movq %rsp, %rbp#9.1 .cfi_def_cfa 6, 16 .cfi_offset 6, -16 andq $-128, %rsp #9.1 subq $128, %rsp#9.1 movl $3, %edi #9.1 xorl %esi, %esi#9.1 call __intel_new_feature_proc_init #9.1 # LOE rbx r12 r13 r14 r15 ..B1.12:# Preds ..B1.1 # Execution count [1.77e+00] vstmxcsr (%rsp)#9.1 xorl %eax, %eax#14.5 orl $32832, (%rsp)#9.1 vldmxcsr (%rsp)#9.1 # LOE rbx r12 r13 r14 r15 eax ..B1.2: # Preds ..B1.8 ..B1.12 # Execution count [7.83e+04] xorl %edx, %edx#15.9 # LOE rdx rbx r12 r13 r14 r15 eax ..B1.3: # Preds ..B1.3 ..B1.2 # Execution count [2.00e+07] vmovups b(,%rdx,4), %ymm1 #16.21 lea (,%rdx,4), %rcx #16.13 vmovups 32+b(,%rdx,4), %ymm3 #16.21 vmovups 64+b(,%rdx,4), %ymm5 #16.21 vmovups 96+b(,%rdx,4), %ymm7 #16.21 vmovups 128+b(,%rdx,4), %ymm9 #16.21 vmovups 160+b(,%rdx,4), %ymm11#16.21 vmovups 192+b(,%rdx,4), %ymm13#16.21 vmovups 224+b(,%rdx,4), 
%ymm15#16.21 vmovups c(,%rdx,4), %ymm0 #16.28 vmovups 32+c(,%rdx,4), %ymm2 #16.28 vmovups 64+c(,%rdx,4), %ymm4 #16.28 vmovups 96+c(,%rdx,4), %ymm6 #16.28 vmovups 128+c(,%rdx,4), %ymm8 #16.28 vmovups 160+c(,%rdx,4), %ymm10#16.28 vmovups 192+c(,%rdx,4), %ymm12#16.28 vmovups 224+c(,%rdx,4), %ymm14#16.28 vfmadd213ps a(,%rdx,4), %ymm0, %ymm1#16.13 vfmadd213ps 32+a(,%rdx,4), %ymm2, %ymm3 #16.13 vfmadd213ps 64+a(,%rdx,4), %ymm4, %ymm5 #16.13 vfmadd213ps 96+a(,%rdx,4), %ymm6, %ymm7 #16.13 vfmadd213ps 128+a(,%rdx,4), %ymm8, %ymm9#16.13 vfmadd213ps 160+a(,%rdx,4), %ymm10, %ymm11 #16.13 vfmadd213ps 192+a(,%rdx,4), %ymm12, %ymm13 #16.13 vfmadd213ps 224+a(,%rdx,4), %ymm14, %ymm15 #16.13 vmovups %ymm1, a(%rcx)#16.13 vmovups %ymm3, 32+a(%rcx) #16.13 vmovups %ymm5, 64+a(%rcx) #16.13 vmovups %ymm7, 96+a(%rcx) #16.13 vmovups %ymm9, 128+a(%rcx)
[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395 --- Comment #3 from Jan Hubicka --- ICC version seems to run faster 0040a050 : 40a050: 55 push %rbp 40a051: 48 89 e5mov%rsp,%rbp 40a054: 48 83 e4 e0 and$0xffe0,%rsp 40a058: 41 57 push %r15 40a05a: 53 push %rbx 40a05b: 48 83 ec 10 sub$0x10,%rsp 40a05f: 48 89 fbmov%rdi,%rbx 40a062: bf 74 f5 42 00 mov$0x42f574,%edi 40a067: e8 14 cc 00 00 call 416c80 40a06c: 48 89 dfmov%rbx,%rdi 40a06f: 33 f6 xor%esi,%esi 40a071: e8 4a 70 ff ff call 4010c0 40a076: 33 c0 xor%eax,%eax 40a078: 41 89 c7mov%eax,%r15d 40a07b: 33 d2 xor%edx,%edx 40a07d: 0f 1f 00nopl (%rax) 40a080: c5 fc 10 04 95 04 9dvmovups 0x579d04(,%rdx,4),%ymm0 40a087: 57 00 40a089: c5 fc 10 14 95 24 9dvmovups 0x579d24(,%rdx,4),%ymm2 40a090: 57 00 40a092: c5 fc 10 24 95 44 9dvmovups 0x579d44(,%rdx,4),%ymm4 40a099: 57 00 40a09b: c5 fc 10 34 95 64 9dvmovups 0x579d64(,%rdx,4),%ymm6 40a0a2: 57 00 40a0a4: c5 fc 59 0c 95 00 9dvmulps 0x579d00(,%rdx,4),%ymm0,%ymm1 40a0ab: 57 00 40a0ad: c5 ec 59 1c 95 20 9dvmulps 0x579d20(,%rdx,4),%ymm2,%ymm3 40a0b4: 57 00 40a0b6: c5 dc 59 2c 95 40 9dvmulps 0x579d40(,%rdx,4),%ymm4,%ymm5 40a0bd: 57 00 40a0bf: c5 cc 59 3c 95 60 9dvmulps 0x579d60(,%rdx,4),%ymm6,%ymm7 40a0c6: 57 00 40a0c8: c5 fc 11 0c 95 00 9dvmovups %ymm1,0x579d00(,%rdx,4) 40a0cf: 57 00 40a0d1: c5 fc 11 1c 95 20 9dvmovups %ymm3,0x579d20(,%rdx,4) 40a0d8: 57 00 40a0da: c5 fc 11 2c 95 40 9dvmovups %ymm5,0x579d40(,%rdx,4) 40a0e1: 57 00 40a0e3: c5 fc 11 3c 95 60 9dvmovups %ymm7,0x579d60(,%rdx,4) 40a0ea: 57 00 40a0ec: 48 83 c2 20 add$0x20,%rdx 40a0f0: 48 81 fa e0 7c 00 00cmp$0x7ce0,%rdx 40a0f7: 72 87 jb 40a080 40a0f9: 33 c9 xor%ecx,%ecx 40a0fb: ba e1 7c 00 00 mov$0x7ce1,%edx 40a100: c5 fc 10 04 95 00 9dvmovups 0x579d00(,%rdx,4),%ymm0 40a107: 57 00 40a109: 48 83 c2 08 add$0x8,%rdx 40a10d: c5 fc 59 0c 8d 80 90vmulps 0x599080(,%rcx,4),%ymm0,%ymm1 40a114: 59 00 40a116: c5 fc 11 0c 8d 80 90vmovups %ymm1,0x599080(,%rcx,4) 40a11d: 59 00 40a11f: 48 83 c1 08 add$0x8,%rcx 40a123: 48 83 f9 18 
cmp$0x18,%rcx 40a127: 72 d7 jb 40a100 40a129: c5 fa 10 0d b3 ef 18vmovss 0x18efb3(%rip),%xmm1# 5990e4 40a130: 00 40a131: bf 00 9d 57 00 mov$0x579d00,%edi 40a136: c5 fa 10 1d aa ef 18vmovss 0x18efaa(%rip),%xmm3# 5990e8 40a13d: 00 40a13e: be 80 d8 45 00 mov$0x45d880,%esi 40a143: c5 f2 59 05 95 ef 18vmulss 0x18ef95(%rip),%xmm1,%xmm0 # 5990e0 40a14a: 00 40a14b: ba 00 a9 55 00 mov$0x55a900,%edx 40a150: c5 e2 59 25 94 ef 18vmulss 0x18ef94(%rip),%xmm3,%xmm4 # 5990ec 40a157: 00 40a158: c5 f2 59 d3 vmulss %xmm3,%xmm1,%xmm2 40a15c: c5 fa 11 05 7c ef 18vmovss %xmm0,0x18ef7c(%rip)# 5990e0 40a163: 00 40a164: b9 80 e4 43 00 mov$0x43e480,%ecx 40a169: c5 fa 11 15 73 ef 18vmovss %xmm2,0x18ef73(%rip)# 5990e4 40a170: 00 40a171: 41 b8 00 b5 53 00 mov$0x53b500,%r8d 40a177: c5 fa 11 25 69 ef 18vmovss %xmm4,0x18ef69(%rip)# 5990e8 40a17e: 00 40a17f: 41 b9 c0 b4 4b 00 mov$0x4bb4c0,%r9d 40a185: 68 00 91 59 00 push $0x599100 40a18a: 68 00 b5 4f 00 push $0x4fb500 40a18f: c5 f8 77vzeroupper 40a192: c5 f8 57 c0 vxorps %xmm0,%xmm0,%xmm0 40a196: e8 d5 92 00 00 call 413470 40a19b: 48 83 c4 10 add$0x10,%rsp 40a19f: 41 ff c7inc%r15d 40a1a2: 41 81 ff 40 42 0f 00cmp$0xf4240,%r15d 40a1a9: 0f 82 cc fe ff ff jb 40a07b 40a1af: 48 83 c3 10 add$0x10,%rbx 40a1b3: 33 f6 xor%esi,%esi 40a1b5: 48 89 dfmov%rbx,%rdi 40a1b8: e8 03 6f ff ff call 4010c0 40a1bd: bf 74 f5 42 00 mov$0x42f57
[Bug middle-end/99415] New: s115 benchmark of TSVC is vectorized by icc and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99415 Bug ID: 99415 Summary: s115 benchmark of TSVC is vectorized by icc and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],aa[LEN_2D][LEN_2D]; void main() { for (int nl = 0; nl < 1000*(iterations/LEN_2D); nl++) { for (int j = 0; j < LEN_2D; j++) { for (int i = j+1; i < LEN_2D; i++) { a[i] -= aa[j][i] * a[j]; } } } } is built as: main: ..B1.1: # Preds ..B1.0 # Execution count [1.17e-01] .cfi_startproc ..___tag_value_main.1: ..L2: #9.1 pushq %rbp #9.1 .cfi_def_cfa_offset 16 movq %rsp, %rbp#9.1 .cfi_def_cfa 6, 16 .cfi_offset 6, -16 andq $-128, %rsp #9.1 pushq %r14 #9.1 pushq %r15 #9.1 pushq %rbx #9.1 subq $104, %rsp#9.1 movl $3, %edi #9.1 xorl %esi, %esi#9.1 call __intel_new_feature_proc_init #9.1 .cfi_escape 0x10, 0x03, 0x0e, 0x38, 0x1c, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xe8, 0xff, 0xff, 0xff, 0x22 .cfi_escape 0x10, 0x0e, 0x0e, 0x38, 0x1c, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xf8, 0xff, 0xff, 0xff, 0x22 .cfi_escape 0x10, 0x0f, 0x0e, 0x38, 0x1c, 0x0d, 0x80, 0xff, 0xff, 0xff, 0x1a, 0x0d, 0xf0, 0xff, 0xff, 0xff, 0x22 # LOE rbx r12 r13 r14 r15 ..B1.29:# Preds ..B1.1 # Execution count [1.17e-01] vstmxcsr (%rsp)#9.1 xorl %eax, %eax#11.5 orl $32832, (%rsp)#9.1 vldmxcsr (%rsp)#9.1 # LOE r12 r13 eax ..B1.2: # Preds ..B1.22 ..B1.29 # Execution count [4.50e+04] xorl %r11d, %r11d #12.9 xorl %edi, %edi#12.9 xorl %ebx, %ebx#12.9 xorl %r9d, %r9d#12.9 xorl %esi, %esi#12.9 # LOE rbx rsi r11 r12 r13 eax edi r9d ..B1.3: # Preds ..B1.21 ..B1.2 # Execution count [1.15e+07] incl %edi #13.28 decl %r9d #13.28 cmpl $256, %edi#13.35 jge ..B1.21 # Prob 50% #13.35 # LOE rbx rsi r11 r12 r13 eax edi r9d ..B1.4: # Preds ..B1.3 # Execution count [1.04e+07] lea 256(%r9), %r10d 
#13.35 cmpl $16, %r10d#13.13 jl..B1.25 # Prob 10% #13.13 # LOE rbx rsi r11 r12 r13 eax edi r9d r10d ..B1.5: # Preds ..B1.4 # Execution count [1.04e+07] lea 4+aa(%rsi,%rbx), %r8 #14.25 andq $31, %r8 #13.13 lea (%rsi,%rbx), %r14 #14.25 movl %r8d, %edx#13.13 negl %edx #13.13 addl $32, %edx #13.13 shrl $2, %edx #13.13 testl %r8d, %r8d#13.13 cmovne%edx, %r8d#13.13 lea 16(%r8), %ecx #13.13 cmpl %ecx, %r10d #13.13
[Bug middle-end/99416] New: s211 benchmark of TSVC is vectorized by icc and not by gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99416 Bug ID: 99416 Summary: s211 benchmark of TSVC is vectorized by icc and not by gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D]; void main() { for (int nl = 0; nl < iterations; nl++) { for (int i = 1; i < LEN_1D-1; i++) { a[i] = b[i - 1] + c[i] * d[i]; b[i] = b[i + 1] - e[i] * d[i]; } } } Icc produces: ain: ..B1.1: # Preds ..B1.0 # Execution count [0.00e+00] .cfi_startproc ..___tag_value_ain.1: ..L2: #9.1 subq $136, %rsp#9.1 .cfi_def_cfa_offset 144 xorl %edx, %edx#11.5 lea 12+d(%rip), %r8 #14.38 vmovss(%r8), %xmm0 #14.38 movl $7, %edi #13.38 lea 12+e(%rip), %r9 #14.38 vmulss(%r9), %xmm0, %xmm12 #14.38 xorl %esi, %esi#13.38 lea 12+c(%rip), %r10 #13.38 vmulss(%r10), %xmm0, %xmm0 #13.38 vmovss16(%r8), %xmm4#14.38 movl $31977, %ecx #12.9 vmulss16(%r9), %xmm4, %xmm14#14.38 movl $31975, %eax #12.9 lea 24+b(%rip), %r11 #14.20 vmovss(%r11), %xmm11#14.20 vmovss4(%r8), %xmm6 #14.38 vmovss%xmm12, 104(%rsp) #14.38[spill] vmovss%xmm11, 8(%rsp) #14.20[spill] vmulss4(%r9), %xmm6, %xmm12 #14.38 vmulss4(%r10), %xmm6, %xmm11#13.38 vmovss127984+d(%rip), %xmm6 #14.38 vmovss8(%r8), %xmm13#14.38 vmovss%xmm14, 96(%rsp) #14.38[spill] vmulss127984+e(%rip), %xmm6, %xmm14 #14.38 vmulss8(%r9), %xmm13, %xmm1 #14.38 vmovss%xmm14, 112(%rsp) #14.38[spill] vmovss127988+d(%rip), %xmm14#14.38 vmovss%xmm1, 16(%rsp) #14.38[spill] vmulss8(%r10), %xmm13, %xmm1#13.38 vmulss16(%r10), %xmm4, %xmm13 #13.38 vmulss127988+e(%rip), %xmm14, %xmm4 #14.38 vmovss%xmm4, 120(%rsp) #14.38[spill] vmulss127988+c(%rip), %xmm14, %xmm4 #13.38 vmovss-4(%r11), %xmm5 #14.20 vmovss-8(%r8), %xmm2#14.38 vmovss12(%r8), %xmm15 #14.38 vmovss%xmm4, 24(%rsp) #13.38[spill] vmovss127992+d(%rip), 
%xmm4 #14.38 vmovss%xmm5, (%rsp) #14.20[spill] vmulss-8(%r9), %xmm2, %xmm3 #14.38 vmulss-8(%r10), %xmm2, %xmm5#13.38 vmulss12(%r9), %xmm15, %xmm2#14.38 vmulss12(%r10), %xmm15, %xmm15 #13.38 vmulss127992+e(%rip), %xmm4, %xmm14 #14.38 vmulss127992+c(%rip), %xmm4, %xmm4 #13.38 vmovss-4(%r8), %xmm10 #14.38 vmulss-4(%r9), %xmm10, %xmm7#14.38 vmulss-4(%r10), %xmm10, %xmm10 #13.38 vmovss%xmm7, 88(%rsp) #14.38[spill] vmovss%xmm4, 32(%rsp) #13.38[spill] vmovss%xmm15, 56(%rsp) #13.31[spill] vmovss%xmm14, 40(%rsp) #13.31[spill] vmovss%xmm3, 80(%rsp) #
[Bug middle-end/99633] New: s1113 benchmark of TSVC is unrolled by icc and not by gcc and runs faster on znver3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99633

            Bug ID: 99633
           Summary: s1113 benchmark of TSVC is unrolled by icc and not by gcc and runs faster on znver3
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

typedef float real_t;

#define iterations 10
#define LEN_1D 32000
#define LEN_2D 256
// array definitions

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];

int main(struct args_t * func_args)
{
//    linear dependence testing
//    one iteration dependency on a(LEN_1D/2) but still vectorizable

    //initialise_arrays(__func__);
    //gettimeofday(&func_args->t1, NULL);

    for (int nl = 0; nl < 2*iterations; nl++) {
        for (int i = 0; i < LEN_1D; i++) {
            a[i] = a[LEN_1D/2] + b[i];
        }
        //dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

    return a[10];
}

This is unrolled twice by icc and runs in 1.5s, compared to 2.6s when built with gcc. -funroll-loops fixes the issue, but it suggests we may want to unroll by default on znver3.
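As a rough source-level picture of what unrolling by two does to this loop, here is a minimal sketch. It is not icc's output, and the function names s1113_plain and s1113_unrolled2 are hypothetical. Because LEN_1D is even, no remainder loop is needed.

```c
#define LEN_1D 32000
typedef float real_t;

/* The inner loop of the s1113 testcase, as written. */
void s1113_plain(real_t *a, const real_t *b)
{
    for (int i = 0; i < LEN_1D; i++)
        a[i] = a[LEN_1D / 2] + b[i];
}

/* The same loop unrolled by two in source: half as many loop-control
   instructions per element, statements still executed in the same order. */
void s1113_unrolled2(real_t *a, const real_t *b)
{
    for (int i = 0; i < LEN_1D; i += 2) {
        a[i] = a[LEN_1D / 2] + b[i];
        a[i + 1] = a[LEN_1D / 2] + b[i + 1];
    }
}
```

With gcc the equivalent transformation can be requested with -funroll-loops, as noted in the report, so the hand-unrolled form is only useful for experiments.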
[Bug tree-optimization/99414] s235 and s233 benchmarks of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414 Jan Hubicka changed: What|Removed |Added Summary|s235 benchmark of TSVC is |s235 and s233 benchmarks of |vectorized better by icc|TSVC is vectorized better |than gcc (loop interchange) |by icc than gcc (loop ||interchange) --- Comment #2 from Jan Hubicka --- another testcase typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int main(struct args_t * func_args) { //loop interchange //interchanging with one of two inner loops for (int nl = 0; nl < 100*(iterations/LEN_2D); nl++) { for (int i = 1; i < LEN_2D; i++) { for (int j = 1; j < LEN_2D; j++) { aa[j][i] = aa[j-1][i] + cc[j][i]; } for (int j = 1; j < LEN_2D; j++) { bb[j][i] = bb[j][i-1] + cc[j][i]; } } dummy(); } return aa[0][0]; }
[Bug tree-optimization/99414] s235, s2233 and s233 benchmarks of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414 Jan Hubicka changed: What|Removed |Added Summary|s235 and s233 benchmarks of |s235, s2233 and s233 |TSVC is vectorized better |benchmarks of TSVC is |by icc than gcc (loop |vectorized better by icc |interchange)|than gcc (loop interchange) --- Comment #3 from Jan Hubicka --- this one is 7s with gcc and 0.4s with icc. typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int main(struct args_t * func_args) { //loop interchange //interchanging with one of two inner loops for (int nl = 0; nl < 100*(iterations/LEN_2D); nl++) { for (int i = 1; i < LEN_2D; i++) { for (int j = 1; j < LEN_2D; j++) { aa[j][i] = aa[j-1][i] + cc[j][i]; } for (int j = 1; j < LEN_2D; j++) { bb[i][j] = bb[i-1][j] + cc[i][j]; } } dummy(); } return aa[0][0]; }
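To make the interchange concrete, here is a minimal sketch of the first inner loop of this testcase in its original order and with the i and j loops swapped. The function names are made up for illustration. The recurrence only runs along j (row j depends on row j-1), so swapping the loops keeps the result while making the inner loop walk contiguous memory with no loop-carried dependence.

```c
#define LEN_2D 256
typedef float real_t;

/* Original order: the inner j loop carries the dependence and steps
   through memory one whole row at a time. */
void column_recurrence(real_t aa[LEN_2D][LEN_2D], real_t cc[LEN_2D][LEN_2D])
{
    for (int i = 1; i < LEN_2D; i++)
        for (int j = 1; j < LEN_2D; j++)
            aa[j][i] = aa[j - 1][i] + cc[j][i];
}

/* Interchanged: the inner i loop is unit-stride and its iterations are
   independent of each other, so it is a natural candidate for vectorization. */
void column_recurrence_interchanged(real_t aa[LEN_2D][LEN_2D],
                                    real_t cc[LEN_2D][LEN_2D])
{
    for (int j = 1; j < LEN_2D; j++)
        for (int i = 1; i < LEN_2D; i++)
            aa[j][i] = aa[j - 1][i] + cc[j][i];
}
```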
[Bug tree-optimization/99414] s235, s2233, s275 and s233 benchmarks of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414 Jan Hubicka changed: What|Removed |Added Summary|s235, s2233 and s233|s235, s2233, s275 and s233 |benchmarks of TSVC is |benchmarks of TSVC is |vectorized better by icc|vectorized better by icc |than gcc (loop interchange) |than gcc (loop interchange) --- Comment #4 from Jan Hubicka --- s275: typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t a[LEN_2D],d[LEN_2D],aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int main(struct args_t * func_args) { //control flow //if around inner loop, interchanging needed for (int i = 0; i < LEN_2D; i++) aa[0][i]=1; for (int nl = 0; nl < 10*(iterations/LEN_2D); nl++) { for (int i = 0; i < LEN_2D; i++) { if (aa[0][i] > (real_t)0.) { for (int j = 1; j < LEN_2D; j++) { aa[j][i] = aa[j-1][i] + bb[j][i] * cc[j][i]; } } } dummy(); } return aa[0][0]; }
[Bug tree-optimization/99414] s235, s2233, s275, s2275 and s233 benchmarks of TSVC is vectorized better by icc than gcc (loop interchange)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99414 Jan Hubicka changed: What|Removed |Added Summary|s235, s2233, s275 and s233 |s235, s2233, s275, s2275 |benchmarks of TSVC is |and s233 benchmarks of TSVC |vectorized better by icc|is vectorized better by icc |than gcc (loop interchange) |than gcc (loop interchange) --- Comment #5 from Jan Hubicka --- s2275: typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t a[LEN_2D],b[LEN_2D],c[LEN_2D],d[LEN_2D],aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int main(struct args_t * func_args) { //loop distribution is needed to be able to interchange for (int nl = 0; nl < 100*(iterations/LEN_2D); nl++) { for (int i = 0; i < LEN_2D; i++) { for (int j = 0; j < LEN_2D; j++) { aa[j][i] = aa[j][i] + bb[j][i] * cc[j][i]; } a[i] = b[i] + c[i] * d[i]; } dummy(); } return aa[0][0]; }
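The comment in the s2275 testcase says loop distribution is needed before interchange becomes possible. A minimal sketch of that two-step rewrite, done by hand, is below. The function names and the parameter lists are assumptions for illustration, and this is not what any particular compiler emits.

```c
#define LEN_2D 256
typedef float real_t;

/* As in the testcase: the 1-D statement sits inside the i loop next to
   the j loop, so the nest is not perfectly nested. */
void s2275_fused(real_t aa[LEN_2D][LEN_2D], real_t bb[LEN_2D][LEN_2D],
                 real_t cc[LEN_2D][LEN_2D],
                 real_t *a, real_t *b, real_t *c, real_t *d)
{
    for (int i = 0; i < LEN_2D; i++) {
        for (int j = 0; j < LEN_2D; j++)
            aa[j][i] = aa[j][i] + bb[j][i] * cc[j][i];
        a[i] = b[i] + c[i] * d[i];
    }
}

/* Step 1 (distribution): split the 1-D statement into its own loop.
   Step 2 (interchange): swap i and j in the now perfect 2-D nest so the
   inner loop is unit-stride. */
void s2275_distributed(real_t aa[LEN_2D][LEN_2D], real_t bb[LEN_2D][LEN_2D],
                       real_t cc[LEN_2D][LEN_2D],
                       real_t *a, real_t *b, real_t *c, real_t *d)
{
    for (int j = 0; j < LEN_2D; j++)
        for (int i = 0; i < LEN_2D; i++)
            aa[j][i] = aa[j][i] + bb[j][i] * cc[j][i];
    for (int i = 0; i < LEN_2D; i++)
        a[i] = b[i] + c[i] * d[i];
}
```

The split is legal here because the two statements touch disjoint arrays, so neither loop observes the other's writes.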
[Bug middle-end/99634] New: s2102 benchmarks of TSVC is vectorized better by icc than gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99634 Bug ID: 99634 Summary: s2102 benchmarks of TSVC is vectorized better by icc than gcc Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t a[LEN_2D],d[LEN_2D],aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int main(struct args_t * func_args) { //diagonals //identity matrix, best results vectorize both inner and outer loops for (int nl = 0; nl < 100*(iterations/LEN_2D); nl++) { for (int i = 0; i < LEN_2D; i++) { for (int j = 0; j < LEN_2D; j++) { aa[j][i] = (real_t)0.; } aa[i][i] = (real_t)1.; } dummy(); } return aa[0][0]; } is vectorized by ic as: min: # parameter 1: %rdi ..B1.1: # Preds ..B1.0 # Execution count [5.00e-03] .cfi_startproc ..___tag_value_min.1: ..L2: #36.1 pushq %rbp #36.1 .cfi_def_cfa_offset 16 movq %rsp, %rbp#36.1 .cfi_def_cfa 6, 16 .cfi_offset 6, -16 andq $-32, %rsp#36.1 movl $aa, %edi #38.13 xorl %esi, %esi#38.13 movl $262144, %edx #38.13 call _intel_fast_memset#38.13 # LOE rbx r12 r13 r14 r15 ..B1.2: # Preds ..B1.1 # Execution count [1.00e+00] vmovups .L_2il0floatpacket.0(%rip), %ymm1 #41.24 xorl %edx, %edx#37.9 xorl %eax, %eax#37.9 vextractf128 $1, %ymm1, %xmm0 #41.13 # LOE rax rdx rbx r12 r13 r14 r15 xmm0 xmm1 ..B1.3: # Preds ..B1.3 ..B1.2 # Execution count [2.56e+02] vextractps $3, %xmm1, 44204+aa(%rax,%rdx,4) #41.13 lea (%rax,%rdx,4), %rcx #41.13 vmovss%xmm0, 45232+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm0, 46260+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm0, 47288+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm0, 48316+aa(%rax,%rdx,4) #41.13 vmovss%xmm1, 49344+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm1, 50372+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm1, 51400+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm1, 
52428+aa(%rax,%rdx,4) #41.13 vmovss%xmm0, 53456+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm0, 54484+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm0, 55512+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm0, 56540+aa(%rax,%rdx,4) #41.13 vmovss%xmm1, 57568+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm1, 58596+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm1, 59624+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm1, 60652+aa(%rax,%rdx,4) #41.13 vmovss%xmm0, 61680+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm0, 62708+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm0, 63736+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm0, 64764+aa(%rax,%rdx,4) #41.13 vmovss%xmm1, 65792+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm1, 66820+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm1, 67848+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm1, 68876+aa(%rax,%rdx,4) #41.13 vmovss%xmm0, 69904+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm0, 70932+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm0, 71960+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm0, 72988+aa(%rax,%rdx,4) #41.13 vmovss%xmm1, 74016+aa(%rax,%rdx,4) #41.13 vextractps $1, %xmm1, 75044+aa(%rax,%rdx,4) #41.13 vextractps $2, %xmm1, 76072+aa(%rax,%rdx,4) #41.13 vextractps $3, %xmm1, 77100+aa(%rax,%rdx,4)
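Reading the icc listing above, the zeroing appears to have been turned into a single _intel_fast_memset call over 262144 bytes (256*256 floats), followed by a separate loop that stores the diagonal. A minimal C sketch of that shape is below. The function names are hypothetical, the standard memset stands in for Intel's routine, and the sketch assumes IEEE 754 floats, where all-bits-zero is 0.0f.

```c
#include <string.h>

#define LEN_2D 256
typedef float real_t;

/* Loop nest as written in the s2102 testcase. */
void identity_loops(real_t aa[LEN_2D][LEN_2D])
{
    for (int i = 0; i < LEN_2D; i++) {
        for (int j = 0; j < LEN_2D; j++)
            aa[j][i] = (real_t)0.;
        aa[i][i] = (real_t)1.;
    }
}

/* Split form: clear the whole matrix in one call, then set the diagonal.
   Assumes all-bits-zero represents 0.0f (true for IEEE 754). */
void identity_memset(real_t aa[LEN_2D][LEN_2D])
{
    memset(aa, 0, sizeof(real_t) * LEN_2D * LEN_2D);
    for (int i = 0; i < LEN_2D; i++)
        aa[i][i] = (real_t)1.;
}
```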
[Bug middle-end/99638] New: s132 benchmarks of TSVC on zen3 benefits from -mno-fma
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638 Bug ID: 99638 Summary: s132 benchmarks of TSVC on zen3 benefits from -mno-fma Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 100 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t flat_2d_array[LEN_2D*LEN_2D]; real_t x[LEN_1D]; real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D], aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int indx[LEN_1D]; real_t* __restrict__ xx; real_t* yy; // %2.5 void main() { //global data flow analysis //loop with multiple dimension ambiguous subscripts int m = 0; int j = m; int k = m+1; for (int nl = 0; nl < 400*iterations; nl++) { for (int i= 1; i < LEN_2D; i++) { aa[j][i] = aa[k][i-1] + b[i] * c[1]; } dummy(); } } compiled with -Ofast -march=native runs 4.4s compared to 4.2s with -Ofast -march=native -mno-fma
[Bug middle-end/99638] s132 and s281 benchmarks of TSVC on zen3 benefits from -mno-fma
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638 Jan Hubicka changed: What|Removed |Added CC||jamborm at gcc dot gnu.org Summary|s132 benchmarks of TSVC on |s132 and s281 benchmarks of |zen3 benefits from -mno-fma |TSVC on zen3 benefits from ||-mno-fma --- Comment #1 from Jan Hubicka --- s281 benchmark: typedef float real_t; #define iterations 100 #define LEN_1D 32000 #define LEN_2D 256 // array definitions real_t flat_2d_array[LEN_2D*LEN_2D]; real_t x[LEN_1D]; real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D], aa[LEN_2D][LEN_2D],bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int indx[LEN_1D]; real_t* __restrict__ xx; real_t* yy; // %2.5 void main() { //crossing thresholds //index set splitting //reverse data access real_t x; for (int nl = 0; nl < iterations; nl++) { for (int i = 0; i < LEN_1D; i++) { x = a[LEN_1D-i-1] + b[i] * c[i]; a[i] = x-(real_t)1.0; b[i] = x; } dummy(); } } with FMA runs 18s and without 14s
[Bug middle-end/99646] New: s111 benchmark of TSVC preffers -mprefer-avx128 on zen3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99646 Bug ID: 99646 Summary: s111 benchmark of TSVC preffers -mprefer-avx128 on zen3 Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- typedef float real_t; #define iterations 10 #define LEN_1D 32000 #define LEN_2D 256 real_t a[LEN_1D],b[LEN_1D],aa[LEN_2D][LEN_2D]; void main() { //linear dependence testing //no dependence - vectorizable for (int nl = 0; nl < 2*iterations; nl++) { for (int i = 1; i < LEN_1D; i += 2) { a[i] = a[i - 1] + b[i]; } dummy(); } } takes 0.73s with -march=native -Ofast -mprefer-avx128 and 0.81s with -march=native -Ofast 128bit version is: main: .LFB0: .cfi_startproc pushq %rbx .cfi_def_cfa_offset 16 .cfi_offset 3, -16 movl$20, %ebx .L2: xorl%eax, %eax .p2align 4 .p2align 3 .L4: vmovaps a(%rax), %xmm2 vmovups b+4(%rax), %xmm3 addq$32, %rax vshufps $136, a-16(%rax), %xmm2, %xmm0 vshufps $136, b-12(%rax), %xmm3, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vmovss %xmm0, a-28(%rax) vextractps $1, %xmm0, a-20(%rax) vextractps $2, %xmm0, a-12(%rax) vextractps $3, %xmm0, a-4(%rax) cmpq$127968, %rax jne .L4 vmovss b+127972(%rip), %xmm0 xorl%eax, %eax vaddss a+127968(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127972(%rip) vmovss a+127976(%rip), %xmm0 vaddss b+127980(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127980(%rip) vmovss a+127984(%rip), %xmm0 vaddss b+127988(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127988(%rip) vmovss a+127992(%rip), %xmm0 vaddss b+127996(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127996(%rip) calldummy main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq%rsp, %rbp .cfi_def_cfa_register 6 pushq %rbx .cfi_offset 3, -24 movl$20, %ebx andq$-32, %rsp .p2align 4 .p2align 3 .L2: xorl%eax, %eax .p2align 4 .p2align 3 .L4: vmovaps a(%rax), %ymm4 vmovups b+4(%rax), %ymm5 addq$64, %rax vshufps $136, a-32(%rax), %ymm4, %ymm1 vperm2f128 $3, %ymm1, %ymm1, 
%ymm2 vshufps $68, %ymm2, %ymm1, %ymm0 vshufps $238, %ymm2, %ymm1, %ymm2 vshufps $136, b-28(%rax), %ymm5, %ymm1 vinsertf128 $1, %xmm2, %ymm0, %ymm0 vperm2f128 $3, %ymm1, %ymm1, %ymm2 vshufps $68, %ymm2, %ymm1, %ymm3 vshufps $238, %ymm2, %ymm1, %ymm2 vinsertf128 $1, %xmm2, %ymm3, %ymm1 vaddps %ymm1, %ymm0, %ymm0 vmovss %xmm0, a-60(%rax) vextractps $1, %xmm0, a-52(%rax) vextractps $2, %xmm0, a-44(%rax) vextractps $3, %xmm0, a-36(%rax) vextractf128$0x1, %ymm0, %xmm0 vmovss %xmm0, a-28(%rax) vextractps $1, %xmm0, a-20(%rax) vextractps $2, %xmm0, a-12(%rax) vextractps $3, %xmm0, a-4(%rax) cmpq$127936, %rax jne .L4 vmovaps a+127936(%rip), %xmm6 vmovups b+127940(%rip), %xmm7 xorl%eax, %eax vshufps $136, a+127952(%rip), %xmm6, %xmm0 vshufps $136, b+127956(%rip), %xmm7, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vmovss %xmm0, a+127940(%rip) vextractps $1, %xmm0, a+127948(%rip) vextractps $2, %xmm0, a+127956(%rip) vextractps $3, %xmm0, a+127964(%rip) vmovss b+127972(%rip), %xmm0 vaddss a+127968(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127972(%rip) vmovss b+127980(%rip), %xmm0 vaddss a+127976(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127980(%rip) vmovss b+127988(%rip), %xmm0 vaddss a+127984(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127988(%rip) vmovss a+127992(%rip), %xmm0 vaddss b+127996(%rip), %xmm0, %xmm0 vmovss %xmm0, a+127996(%rip) vzeroupper calldummy
[Bug ipa/99785] Awful lot of time spent building gl.cc in Firefox
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99785 --- Comment #15 from Jan Hubicka --- We run into the size estimate with always_inlines because after inlining we update the size of the caller (because that does matter when inlining normal functions). We already have a special-purpose always-inliner to avoid some of these issues, so I guess we keep running into this during the late IPA inlining? Honza
[Bug ipa/99785] Awful lot of time spent building gl.cc in Firefox
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99785 --- Comment #16 from Jan Hubicka --- OK, we seem to handle all relevant always_inlines in the early passes, and then we produce a large function with many non-always_inline calls that we spend a lot of time inlining. This is because we have relative function growth bounds that are quite high, so we manage to get a lot of inlining done. I guess clang hits its cap on those earlier. I will check if I can save some compile time. Honza
[Bug ipa/99751] [11 Regression] wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99751 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- mine.
[Bug rtl-optimization/97836] wrong code at -O1 on x86_64-pc-linux-gnu by r11-5029
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97836 --- Comment #8 from Jan Hubicka --- Indeed, I think for GCC 11 we want to make a return mark the value as used, and for the next stage1 we want to design the EAF flags a bit more carefully...
[Bug ipa/99751] [11 Regression] wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99751 --- Comment #8 from Jan Hubicka --- So we wrongly identify nodirectescape in store_to_c. This is due to an early exit in analyze_call that does not account for a const call possibly returning its parameter. (This is an early confusion in the EAF tracking logic, from before I settled on the fact that returns are not escapes for local PTA.) I am looking into a fix. It is odd that this did not show up earlier.
[Bug ipa/99751] [11 Regression] wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99751 --- Comment #9 from Jan Hubicka --- OK, so there actually is logic to handle return values (even for const calls), but its if branches are the wrong way around. I am testing the attached fix.

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 7aaf53be8f4..5f33bb5b410 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1545,9 +1545,9 @@ merge_call_lhs_flags (gcall *call, int arg, int index, bool deref,
       tree lhs = gimple_call_lhs (call);
       analyze_ssa_name_flags (lhs, lattice, depth + 1, ipa);
       if (deref)
-        lattice[index].merge (lattice[SSA_NAME_VERSION (lhs)]);
-      else
         lattice[index].merge_deref (lattice[SSA_NAME_VERSION (lhs)], false);
+      else
+        lattice[index].merge (lattice[SSA_NAME_VERSION (lhs)]);
     }
   /* In the case of memory store we can do nothing.  */
   else
@@ -1621,7 +1621,7 @@ analyze_ssa_name_flags (tree name, vec &lattice, int depth,
       else if (gcall *call = dyn_cast <gcall *> (use_stmt))
         {
           tree callee = gimple_call_fndecl (call);
-          /* Return slot optiomization would require bit of propagation;
+          /* Return slot optimization would require bit of propagation;
              give up for now.  */
           if (gimple_call_return_slot_opt_p (call)
               && gimple_call_lhs (call) != NULL_TREE
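To illustrate the kind of pattern being discussed, a const function that hands its pointer argument back to the caller, here is a small hypothetical example. It is not the testcase from the PR, and the names pass_through and store_via_result are made up. The point is that a store through the returned pointer modifies the object the argument pointed to, so an analysis must not treat the argument as unused just because the callee is const.

```c
/* Hypothetical illustration only; not the testcase attached to the PR. */

/* A const function may not read memory through its pointer argument,
   but nothing stops it from returning the pointer value itself. */
__attribute__((const, noinline))
static int *pass_through(int *p)
{
    return p;
}

int store_via_result(void)
{
    int local = 0;
    int *q = pass_through(&local);  /* q aliases &local */
    *q = 42;                        /* store reaches local via the return value */
    return local;                   /* a correct compiler must return 42 */
}
```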
[Bug ipa/99751] [11 Regression] wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99751 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Jan Hubicka --- Forgot PR marker in changelog, but it is fixed by g:dd64aaafe6916ac11ccae3182b4550c8b8f5e066
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 --- Comment #15 from Jan Hubicka --- I also tried to reproduce this locally, without luck. Looking at the backtrace in detail, there is no DEF_STMT involved. It walks from DWARF DIEs to an RTL constant pool address that points to a tree, which has an abstract origin that points to a symtab node, which points to a callgraph edge, which points to a dead basic block. It is the pointer from the cgraph node to the edge that should have been removed. I can add code to clear the SSA_NAME->def_stmt pointers, but there is no def stmt in the backtrace, so it would not help here. Without a reproducer it seems hard to tell what is or was the real cause of this issue... Honza
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 --- Comment #16 from Jan Hubicka --- I was trying to reproduce some kind of ICE for a while, also trying a rebuild with garbage collection forced on every ggc_collect call, but no luck. I wonder if you happen to know which specific gcc regression was failing and whether it was patched or not...
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 --- Comment #18 from Jan Hubicka ---
> Looking around the only place (we don't know whether this was WPA or LTRANS)
> we'd have a cgraph with edges is during clone materialization which pointed
> me at cgraph_node::release_body which frees the body but fails to eventually
> zap ->call_stmt references

This I agree with, but during our last discussion I went through all release_body calls and found none that would match this scenario: they are all on paths where we zap the cgraph edges too (this is the only arrangement that makes sense, since we are supposed to keep cgraph edges in sync with the actual body, and freeing the body alone would leave the cgraph in an inconsistent state). I will try to move my tree to 20210306 and see if that helps. I can simply add cgraph edge removal to release_body to make the code a bit more robust: while most users erase the edges earlier, it is almost free to check the pointer for being NULL twice. Still, it is weird that the bug does not reproduce with always-collect.
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 --- Comment #19 from Jan Hubicka --- Created attachment 50485 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50485&action=edit small refactoring This patch moves the edge removal into release_body and removes the explicit calls on those paths where the removal is done just after the call to it (as opposed to being done earlier or via a reset call). Still, there is no code path where it should make a difference. Perhaps the assert will catch something interesting. Tests are running.
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|WAITING --- Comment #20 from Jan Hubicka --- I re-tried with g:0ad6a2e2f0c667f9916cfcdb81f41f6055f1d0b3 and it builds fine even with --param ggc-min-expand=0 --param ggc-min-heapsize=0. It seems that --enable-checking=gcac is now a no-op. @doko: perhaps using --param ggc-min-expand=0 --param ggc-min-heapsize=0 on your setup may trigger the problem again. There is some chance that e.g. the Qt headers are the cause, but I am tempted to close the bug as WORKSFORME after committing the refactoring patch.
[Bug ipa/99447] [11 Regression] ICE (segfault) in lookup_page_table_entry
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99447 Jan Hubicka changed: What|Removed |Added Status|NEW |WAITING --- Comment #27 from Jan Hubicka --- Even with PIE and fat LTO the compilation works well. In addition I committed a patch that should make it clear that we do not keep stale pointers. Without a reproducer I am not sure what more we can do, so perhaps we can resolve it as WORKSFORME.
[Bug ipa/99309] [10/11 Regression] Segmentation fault with __builtin_constant_p usage at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309 --- Comment #5 from Jan Hubicka --- As discussed, I can prepare a patch to make the inliner redirect __builtin_constant_p to __builtin_true whenever the inliner detects that the expression is a compile-time constant. This will keep us from eventually hitting unreachable when late optimizations forget to make the transformation. I was worried about this idea since it will still lead to some inconsistency: uses guarded by the __builtin_constant_p may or may not be constant propagated, and it seems logical to assume that in the block guarded by __builtin_constant_p the expression will indeed evaluate to a compile-time constant. However, we can get similar inconsistencies with alias oracle walking limits as well, so these constructions are generally fragile (but seem increasingly common in C++ codebases). It would still be nice to have fre5 constant propagate this. The IPA analyses are very simplistic. Richi, any idea on this?
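To make the fragility concrete, here is a minimal sketch of the kind of construction being discussed: a branch guarded by __builtin_constant_p. The function names are made up for illustration and are not taken from the bug's testcase. Since whether the builtin folds to 1 depends on optimization level and inlining, both branches are written to compute the same value, so the program does not rely on which one is chosen.

```c
/* Hypothetical helper: take a "fast path" when the shift amount is
   known at compile time.  Either branch yields the same result.  */
static inline unsigned scale_by_pow2(unsigned x, unsigned shift)
{
  if (__builtin_constant_p(shift) && shift == 3)
    return x * 8u;            /* path selected when shift folds to 3 */
  return x << shift;          /* generic path */
}

/* After inlining, shift is the literal 3 here.  */
unsigned scale8(unsigned x) { return scale_by_pow2(x, 3); }

/* Here shift is only known at run time.  */
unsigned scale_any(unsigned x, unsigned s) { return scale_by_pow2(x, s); }
```

Code that instead places something only valid for constants (for example an asm with an immediate operand, or a call that is expected to be folded away) in the guarded branch is what breaks when the builtin and the later propagation disagree.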
[Bug ipa/98265] [10/11 Regression] gcc-10 has significantly worse code generated with -O2 compared to -O1 (or gcc-9 -O2) when using the Eigen C++ library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98265 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #10 from Jan Hubicka ---
Trunk now generates on the unreduced testcase:

        .file   "test.cpp"
        .text
        .p2align 4
        .globl  _Z1f
        .type   _Z1f, @function
_Z1f:
.LFB6287:
        .cfi_startproc
        mulss   %xmm3, %xmm0
        movq    %rdi, %rax
        mulss   %xmm3, %xmm1
        mulss   %xmm3, %xmm2
        movss   %xmm0, (%rdi)
        movss   %xmm1, 4(%rdi)
        movss   %xmm2, 8(%rdi)
        ret
        .cfi_endproc
.LFE6287:
        .size   _Z1f, .-_Z1f
        .ident  "GCC: (GNU) 11.0.1 20210331 (experimental)"
        .section        .note.GNU-stack,"",@progbits
[Bug middle-end/99857] [11 Regression] FAIL: libgomp.c/declare-variant-1.c (test for excess errors) by r11-7926
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99857 --- Comment #6 from Jan Hubicka ---
Thanks for the testcase, it makes things easier to debug indeed :) The problem is that OpenMP uses declare_variant_alt on symbols to make them special definitions, but the definition flag is not set. That makes free_lang_data call release_body, and since the code depends on references, things get out of sync. I am testing:

diff --git a/gcc/tree.c b/gcc/tree.c
index 7c44c226a33..e4e74ac8afc 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -5849,7 +5849,7 @@ free_lang_data_in_decl (tree decl, class free_lang_data_d *fld)
       if (!(node = cgraph_node::get (decl))
           || (!node->definition && !node->clones))
         {
-          if (node)
+          if (node && !node->declare_variant_alt)
             node->release_body ();
           else
             {

For next stage1 I think we want to set the definition bit for them and remove all the special cases of declare_variant_alt that make them behave as definitions. We also want to add checking that !definition symbols are external symbols, which is missing in the verifier.
[Bug lto/100010] [8/9/10/11 Regression] ICE in lto_output_node, at lto-cgraph.c:447 (-fdevirtualize-at-ltrans) since r6-6384-gceda2c69d5219719
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100010 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #2 from Jan Hubicka --- mine.
[Bug ipa/92535] [10 regression] ICF is relatively expensive and became less effective
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92535 Jan Hubicka changed: What|Removed |Added Status|ASSIGNED|NEW Summary|[10/11 regression] ICF is |[10 regression] ICF is |relatively expensive and|relatively expensive and |became less effective |became less effective --- Comment #17 from Jan Hubicka --- For GCC 11 we now get faster build times with ICF than without on the cc1plus, Firefox and clang LTO builds. So I think we can consider it no longer a regression, although ICF can always be improved (and I have some changes queued for next stage1). I have no plan to backport this to GCC 10, so unassigning.
[Bug ipa/99309] [10/11 Regression] Segmentation fault with __builtin_constant_p usage at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99309 Jan Hubicka changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #9 from Jan Hubicka --- I have a WIP patch that attaches predicates to builtin_constant_p and redirects it to true if the inliner works out that the argument is a constant (still relying on late passes to optimize the if branch well). From all the options I can think of this seems best, even though in relatively rare cases we may do the (very simple) propagation at IPA time while late optimizations won't. Without explicitly disabling passes (where I think this is fine to happen), all testcases we have seen so far were of the form where the constant was eventually propagated, but only after we folded builtin_constant_p to false. Overall it is not possible to ensure that builtin_constant_p on memory will fold to true only if all uses of the memory later in the if branch will fold to a constant, since AO has walking limits.
[Bug ipa/80726] [8/9/10/11 Regression] Destructor not inlined anymore (regression)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80726 Jan Hubicka changed: What|Removed |Added Resolution|--- |DUPLICATE Status|ASSIGNED|RESOLVED --- Comment #9 from Jan Hubicka --- This is a dup that is fixed on mainline. *** This bug has been marked as a duplicate of bug 98265 ***
[Bug ipa/98265] [10 Regression] gcc-10 has significantly worse code generated with -O2 compared to -O1 (or gcc-9 -O2) when using the Eigen C++ library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98265 Jan Hubicka changed: What|Removed |Added CC||cuzdav at gmail dot com --- Comment #12 from Jan Hubicka --- *** Bug 80726 has been marked as a duplicate of this bug. ***
[Bug ipa/92394] operand_equal_p should compare as base+offset when comparing addresses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92394 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #9 from Jan Hubicka --- This is fixed now since we have way to overload operand_equal_p in ICF.
[Bug ipa/97389] [11 Regression] Segfault in tramp3d since r11-3825-g71dbabccbfb295c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97389 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #1 from Jan Hubicka --- Mine.
[Bug ipa/97389] [11 Regression] Segfault in tramp3d since r11-3825-g71dbabccbfb295c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97389 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #3 from Jan Hubicka --- Fixed.
[Bug bootstrap/97350] [11 Regression] Ada bootstrap fails with: self_referential_size, at stor-layout.c:172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97350 --- Comment #7 from Jan Hubicka ---
Interesting, I get a different ICE:

during GIMPLE pass: slp
../../gcc/ada/libgnat/s-os_lib.adb: In function ‘system__os_lib__normalize_pathname__missed_drive_letter’:
../../gcc/ada/libgnat/s-os_lib.adb:2133:7: internal compiler error: in vect_init_pattern_stmt, at tree-vect-patterns.c:115
 2133 | function Missed_Drive_Letter (Name : String) return Boolean is
      | ^
0x6534d9 vect_init_pattern_stmt
        ../../gcc/tree-vect-patterns.c:115
0x13e2913 vect_set_pattern_stmt
        ../../gcc/tree-vect-patterns.c:133
0x13e2913 vect_mark_pattern_stmts
        ../../gcc/tree-vect-patterns.c:5287
0x13e2913 vect_pattern_recog_1
        ../../gcc/tree-vect-patterns.c:5403
0x13ef3a1 vect_pattern_recog(vec_info*)
        ../../gcc/tree-vect-patterns.c:5543
0xcda2ce vect_slp_analyze_bb_1
        ../../gcc/tree-vect-slp.c:3819
0xcda2ce vect_slp_region
        ../../gcc/tree-vect-slp.c:3918
0xcda2ce vect_slp_bbs
        ../../gcc/tree-vect-slp.c:4074
0xcdb9d8 vect_slp_function(function*)
        ../../gcc/tree-vect-slp.c:4125
0xcdd085 execute
        ../../gcc/tree-vectorizer.c:1432
[Bug ipa/97403] New: Ancestor jump function should be generalized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97403
  Bug ID: 97403
  Summary: Ancestor jump function should be generalized
  Product: gcc
  Version: unknown
  Status: UNCONFIRMED
  Severity: normal
  Priority: P3
  Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  CC: marxin at gcc dot gnu.org
  Target Milestone: ---

In the following we should be able to propagate through test in ipa-cp, but we are not.

struct foo {int bar;};

__attribute__ ((noinline))
test2(int *p)
{
  return *p;
}

__attribute__ ((noinline))
test (struct foo *array)
{
  return test2 (&array[4].bar);
}

main()
{
  const struct foo array[5]={{1},{2},{3},{4},{5}};
  test(array);
}
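For experimenting, a standalone variant of the testcase with explicit return types and const-correct parameters may be handy. This is a sketch rather than the reporter's exact code: the wrapper `run` is a name introduced here so that the value ipa-cp would ideally propagate (the `bar` field of `array[4]`, i.e. 5) can be checked.

```c
struct foo { int bar; };

/* Loads through a pointer that, in the caller, is the address of a
   field inside an array element (an "ancestor" of the passed object).  */
__attribute__((noinline)) int test2(const int *p)
{
  return *p;
}

__attribute__((noinline)) int test(const struct foo *array)
{
  return test2(&array[4].bar);
}

/* Hypothetical driver: returns what test2 ultimately loads.  */
int run(void)
{
  const struct foo array[5] = {{1}, {2}, {3}, {4}, {5}};
  return test(array);
}
```

Whether a given compiler version actually propagates the constant through both calls can be inspected in the ipa-cp dump (for example with -fdump-ipa-cp).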
[Bug bootstrap/97350] [11 Regression] Ada bootstrap fails with: self_referential_size, at stor-layout.c:172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97350 --- Comment #10 from Jan Hubicka ---
OK, I was poking at the problem a bit, and indeed the gnat bootstrapped with -O3 and PGO ICEs, while gnat built normally does not. We fail:

#2 0x019b7dcb in _Z13variable_sizeP9tree_node (size=0x77448900) at ../../gcc/stor-layout.c:172
172 gcc_assert (self_refs.length () > 0);
(gdb) l
167 if (TREE_CODE (t) == CALL_EXPR || self_referential_component_ref_p (t))
168 return size;
169
170 /* Collect the list of self-references in the expression. */
171 find_placeholder_in_expr (size, &self_refs);
172 gcc_assert (self_refs.length () > 0);
173
174 /* Obtain a private copy of the expression. */
175 t = size;

Here the gcc_assert fires. Sadly self_refs has no debug info. Size is:

unit-size align:128 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7745f0a8 precision:128 min max > readonly arg:0 unit-size align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7746ae70 precision:8 min max context RM size RM max > readonly visited arg:0 readonly visited arg:0 readonly nothrow visited arg:0 visited arg:0 visited> arg:1 >> arg:1 > arg:1 readonly visited arg:0 arg:1 >> arg:1 readonly arg:0 readonly arg:0 readonly arg:0 readonly arg:0 readonly visited arg:0 > arg:1 readonly visited arg:0 >> arg:1 >> arg:1 > arg:2 constant 0>>

.P_BOUNDS->UB0 >= .P_BOUNDS->LB0 ? (bitsizetype) (((sizetype) .P_BOUNDS->UB0 - (sizetype) .P_BOUNDS->LB0) + 1) * 8 : 0;

I am not an expert on Ada type sizes, but it seems like a well-formed expression.
The backtrace is:

#0 _Z14internal_errorPKcz (gmsgid=0xac ) at ../../gcc/diagnostic.c:1752
#1 0x010ba114 in _Z11fancy_abortPKciS0_ (file=0x23a38a8 "in %s, at %s:%d", line=172, function=0x1e507bb "self_referential_size") at ../../gcc/diagnostic.c:1824
#2 0x019b7dcb in _Z13variable_sizeP9tree_node (size=0x77448900) at ../../gcc/stor-layout.c:172
#3 _Z13variable_sizeP9tree_node (size=0x77448900) at ../../gcc/stor-layout.c:67
#4 0x0128f4e0 in finalize_type_size (type=0x7746c3f0) at ../../gcc/stor-layout.c:1967
#5 0x0128df40 in _Z11layout_typeP9tree_node (type=0x23a38a8) at ../../gcc/stor-layout.c:2625
#6 0x0190e307 in _ZL18build_array_type_1P9tree_nodeS0_bbb.lto_priv.0 (elt_type=0x7745f3f0, index_type=0x7746c348, typeless_storage=59, shared=172, set_canonical=59) at ../../gcc/tree.c:8194
#7 0x01567bcc in _Z18gnat_to_gnu_entityiP9tree_nodeb (gnat_entity=37370024, gnu_expr=0x1e507bb, definition=59) at ../../gcc/ada/gcc-interface/decl.c:2366
#8 0x015618f5 in _Z16gnat_to_gnu_typei (gnat_entity=37370024) at ../../gcc/ada/gcc-interface/decl.c:4887
#9 0x015687a9 in _Z18gnat_to_gnu_entityiP9tree_nodeb (gnat_entity=37370024, gnu_expr=0x1e507bb, definition=59) at ../../gcc/ada/gcc-interface/decl.c:4814
#10 0x015618f5 in _Z16gnat_to_gnu_typei (gnat_entity=37370024) at ../../gcc/ada/gcc-interface/decl.c:4887
#11 0x019ea47c in gigi (gnat_root=37370024, max_gnat_node=31786939, number_name=30016059, nodes_ptr=0xac, flags_ptr=0x1ca023b, next_node_ptr=0x73, prev_node_ptr=0x0, elists_ptr=0x0, elmts_ptr=0x0, strings_ptr=0x0, string_chars_ptr=0x0, list_headers_ptr=0x0, number_file=12, file_info_ptr=0x7fffe3c0, standard_boolean=16, standard_integer=37, standard_character=107, standard_long_long_float=100, standard_exception_type=1704, gigi_operating_mode=0) at ../../gcc/ada/gcc-interface/trans.c:463
#12 0x019e406d in back_end__call_back_end (mode=(unknown: 1704)) at ../../gcc/ada/back_end.adb:155
#13 0x01928eed in _ada_gnat1drv () at ../../gcc/ada/gnat1drv.adb:1608
#14 0x01910a4b in _ZL15gnat_parse_filev.lto_priv.0 () at ../../gcc/ada/gcc-interface/misc.c:118
#15 0x019107f4 in _ZL12compile_filev.lto_priv.0 () at ../../gcc/toplev.c:460
#16 0x018f3296 in _ZN6toplev4mainEiPPc (this=0x7fffe63e, argc=21, argv=0x7fffe728) at ../../gcc/toplev.c:2321
#17 0x018f26ec in main (argc=30016059, argv=0x1ca023b) at ../../gcc/main.c:39

Breakpointing on 171 works and vector seems to be filled in. However the disassembly shows:

   0x019b7db2 <+98>:  callq 0x1a1c050 <_Z24find_placeholder_in_exprP9tree_nodeP3vecIS0_7va_heap6vl_ptrE>
=> 0x019b7db7 <+103>: mov   $0x1e507bb,%edx
   0x019b7dbc <+108>: mov   $0xac,%esi
   0x019b7dc1 <+113>: mov   $0x1ca0231,%edi
   0x019b7dc6 <+118>: callq 0x10ba0f0 <_Z11fancy_abortPKciS0_>

so it
[Bug bootstrap/97350] [11 Regression] Ada bootstrap fails with: self_referential_size, at stor-layout.c:172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97350 --- Comment #11 from Jan Hubicka ---
In WPA we seem to see the store to vector:

Propagated modref for push_without_duplicates/1089577
  loads:
    Limits: 32 bases, 16 refs
    Every base
  stores:
    Limits: 32 bases, 16 refs
      Base 0:struct vec (alias set 544)
        Ref 0:unsigned int (alias set 3)
          Every access
      Base 1:union tree_node * (alias set 21)
        Ref 0:union tree_node * (alias set 21)
          Every access
Propagated modref for find_placeholder_in_expr/1089578
  loads:
    Limits: 32 bases, 16 refs
    Every base
  stores:
    Limits: 32 bases, 16 refs
      Base 0:struct vec (alias set 544)
        Ref 0:unsigned int (alias set 3)
          Every access
      Base 1:union tree_node * (alias set 21)
        Ref 0:union tree_node * (alias set 21)
          Every access

I guess base 0, ref 0 is the length adjustment (m_num is unsigned int). What seems interesting is that find_placeholder_in_expr lives in a different partition than variable_size. It is read as:

Read modref for find_placeholder_in_expr/1089578
  loads:
    Limits: 32 bases, 16 refs
    Every base
  stores:
    Limits: 32 bases, 16 refs
      Base 0: alias set 17
        Ref 0: alias set 3
          Every access
      Base 1: alias set 16
        Ref 0: alias set 16
          Every access

so alias sets 17 and 3 are vec and unsigned int. However in fre3 we get:

ipa-modref: call stmt find_placeholder_in_expr (size_8(D), &self_refs);
ipa-modref: call to find_placeholder_in_expr/1089578 does not clobber ref: self_refs.m_vec alias sets: 11->12

This seems odd: alias sets 11 and 12 seem quite different from 17 and 3. Moreover, 3 is the usual alias set for a builtin type (unsigned int).
[Bug bootstrap/97350] [11 Regression] Ada bootstrap fails with: self_referential_size, at stor-layout.c:172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97350 --- Comment #12 from Jan Hubicka ---
Aha, the code in question is:

# USE = nonlocal null { D.8330 D.22051 D.22054 D.22059 D.22060 } (nonlocal, escaped, interposable)
# CLB = nonlocal null { D.8330 D.22051 D.22054 D.22059 D.22060 } (nonlocal, escaped, interposable)
find_placeholder_in_expr (size_8(D), &self_refs);
# PT = nonlocal escaped null
_30 = self_refs.m_vec;
if (_30 != 0B)
  goto ; [100.00%]
else
  goto ; [0.00%]

[count: 7690]:
_31 = MEM[(const struct vec *)_30].m_vecpfx.m_num;
if (_31 == 0)
  goto ; [0.00%]
else
  goto ; [100.00%]

What we seem to optimize out is the access to m_vec; here alias set 12 makes more sense. And indeed it seems that this is missing in the summary. Smells like a bug in ipa_merge_modref_summary_after_inlining, since the function is split and re-merged by the inliner.
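The shape of the miscompile can be modelled with a small sketch. All names below (`struct vec`, `struct holder`, `collect`, `count_after_collect`) are hypothetical and only mimic the pattern in the GIMPLE above: the callee stores to the `m_vec` field of an object passed by address, so the caller must reload `m_vec` after the call. If a summary wrongly claims the call does not clobber `self_refs.m_vec`, the compiler may reuse the initial NULL and conclude the length is 0.

```c
#include <stddef.h>

/* Hypothetical miniature of a heap vector with a lazily set pointer.  */
struct vec { size_t num; void *items[4]; };
struct holder { struct vec *m_vec; };

static struct vec storage;

/* Callee: writes h->m_vec on first use and bumps the element count.
   Both stores must be visible in its mod/ref summary.  */
__attribute__((noinline)) void collect(void *item, struct holder *h)
{
  if (!h->m_vec) {
    storage.num = 0;
    h->m_vec = &storage;
  }
  if (h->m_vec->num < 4)
    h->m_vec->items[h->m_vec->num++] = item;
}

/* Caller: self_refs.m_vec starts as NULL and is only valid to read
   again after the call, because the call may have changed it.  */
size_t count_after_collect(void *item)
{
  struct holder self_refs = { NULL };
  collect(item, &self_refs);
  return self_refs.m_vec ? self_refs.m_vec->num : 0;
}
```

With a correct summary the function returns 1; the failing assert in the bug corresponds to the caller seeing 0 instead.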
[Bug bootstrap/97350] [11 Regression] Ada bootstrap fails with: self_referential_size, at stor-layout.c:172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97350 --- Comment #13 from Jan Hubicka ---
Bug in SCC discovery. I am testing:

diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c
index 4f86b9ccea1..771a0a88f9a 100644
--- a/gcc/ipa-modref.c
+++ b/gcc/ipa-modref.c
@@ -1603,6 +1603,11 @@ make_pass_ipa_modref (gcc::context *ctxt)
 static bool
 ignore_edge (struct cgraph_edge *e)
 {
+  /* We merge summaries of inline clones into summaries of functions they
+     are inlined to.  For that reason the complete function bodies must
+     act as unit.  */
+  if (!e->inline_failed)
+    return false;
   enum availability avail;
   cgraph_node *callee = e->callee->function_or_virtual_thunk_symbol
                           (&avail, e->caller);
[Bug c/97172] [11 Regression] ICE: tree code ‘ssa_name’ is not supported in LTO streams since r11-3303-g6450f07388f9fe57
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97172 --- Comment #8 from Jan Hubicka --- Generally LTO is organized into a global stream containing types, decls etc. and local streams containing function bodies and initializers. The global stream thus cannot contain references that are local to function bodies, like SSA_NAME, because these are not instantiated at the WPA stage and thus have no meaning. The ICE is about an SSA_NAME being referred to by something that is in the global stream. Judging from the testcase, there is probably a reference to a variably sized type, and that type now has an SSA_NAME in its TYPE_SIZE or so, which should not happen.
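For readers unfamiliar with types whose size is a function-local value, here is a minimal C99 sketch. It assumes that the type the comment has in mind is a C variably modified type (that reading is an assumption, not something the comment states), and `row_bytes` is a name made up for illustration. The size of `row_t` depends on the run-time value of `n`, which is exactly the kind of function-local quantity that makes no sense in a stream shared across all functions.

```c
#include <stddef.h>

/* The typedef names a variably modified type: its size is computed from
   the parameter n when the declaration is reached at run time.  */
size_t row_bytes(int n)
{
  typedef int row_t[n];
  return sizeof(row_t);       /* evaluated at run time, not a constant */
}
```

For positive `n`, `row_bytes(n)` returns `n * sizeof(int)`.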
[Bug c/97445] Some fonctions marked static inline in Linux kernel are not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445 --- Comment #19 from Jan Hubicka ---
get_order unwinds to:

[local count: 1073741824]:
_1 = __builtin_constant_p (size_68(D));
if (_1 != 0)
  goto ; [50.00%]
else
  goto ; [50.00%]

[local count: 536870913]:
if (size_68(D) == 0)
  goto ; [21.72%]
else
  goto ; [78.28%]

[local count: 420262548]:
if (size_68(D) <= 4095)
  goto ; [50.00%]
else
  goto ; [50.00%]

[local count: 210131274]:
_2 = size_68(D) + 18446744073709551615;
_3 = __builtin_constant_p (_2);
if (_3 != 0)
  goto ; [50.00%]
else
  goto ; [50.00%]

[local count: 105065637]:
_4 = (signed long) _2;
if (_4 >= 0)
  goto ; [59.00%]
else
  goto ; [41.00%]

... [very long code]

[local count: 105065637]:
__asm__("bsrq %1,%q0" : "=r" bitpos_75 : "rm" _2, "0" -1);
iftmp.1_73 = bitpos_75 + -11;

[local count: 210131274]:
# iftmp.1_67 = PHI <52(6), iftmp.1_73(69), 51(7), 50(8), 49(9), 48(10), 47(11), 46(12), 45(13), 44(14), 43(15), 42(16), 41(17), 40(18), 39(19), 38(20), 37(21), 36(22), 35(23), 34(24), 33(25), 32(26), 31(27), 30(28), 29(29), 28(30), 27(31), 26(32), 25(33), 24(34), 23(35), 22(36), 21(37), 20(38), 19(39), 18(40), 17(41), 16(42), 15(43), 14(44), 13(45), 12(46), 11(47), 10(48), 9(49), 8(50), 7(51), 6(52), 5(53), 4(54), 3(55), 2(56), 1(57), 0(58), -1(59), -2(60), -3(61), -4(62), -5(63), -6(64), -7(65), -8(66), -10(68), -9(67)>
goto ; [100.00%]

[local count: 536870913]:
size_69 = size_68(D) + 18446744073709551615;
size_70 = size_69 >> 12;
__asm__("bsrq %1,%q0" : "=r" bitpos_72 : "rm" size_70, "0" -1);
_74 = bitpos_72 + 1;

[local count: 1073741824]:
# _66 = PHI <52(3), 0(4), iftmp.1_67(70), _74(71)>
return _66;

We get summary:

IPA function summary for get_order/303 inlinable
  global time: 8.716289
  self size: 201
  global size: 201
  min size: 4
  self stack: 0
  global stack:0
    size:4.00, time:3.00
    size:3.00, time:2.00, executed if:(not inlined)
    size:4.00, time:2.00, executed if:(op0 not constant)
    size:2.00, time:0.782800, executed if:(op0 != 0)
    size:3.00, time:0.391400, executed if:(op0 > 4095) && (op0 != 0)
    size:2.00, time:0.195700, executed if:(op0 > 4095) && (op0 != 0) && (op0 not constant)
    size:3.00, time:0.173194, executed if:(op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.00, time:0.086597, executed if:(op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.00, time:0.043299, executed if:(op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.00, time:0.021649, executed if:(op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.00, time:0.010825, executed if:(op0,(# + 18446744073709551615),(# & 576460752303423488) == 0) && (op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:168.00, time:0.010825, executed if:(op0,(# + 18446744073709551615),(# & 288230376151711744) == 0) && (op0,(# + 18446744073709551615),(# & 576460752303423488) == 0) && (op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
  calls:
    __builtin_constant_p/4546 function body not available
      freq:0.20 loop depth: 0 size: 0 time: 0 predicate: (op0 > 4095) && (op0 != 0)
      op0 points to local or readonly memory
    __builtin_constant_p/4546 func
[Bug ipa/97445] Some fonctions marked static inline in Linux kernel are not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445 Jan Hubicka changed: What|Removed |Added Component|c |ipa --- Comment #48 from Jan Hubicka --- Changing component to IPA. Concerning comment #37 about summaries not being updated after ipa-cp, I was actually wrong there: they are updated and the behaviour is quite sane. We work out that kmalloc has a constant argument and produce a specialized clone for it. Because the clone is estimated to be quite large, it is not inlined. When ipa-cp is disabled, by contrast, we work out that inlining it will simplify the body a lot and bump up the limits. Jakub, concerning asm volatile ("movl $-1, %eax"): that was of course a hack. I was confused about the bsr instruction; for some time I thought it stores only an 8-bit value, until I re-read the manual. Honza
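As a reading aid for the non-constant path of the get_order GIMPLE quoted earlier in this bug (subtract one, shift right by 12, bsr, add one), here is a portable sketch of the same computation. It is not the kernel's source: `get_order_sketch` and `PAGE_SHIFT_SKETCH` are names chosen here, the page shift of 12 is read off the dump, a 64-bit size is assumed, and the bsr instruction is replaced by a plain loop that counts significant bits.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12  /* assumption: 4 KiB pages, as in the dump */

/* Smallest order such that (1 << order) pages hold `size` bytes.  */
int get_order_sketch(uint64_t size)
{
  if (size == 0)
    return 64 - PAGE_SHIFT_SKETCH;   /* the dump yields 52 for size 0 */

  size = (size - 1) >> PAGE_SHIFT_SKETCH;

  /* Count significant bits; this is what "bsr result + 1" computes,
     with 0 for a zero input.  */
  int order = 0;
  while (size) {
    order++;
    size >>= 1;
  }
  return order;
}
```

For example, sizes 1 through 4096 give order 0, 4097 through 8192 give order 1, and 8193 gives order 2.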
[Bug tree-optimization/97519] New: builtin_constant_p (x + cst) should be optimized to builtin_constant_p (x)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97519
  Bug ID: 97519
  Summary: builtin_constant_p (x + cst) should be optimized to builtin_constant_p (x)
  Product: gcc
  Version: unknown
  Status: UNCONFIRMED
  Severity: normal
  Priority: P3
  Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

As discussed in PR97445 we should optimize builtin_constant_p (var+cst) and similar cases.
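A small sketch of the equivalence the report asks for; the function names are invented for illustration. On literals both queries already fold to 1. For a run-time `x`, the request is that the compiler treat the query on `x + 1` as the query on `x`, so that the two always agree; no particular result for that case is assumed here.

```c
/* Both arguments are integer constant expressions.  */
int constant_p_of_literal(void)
{
  return __builtin_constant_p(5);
}

int constant_p_of_literal_sum(void)
{
  return __builtin_constant_p(5 + 1);
}

/* For a run-time x the two queries are decided independently today;
   the proposed simplification would make this comparison trivially
   true.  */
int queries_agree(int x)
{
  return __builtin_constant_p(x) == __builtin_constant_p(x + 1);
}
```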
[Bug ipa/97445] Some fonctions marked static inline in Linux kernel are not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445 Jan Hubicka changed: What|Removed |Added Depends on||97519, 97503 --- Comment #49 from Jan Hubicka --- Patch posted for the inline heuristics change https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556685.html I also opened a separate PR on builtin_constant_p folding. I am not sure how to implement that correctly (what are the conditions that make this valid; perhaps all "i op cst" after all?). Martin, how does the if chain conversion behave on the example? Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97503 [Bug 97503] Suboptimal use of cntlzw and cntlzd https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97519 [Bug 97519] builtin_constant_p (x + cst) should be optimized to builtin_constant_p (x)
[Bug ipa/97576] [11 Regression] ICE: verify_cgraph_node failed (error: reference to dead statement)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97576 --- Comment #2 from Jan Hubicka --- The problem here is that clone materialization invalidates statement pointers in refs. We clean these at the beginning of late optimization; I guess it should be done on demand during materialization (they are not used past that point, but we do not have a convenient place to clear them).
[Bug c/97578] ice during IPA pass: inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97578 Jan Hubicka changed: What|Removed |Added CC||jakub at redhat dot com, ||mjambor at suse dot cz Component|ipa |c Summary|[11 Regression] ice during |ice during IPA pass: inline |IPA pass: inline| --- Comment #3 from Jan Hubicka --- What hits us here is the hack I needed to introduce into ipa_param_adjustments::modify_call, which triggers materialization to make the debug info code work. In this case redirection happens from tree-inline, and materialization gets us back into tree-inline. The inliner is however not intended to be recursive (it uses bb->aux pointers, and in this case it will use them twice). Martin, Jambor, it would be really great if we did not need to materialize. I do not see how attaching debug info to decls can work if the caller is in one partition and the callee in another. We could also just add a loop walking all such calls and trigger materialization before going to tree-inline to avoid the recursion problem, but IMO debug info will still go missing on the partitioning boundary. Another option would be to avoid the (ab)use of bb->aux and replace it with a vector here.
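The last option (replacing the shared bb->aux field with a vector) can be sketched as follows. Everything here is hypothetical: `struct block` and `walk` are toy stand-ins, not GCC's basic_block or the inliner. The point is only that per-activation scratch storage indexed by block number survives a nested activation, whereas a single aux field per block would be overwritten by it.

```c
#include <stdlib.h>

/* Toy basic block; the shared aux field is deliberately left unused.
   Assumption: every index is in the range [0, n).  */
struct block { int index; void *aux; };

/* Each activation allocates its own table, tags every block with its
   depth, optionally recurses, and then checks that its own tags are
   still intact.  Returns 1 on success, 0 on corruption, -1 on
   allocation failure.  */
int walk(struct block *blocks, int n, int depth)
{
  int *scratch = calloc((size_t)n, sizeof *scratch);
  if (!scratch)
    return -1;

  for (int i = 0; i < n; i++)
    scratch[blocks[i].index] = depth + 1;

  int ok = 1;
  if (depth < 2)
    ok = (walk(blocks, n, depth + 1) == 1);   /* nested activation */

  for (int i = 0; i < n; i++)
    if (scratch[blocks[i].index] != depth + 1)
      ok = 0;

  free(scratch);
  return ok;
}
```

The cost is one allocation per activation instead of reusing a field that already exists in every block.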
[Bug ipa/97576] [11 Regression] ICE: verify_cgraph_node failed (error: reference to dead statement)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97576 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #5 from Jan Hubicka --- Fixed.
[Bug ipa/97593] [11 Regression] ICE in gt_pch_nx, at symbol-summary.h:290 since r11-4329-g67f3791f7d133214
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97593 --- Comment #2 from Jan Hubicka --- Hmm, this is annoying: we cannot store the summary to PCH. I guess we want to collect thunks into a vector and annotate them to the callgraph at finalization time :(
[Bug fortran/97652] New: New pdt14 failure after g:617695cdc2b3d950f1e4deb5ea85d5cc302943f4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97652
  Bug ID: 97652
  Summary: New pdt14 failure after g:617695cdc2b3d950f1e4deb5ea85d5cc302943f4
  Product: gcc
  Version: unknown
  Status: UNCONFIRMED
  Severity: normal
  Priority: P3
  Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

pdt14 is miscompiled with -fipa-modref. This is triggered by handling fnspec, but it seems to only trigger a latent problem. The only disambiguations are:

ipa-modref: call stmt push_8 (&root, &C.4105);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11
ipa-modref: call stmt push_8 (&root, &C.4104);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11
ipa-modref: call stmt push_8 (&root, &C.4103);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11
ipa-modref: call stmt push_8 (&root, &C.4105);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11
ipa-modref: call stmt push_8 (&root, &C.4104);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11
ipa-modref: call stmt push_8 (&root, &C.4103);
ipa-modref: call to push_8/6 does not clobber ref: __vtab_link_module_Pdtlink_8._deallocate alias sets: 12->11

These ought to be safe since __vtab_link_module_Pdtlink_8 is readonly in the testcase. With LTO we detect that variable as such (and the testcase still works without modref and fails differently with modref).
fre3 does quite a lot of additional changes and I am not sure what gets wrong here: __attribute__((externally_visible)) main (integer(kind=4) argc, character(kind=1) * * argv) { + struct array01_unknown cdesc.10; + struct array01_unknown cdesc.9; + real(kind=8) res; + struct Pdtlink_8 * previous; + struct Pdtlink_8 * current; + real(kind=8) res; struct pdtlink_8 * root; static integer(kind=4) options.11[7] = {2150, 4095, 1, 1, 1, 0, 31}; - real(kind=8) _7; - integer(kind=4) _8; - real(kind=8) _9; - integer(kind=4) _10; - real(kind=8) _11; - integer(kind=4) _12; - real(kind=8) _13; - integer(kind=4) _14; + struct Pdtlink_8 * _15; + struct Pdtlink_8 * _17; + struct Pdtlink_8 * _21; + struct Pdtlink_8 * _22; + void (*) () _23; + struct Pdtlink_8 * _25; + void (*) () _26; [local count: 1073741824]: _gfortran_set_args (argc_2(D), argv_3(D)); @@ -1972,52 +2120,75 @@ push_8 (&root, &C.4103); push_8 (&root, &C.4104); push_8 (&root, &C.4105); - _7 = pop_8 (&root); - _8 = (integer(kind=4)) _7; - if (_8 != 3) -goto ; [0.04%] + _15 = MEM[(struct Pdtlink_8 * &)&root]; + if (_15 != 0B) +goto ; [70.00%] else -goto ; [99.96%] +goto ; [30.00%] - [local count: 429496]: - _gfortran_stop_numeric (1, 0); - - [local count: 1073312329]: - _9 = pop_8 (&root); - _10 = (integer(kind=4)) _9; - if (_10 != 2) -goto ; [0.04%] + [local count: 75913541732]: + # current_16 = PHI <_15(2), _17(3)> + # previous_29 = PHI <_15(2), current_16(3)> + _17 = current_16->next; + if (_17 == 0B) +goto ; [0.00%] else -goto ; [99.96%] - - [local count: 429324]: - _gfortran_stop_numeric (2, 0); +goto ; [100.00%] - [local count: 1072883005]: - _11 = pop_8 (&root); - _12 = (integer(kind=4)) _11; - if (_12 != 1) -goto ; [0.04%] + [count: 0]: + res_19 = current_16->n; + _21 = previous_29->next; + if (_21 == 0B) +goto ; [30.00%] else -goto ; [99.96%] +goto ; [70.00%] - [local count: 429152]: - _gfortran_stop_numeric (3, 0); + [count: 0]: + _22 = _15->next; + if (_22 != 0B) +goto ; [70.00%] + else +goto ; [30.00%] - 
[local count: 1072453853]: - _13 = pop_8 (&root); - _14 = (integer(kind=4)) _13; - if (_14 != 0) -goto ; [0.04%] + [count: 0]: + MEM [(struct dtype_type *)&cdesc.9 + 24B] = {}; + cdesc.9.dtype.elem_len = 24; + cdesc.9.dtype.rank = 1; + cdesc.9.dtype.type = 11; + cdesc.9.dim[0].lbound = 1; + cdesc.9.dim[0].stride = 1; + cdesc.9.dim[0].ubound = 1; + cdesc.9.data = _22; + _23 = __vtab_link_module_Pdtlink_8._deallocate; + __builtin_unreachable (); + + [count: 0]: + __builtin_unreachable (); + + [count: 0]: + _25 = _21->next; + if (_25 != 0B) +goto ; [70.00%] else -goto ; [99.96%] +goto ; [30.00%] + + [count: 0]: + MEM [(struct dtype_type *)&cdesc.10 + 24B] = {}; + cdesc.10.dtype.elem_len = 24; + cdesc.10.dtype.rank = 1; + cdesc.10.dtype.type = 11; + cdesc.10.dim[0].lbound = 1; + cdesc.10.dim[0].stride = 1; + cdesc.10.dim[0].ubound = 1; + cdesc.10.data = _25; + _26 = __vtab_link_module_Pdtlink_8._deallocate; + __builtin_unreachable (); - [local count: 428981]: - _gfortran_stop_numeric (4, 0); +
[Bug fortran/97652] New pdt14 failure after g:617695cdc2b3d950f1e4deb5ea85d5cc302943f4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97652 --- Comment #1 from Jan Hubicka --- Actually there is another propagation happening in ipa-cp analysis: --- aa/pdt_14.f03.077i.cp 2020-10-31 09:00:52.809726530 +0100 +++ pdt_14.f03.077i.cp 2020-10-31 09:10:35.204755828 +0100 @@ -10,6 +10,8 @@ Starting walk at: push_8 (&root, &C.4104); instance pointer: &root Outer instance pointer: root offset: 0 (bits) vtbl reference: Function call may change dynamic type:push_8 (&root, &C.4103); +ipa-modref: call stmt push_8 (&root, &C.4103); +ipa-modref: call to push_8/6 does not clobber ref: root alias sets: 14->14 Determining dynamic type for call: push_8 (&root, &C.4104); Starting walk at: push_8 (&root, &C.4104); instance pointer: &C.4104 Outer instance pointer: C.4104 offset: 0 (bits) vtbl reference: @@ -19,6 +21,10 @@ instance pointer: &root Outer instance pointer: root offset: 0 (bits) vtbl reference: Function call may change dynamic type:push_8 (&root, &C.4104); Function call may change dynamic type:push_8 (&root, &C.4103); +ipa-modref: call stmt push_8 (&root, &C.4104); +ipa-modref: call to push_8/6 does not clobber ref: root alias sets: 14->14 +ipa-modref: call stmt push_8 (&root, &C.4103); +ipa-modref: call to push_8/6 does not clobber ref: root alias sets: 14->14 Determining dynamic type for call: push_8 (&root, &C.4105); Starting walk at: push_8 (&root, &C.4105); instance pointer: &C.4105 Outer instance pointer: C.4105 offset: 0 (bits) vtbl reference: @@ -30,6 +36,12 @@ Function call may change dynamic type:push_8 (&root, &C.4105); Function call may change dynamic type:push_8 (&root, &C.4104); Function call may change dynamic type:push_8 (&root, &C.4103); +ipa-modref: call stmt push_8 (&root, &C.4105); +ipa-modref: call to push_8/6 does not clobber ref: root alias sets: 14->14 +ipa-modref: call stmt push_8 (&root, &C.4104); +ipa-modref: call to push_8/6 does not clobber ref: root alias sets: 14->14 +ipa-modref: call stmt push_8 (&root, &C.4103); +ipa-modref: call to 
push_8/6 does not clobber ref: root alias sets: 14->14 Determining dynamic type for call: _3 = pop_8 (&root); Starting walk at: _3 = pop_8 (&root); instance pointer: &root Outer instance pointer: root offset: 0 (bits) vtbl reference: @@ -129,10 +141,14 @@ no arg info callsite ch2701/7 -> pop_8/5 : param 0: UNKNOWN + Aggregate passed by reference: + offset: 0, type: struct pdtlink_8 *, CONST: 0B value: 0x0, mask: 0xfff8 VR [1, -1] callsite ch2701/7 -> push_8/6 : param 0: UNKNOWN + Aggregate passed by reference: + offset: 0, type: struct pdtlink_8 *, CONST: 0B value: 0x0, mask: 0xfff8 VR [1, -1] param 1: CONST: &C.4105 -> 3.0e+0 @@ -140,6 +156,8 @@ Unknown VR callsite ch2701/7 -> push_8/6 : param 0: UNKNOWN + Aggregate passed by reference: + offset: 0, type: struct pdtlink_8 *, CONST: 0B value: 0x0, mask: 0xfff8 VR [1, -1] param 1: CONST: &C.4104 -> 2.0e+0 The jump function is not used for cloning, only triggers inline, but the conclusion seems wrong. push_8 can make root non-0. Root is of type pdtlink_8 so perhaps Frontend produces multiple copies of these. push_8 store is: - Analyzing store: *self_34(D) - Recording base_set=8 ref_set=8 parm=0 so indeed a different alias set than 14 used by ch2701
[Bug middle-end/97672] [11 Regression] gfortran.dg/pdt_14.f03 – runtime: timeout with -O2 (and higher)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97672 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Jan Hubicka --- Duplicate. I added some analysis to the other PR. It is apparently a TBAA issue in the frontend. *** This bug has been marked as a duplicate of bug 97652 ***
[Bug fortran/97652] New pdt14 failure after g:617695cdc2b3d950f1e4deb5ea85d5cc302943f4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97652 Jan Hubicka changed: What|Removed |Added CC||burnus at gcc dot gnu.org --- Comment #2 from Jan Hubicka --- *** Bug 97672 has been marked as a duplicate of this bug. ***
[Bug c/97578] ice during IPA pass: inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97578 --- Comment #8 from Jan Hubicka --- OK, I committed the patch as is, and we can see later whether any memory can be conserved by being more precise. I still think the debug info should not need decls here. Honza
[Bug ipa/97698] [11 Regression] ICE: Segmentation fault (in duplicate_thunk_for_node) since r11-4587-gae7a23a3fab74
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97698 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #3 from Jan Hubicka --- Fixed.
[Bug ipa/97673] [11 Regression] ICE in remap_gimple_stmt, at tree-inline.c:1922 since r11-4267-g0e590b68fa374365
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97673 --- Comment #2 from Jan Hubicka --- This should be a dup of PR97578.
[Bug ipa/97593] [11 Regression] ICE in gt_pch_nx, at symbol-summary.h:290 since r11-4329-g67f3791f7d133214
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97593 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #6 from Jan Hubicka --- Fixed.
[Bug ipa/97300] [11 regression] several test cases fail after r11-3308
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97300 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #2 from Jan Hubicka --- Assumed type failures are fixed now by the Fortran array descriptor TBAA fix. g:40cb3f8ac875c6cf6610a5f93da571cfdd2a1513 If there are other failures, let's open an independent PR for them.
[Bug ipa/97735] New: ipa-prop should handle simple casts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97735 Bug ID: 97735 Summary: ipa-prop should handle simple casts Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: --- Compiling: test (int *a, int size) { __builtin_memset (a, 0, size); } gets: Jump functions: Jump functions of caller __builtin_memset/1: Jump functions of caller test/0: callsite test/0 -> __builtin_memset/1 : param 0: PASS THROUGH: 0, op nop_expr, agg_preserved value: 0x0, mask: 0x Unknown VR param 1: CONST: 0 value: 0x0, mask: 0x0 Unknown VR param 2: UNKNOWN value: 0x0, mask: 0x VR ~[2147483648, -2147483649] I think we should be able to represent that SIZE is passthrough with a conversion.
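The conversion the report refers to can be made visible by writing out the implicit cast. The following is only a sketch: `test_memset` mirrors the report's `test` with the int to size_t conversion spelled out, and `converted_size` is a hypothetical helper added here to show the value that memset actually receives.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the testcase in the report, with the implicit
   int -> size_t conversion of SIZE written out explicitly.  */
void
test_memset (int *a, int size)
{
  size_t len = (size_t) size;	/* the "simple cast" in question */
  memset (a, 0, len);
}

/* Hypothetical helper: the value memset sees for a given SIZE.  */
size_t
converted_size (int size)
{
  return (size_t) size;
}
```

The third argument of memset is therefore a plain function of the caller's parameter, which is what a pass-through jump function with a conversion would express. A negative SIZE wraps around to a very large size_t value, so the conversion is not an identity.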
[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #6 from Jan Hubicka --- I just noticed this PR and wonder if there is anything to do on the inliner side. It uses DECL_DECLARED_INLINE, which was invented to distinguish between implicit inlines and explicit ones. So even if it is a bit misnamed, it should mean "this is an inline hint for the inliner", so I guess the frontend needs to distinguish between constexpr and normal places where the inline hint still means "inline more"? The inliner is really not at a level where it could completely ignore used inline hints without regressing various code. I made inline weaker for -O2 in GCC 10, but for -O3 we still take it very seriously and I do not see a way out of that: in many cases it is very hard to predict how much optimization will happen after inlining, and a lot of code is carefully crafted under the assumption that some specific inline happens (and a lot of such code is in C++).
[Bug lto/80379] Redundant note: code may be misoptimized unless -fno-strict-aliasing is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80379 --- Comment #3 from Jan Hubicka --- The problem here is that the hint is output at decl merging and -fno-strict-aliasing is a function local flag. At that time we do not even know what functions there will be, since units are not streamed in yet. This means that we do not know if some unit has a function that is -fno-strict-aliasing. So suppressing the warning does not fit the implementation very easily :(
[Bug ipa/97757] [11 Regression] fortran save_6.f90 fails with a segv for -flto -O >= 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97757 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #1 from Jan Hubicka --- Indeed, this is obviously garbage collected, which is weird because everything should be reachable via the modref summary (where the THIS pointer is taken). I will try a cross.
[Bug ipa/97766] ipa/modref-2.c fails on 32 bits targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97766 Jan Hubicka changed: What|Removed |Added Last reconfirmed||2020-11-09 Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED --- Comment #1 from Jan Hubicka --- That value is sizeof(double)*8. I picked double since we have a builtin that writes it, and assumed it is 64 bits on all targets. I forgot that it can be 32 bits. We could change it to float. Is float the same size everywhere? If not, we could restrict the test to targets where the size is known.
[Bug middle-end/97775] New: Wrong code with bitfield
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775 Bug ID: 97775 Summary: Wrong code with bitfield Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- The following testcase is reduced from ssd/t2.c #include <stdio.h> void dump (void *p, unsigned int len) { const char digits[17] = "0123456789abcdef"; unsigned char *a = (unsigned char *)p; int i; for (i = 0; i < len; i++) { putchar (' '); putchar (digits[a[i] / 16]); putchar (digits[a[i] % 16]); } } void put (const char s[]) { int i; for (i = 0; s[i]; i++) putchar (s[i]); } void new_line (void) { putchar ('\n'); } struct __attribute__((scalar_storage_order("little-endian"), packed)) R1 { unsigned S1 : 2; unsigned I : 32; unsigned S2 : 2; unsigned A1 : 9; unsigned A2 : 9; unsigned A3 : 9; unsigned B : 1; }; struct R1 My_R1 = { 2, 0x12345678, 1, 0xAB, 0xCD, 0xEF, 1 }; int main (void) { struct R1 Local_R1; Local_R1.B = 1; #ifdef BAD new_line (); #endif /* { dg-output "Local_R1 : e2 59 d1 48 b4 aa d9 bb.*\n" } */ Local_R1.S1 = 0; Local_R1.I = 0; Local_R1.S2 = 0; Local_R1.A1 = 0; Local_R1.A2 = 0; Local_R1.A3 = 0; Local_R1.B = !Local_R1.B; put ("Local_R1 :"); dump (&Local_R1, sizeof (struct R1)); new_line (); /* { dg-output "Local_R1 : e5 59 d1 48 b0 a0 c1 03.*\n" } */ new_line (); return 0; } Defining BAD changes the output < Local_R1 : 00 00 00 00 00 00 00 00 --- > Local_R1 : 00 00 00 00 00 00 00 80 The difference is already in fre1: -Value numbering store Local_R1.B to _3 +Value numbering store Local_R1.B to 1 -RPO tracked 17 values available at 3 locations and 17 lattice elements +RPO tracked 17 values available at 0 locations and 17 lattice elements +Replaced BIT_FIELD_REF with 0 in all uses of _1 = BIT_FIELD_REF ; +Replaced (signed char) _1 with 0 in all uses of _2 = (signed char) _1; +Replaced _2 >= 0 with 1 in all uses of _3 = _2 >= 0; +Deleted redundant store Local_R1.B = _3; +Removing
dead stmt Local_R1.B = _3; +Removing dead stmt _3 = _2 >= 0; +Removing dead stmt _2 = (signed char) _1; +Removing dead stmt _1 = BIT_FIELD_REF ; main () { struct R1 Local_R1; - unsigned char _1; - signed char _2; - _Bool _3; : Local_R1.B = 1; @@ -533,10 +540,6 @@ Local_R1.A1 = 0; Local_R1.A2 = 0; Local_R1.A3 = 0; - _1 = BIT_FIELD_REF ; - _2 = (signed char) _1; - _3 = _2 >= 0; - Local_R1.B = _3; put ("Local_R1 :"); dump (&Local_R1, 8); new_line (); Clearly B should be 0 and not 1.
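The expected value of B can be checked in isolation. This is a simplified sketch, not the reproducer: `struct one_bit` and `toggle_after_set` are hypothetical stand-ins that drop the scalar_storage_order attribute and all other fields, keeping only the set-then-negate sequence from main.

```c
/* Simplified stand-in for struct R1: only the one-bit field matters.  */
struct one_bit
{
  unsigned B : 1;
};

/* Mirrors "Local_R1.B = 1; ... Local_R1.B = !Local_R1.B;" from the
   testcase and returns the final value of the field.  */
unsigned
toggle_after_set (void)
{
  struct one_bit r;
  r.B = 1;
  r.B = !r.B;
  return r.B;
}
```

Since the field holds 1 before the negation, the stored result has to be 0, which is why replacing the store with the constant 1 in fre1 is wrong.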
[Bug middle-end/97775] Wrong code with bitfield
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97775 --- Comment #1 from Jan Hubicka --- Forgot to say, flags to reproduce are: -Os t2.c -fno-tree-sra -fno-ipa-modref
[Bug rtl-optimization/97836] [11 Regression] wrong code at -O1 on x86_64-pc-linux-gnu by r11-5029
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97836 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED CC||hubicka at gcc dot gnu.org --- Comment #3 from Jan Hubicka --- Confirmed. The wrong code happens already in fre1 where we do: main () { int f; int * _1; : _1 = d (&f); __builtin_abort (); } Modref summary for d is: loads: Limits: 32 bases, 16 refs Base 0: alias set 1 Ref 0: alias set 1 Every access stores: Limits: 32 bases, 16 refs Base 0: alias set 1 Ref 0: alias set 1 Every access parm 0 flags: direct noclobber noescape unused for body: d (int * e) { int D.1973; int a.0_1; : a.0_1 = a; if (a.0_1 != 0) goto ; [INV] else goto ; [INV] : a = 0; : return e_10(D); } direct noclobber noescape looks correct to me: the value is only returned, and noescape values are allowed to escape to the return value (per the IRC discussion we had with Richi). I think the problem is with unused, which makes tree-ssa-structalias completely skip the parameter rather than adding it to the return value alias set. I guess we want to specify what unused really means. Indeed the current comment is "Nonzero if the argument is not used by the function." and in this case we would need to have a separate EAF_NOREAD, so the current EAF_UNUSED would be EAF_NOCLOBBER | EAF_NOREAD, or track that internally in ipa-modref. A quick fix is to make the return statement clear the EAF_UNUSED flag.
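A small C sketch of why a parameter that is only returned still matters to the caller. The body of `d` follows the GIMPLE dump above; the caller `store_through_result` is a hypothetical illustration and not the testcase from the PR.

```c
int a;

/* E is never read or written through inside d; it is only returned.  */
int *
d (int *e)
{
  if (a != 0)
    a = 0;
  return e;
}

/* Hypothetical caller: the returned pointer aliases the local F, so
   the store through P must be treated as a store to F.  */
int
store_through_result (void)
{
  int f = 0;
  int *p = d (&f);
  *p = 1;
  return f;
}
```

If points-to analysis drops the parameter entirely because it is flagged unused, the return value no longer appears to point to `f`, and the store through `p` can be wrongly disambiguated from the later read of `f`.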
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 Jan Hubicka changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2020-11-15 Status|UNCONFIRMED |NEW CC||hubicka at gcc dot gnu.org --- Comment #1 from Jan Hubicka --- Confirmed. Reproduces on aarch64 cross for me, not on x86-64 native. Warning is on: #1 0x01343ad5 in maybe_warn_pass_by_reference (stmt=0x732ec558, wlims=...) at ../../gcc/tree-ssa-uninit.c:530 530 tree argbase = maybe_warn_operand (ref, stmt, NULL_TREE, arg, wlims); (gdb) down #0 maybe_warn_operand (ref=..., stmt=0x732ec558, lhs=0x0, rhs=0x755b93f0, wlims=...) at ../../gcc/tree-ssa-uninit.c:434 434 warned = warning_at (location, OPT_Wmaybe_uninitialized, (gdb) p debug_generic_stmt (rhs) D.89878 std::filesystem::__cxx11::recursive_directory_iterator::pop (struct recursive_directory_iterator * const this) { struct error_code ec; struct allocator D.89878; std::__cxx11::basic_string::basic_string<> (&D.89879, iftmp.99_1, &D.89878); D.89878 ={v} {CLOBBER}; and is otherwise unused. Function looks identical with -fno-ipa-modref. std::__cxx11::basic_string::basic_string<> is defined locally and the last parameter (__a) is unused. modref determines flags parm 2 flags: direct noclobber noescape unused That seems all OK to me, so it seems that somehow uninit pass gets more active because of different alias info.
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 --- Comment #2 from Jan Hubicka --- Ok, so the warning is triggering when uninitialized memory is passed to function argument declared as const. This happens here but is false positive since the parameter is not used at all. This may have become worse with EAF analysis since we now optimize the dead code initializing unused parameters in case kill analysis triggers. Following patch fixes it but I do not understand why this does not trigger on x86-64 for me. diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c index f23514395e0..1e074793b02 100644 --- a/gcc/tree-ssa-uninit.c +++ b/gcc/tree-ssa-uninit.c @@ -443,7 +443,7 @@ maybe_warn_operand (ao_ref &ref, gimple *stmt, tree lhs, tree rhs, access implying read access to those objects. */ static void -maybe_warn_pass_by_reference (gimple *stmt, wlimits &wlims) +maybe_warn_pass_by_reference (gcall *stmt, wlimits &wlims) { if (!wlims.wmaybe_uninit) return; @@ -501,6 +501,10 @@ maybe_warn_pass_by_reference (gimple *stmt, wlimits &wlims) && !TYPE_READONLY (TREE_TYPE (argtype))) continue; + /* Ignore args we are not going to read from. */ + if (gimple_call_arg_flags (stmt, argno - 1) & EAF_UNUSED) + continue; + if (save_always_executed && access->mode == access_read_only) /* Attribute read_only arguments imply read access. */ wlims.always_executed = true; @@ -639,8 +643,8 @@ warn_uninitialized_vars (bool wmaybe_uninit) if (gimple_vdef (stmt)) wlims.vdef_cnt++; - if (is_gimple_call (stmt)) - maybe_warn_pass_by_reference (stmt, wlims); + if (gcall *call = dyn_cast (stmt)) + maybe_warn_pass_by_reference (call, wlims); else if (gimple_assign_load_p (stmt) && gimple_has_location (stmt)) {
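The situation being warned about can be sketched in plain C. All names here (`struct tag`, `make_value`, `caller`) are hypothetical; the real case involves a C++ allocator object passed by const reference to a basic_string constructor.

```c
/* Hypothetical stand-in for the allocator argument.  */
struct tag
{
  char pad;
};

/* T is declared const but never used, so no read of *T can happen.  */
int
make_value (int x, const struct tag *t)
{
  (void) t;
  return x + 1;
}

/* T is deliberately left uninitialized.  Only its address is passed,
   and the callee ignores it, so a maybe-uninitialized warning at this
   call would be a false positive.  */
int
caller (int x)
{
  struct tag t;
  return make_value (x, &t);
}
```

This is the pattern the EAF_UNUSED check in the patch is meant to skip: the argument is passed to a const parameter, but the callee is known not to read from it.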
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 Jan Hubicka changed: What|Removed |Added CC||msebor at gcc dot gnu.org --- Comment #3 from Jan Hubicka --- OK, on x86_64 the corresponding warning does not trigger since TYPE_EMPTY_P is true. x86_64 compiler I get: (gdb) p debug_tree (rhstype) constant 8> unit-size constant 1> align:8 warn_if_not_align:0 symtab:0 alias-set 76 canonical-type 0x77624498 fields unit-size align:8 warn_if_not_align:0 symtab:0 alias-set 77 canonical-type 0x7684a888 fields context full-name "class __gnu_cxx::new_allocator" needs-constructor needs-destructor X() X(constX&) this=(X&) n_parents=0 use_template=1 interface-unknown pointer_to_this reference_to_this chain > ignored decl_6 BLK /opt/gcc/test/Build/aarch64-suse-linux/libstdc++-v3/include/bits/allocator.h:116:11 size unit-size align:8 warn_if_not_align:0 offset_align 8 offset bit-offset context chain ignored decl_1 VOID /opt/gcc/test/Build/aarch64-suse-linux/libstdc++-v3/include/bits/allocator.h:129:9 align:1 warn_if_not_align:0 context parms value length:1 elt:0 >>> full-name "template struct std::allocator::rebind" chain >> context full-name "class std::allocator" needs-constructor needs-destructor X() X(constX&) this=(X&) n_parents=1 use_template=3 interface-only pointer_to_this reference_to_this chain > $50 = void (gdb) p rhstype->type_common.empty_flag $51 = 1 while on aarch64 I get: (gdb) p debug_tree (rhstype) constant 8> unit-size constant 1> align:8 warn_if_not_align:0 symtab:0 alias-set 76 canonical-type 0x771ff3f0 fields unit-size align:8 warn_if_not_align:0 symtab:0 alias-set 77 canonical-type 0x766297e0 fields context full-name "class __gnu_cxx::new_allocator" needs-constructor needs-destructor X() X(constX&) this=(X&) n_parents=0 use_template=1 interface-unknown pointer_to_this reference_to_this chain > ignored decl_6 BLK /opt/gcc/test/Build/aarch64-suse-linux/libstdc++-v3/include/bits/allocator.h:116:11 size unit-size align:8 warn_if_not_align:0 offset_align 8 offset 
bit-offset context chain ignored decl_1 VOID /opt/gcc/test/Build/aarch64-suse-linux/libstdc++-v3/include/bits/allocator.h:129:9 align:1 warn_if_not_align:0 context parms value length:1 elt:0 >>> full-name "template struct std::allocator::rebind" chain >> context full-name "class std::allocator" needs-constructor needs-destructor X() X(constX&) this=(X&) n_parents=1 use_template=3 interface-only pointer_to_this reference_to_this chain > $21 = void (gdb) p rhstype->type_common.empty_flag $22 = 0 that is set by 1972 /* Handle empty records as per the x86-64 psABI. */ 1973 TYPE_EMPTY_P (type) = targetm.calls.empty_record_p (type); So I suppose relying on TYPE_EMPTY_P to silence false positives on empty structures is not very portable.
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 --- Comment #4 from Jan Hubicka --- And to explain why the warning does not trigger without modref: it is because we are not able to disambiguate the variable with another function call (since we think it escapes) (gdb) p debug_gimple_stmt (def_stmt) # .MEM_7 = VDEF <.MEM_5> _8 = __cxa_allocate_exception (48); Martin, I think this is much more your area than mine. I will post the patch on silencing the warning on unused args, but I think we should resolve the empty field issue.
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 --- Comment #6 from Jan Hubicka --- I remember that first_field was returning non-NULL (perhaps it is derived from an empty base)? My patch touched nothing in the condition: it just improved the alias analysis. So while previously we thought that the variable could be initialized by the function call _8 = __cxa_allocate_exception (48); now we are able to track it and figure out that it is non-escaping and thus cannot be touched by the call.
[Bug rtl-optimization/97836] [11 Regression] wrong code at -O1 on x86_64-pc-linux-gnu by r11-5029
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97836 --- Comment #5 from Jan Hubicka --- I forgot to attach the PR number, but I committed the quick fix (to prevent wrong code) as g:26285af40f98dfdb809b98b08386073c63b65db1 I will discuss the EAF_UNUSED flag today after teaching.
[Bug ipa/97757] [11 Regression] fortran save_6.f90 fails with a segv for -flto -O >= 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97757 --- Comment #3 from Jan Hubicka --- This is a problem with propagate_in_scc sometimes freeing the summary and losing track of it. It is fixed in https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559116.html
[Bug objc/97854] New: [11 regression] ODR violation in stub-objc.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97854 Bug ID: 97854 Summary: [11 regression] ODR violation in stub-objc.c Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: objc Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- stub-objc provides dummy RID enum which causes ODR violation. This produces warnings with lto-bootstrap: ../../gcc/c-family/c-common.h:63: warning: type ‘rid’ violates the C++ One Definition Rule [-Wodr] 63 | enum rid | ../../gcc/c-family/stub-objc.c:30: note: an enum with different value name is defined in another translation unit 30 | enum rid { DUMMY }; | ../../gcc/c-family/c-common.h:67: note: name ‘RID_STATIC’ differs from name ‘DUMMY’ defined in another translation unit 67 | RID_STATIC = 0, | ../../gcc/c-family/stub-objc.c:30: note: mismatching definition 30 | enum rid { DUMMY }; | ../../gcc/c-family/c-common.h:63: warning: type ‘rid’ violates the C++ One Definition Rule [-Wodr] 63 | enum rid | ../../gcc/c-family/stub-objc.c:30: note: an enum with different value name is defined in another translation unit 30 | enum rid { DUMMY }; | ../../gcc/c-family/c-common.h:67: note: name ‘RID_STATIC’ differs from name ‘DUMMY’ defined in another translation unit 67 | RID_STATIC = 0, | ../../gcc/c-family/stub-objc.c:30: note: mismatching definition 30 | enum rid { DUMMY }; | I think this was introduced in g:9a34a5cce6b50fc3527e7c7ab356808ed435883c
[Bug middle-end/97855] New: [11 regression] Bogus warning locations during lto-bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97855 Bug ID: 97855 Summary: [11 regression] Bogus warning locations during lto-bootstrap Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- For a while now we have been getting odd-looking locations: D.5677.coeffs[0]’../../gcc/calls.c: In function ‘expand_call’: ../../gcc/dojump.c:118:28: warning: may be used uninitialized in this function [-Wmaybe-uninitialized] 118 | pending_stack_adjust = save->x_pending_stack_adjust; |^ D.5677.coeffs[0]’../../gcc/calls.c:4082:34: note: was declared here 4082 | saved_pending_stack_adjust save; | ^ D.5677.coeffs[0]’../../gcc/dojump.c:119:27: warning: may be used uninitialized in this function [-Wmaybe-uninitialized] 119 | stack_pointer_delta = save->x_stack_pointer_delta; | ^ D.5677.coeffs[0]’../../gcc/calls.c:4082:34: note: was declared here 4082 | saved_pending_stack_adjust save; | ^ This is not due to parallel writes; it seems that the location code somehow decides to output the additional D.5677.coeffs[0]’
[Bug middle-end/97840] [11 regression] Bogus -Wmaybe-uninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97840 --- Comment #9 from Jan Hubicka --- Created attachment 49571 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49571&action=edit Warnings building cc1plus with LTO This is the current set of warnings building cc1plus with LTO. There are 66 maybe-uninitialized. I always wondered if we want to print warnings exposing GCC internals like: ../gmp/mpz/../../../gmp/mpz/swap.c:38:3: warning: ‘MEM[(struct __mpz_struct *)&cst]._mp_size’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../gmp/mpz/../../../gmp/mpz/swap.c:37:3: warning: ‘MEM[(struct __mpz_struct *)&cst]._mp_alloc’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../isl/../../isl/isl_tab.c:2940:29: warning: ‘var’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../gmp/mpz/../../../gmp/mpz/swap.c:39:3: warning: ‘MEM[(struct __mpz_struct *)&cst]._mp_d’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../gmp/mpz/../../../gmp/mpz/swap.c:38:3: warning: ‘MEM[(struct __mpz_struct *)&cst]._mp_size’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../gmp/mpz/../../../gmp/mpz/swap.c:37:3: warning: ‘MEM[(struct __mpz_struct *)&cst]._mp_alloc’ may be used uninitialized in this function [-Wmaybe-uninitialized] ../../gcc/machmode.h:546:49: warning: ‘MEM[(struct scalar_int_mode *)&int_mode]’ may be used uninitialized in this function [-Wmaybe-uninitialized] A lot of warnings are about remainder_len in wide-int. There is a loop initializing it, and it seems we do not work out that it has a non-zero number of iterations. ../../gcc/analyzer/store.cc:647:13: warning: ‘MEM[(long int *)&sval_bit_size + 8B]’ may be used uninitialized [-Wmaybe-uninitialized] ../../gcc/analyzer/store.cc:647:13: warning: ‘MEM[(long int *)&sval_bit_size + 16B]’ may be used uninitialized [-Wmaybe-uninitialized] The MEM_REF syntax is not very pretty.
[Bug bootstrap/97857] New: profiledbootstrap broken freeing speculative call summary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97857 Bug ID: 97857 Summary: profiledbootstrap broken freeing speculative call summary Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Configuring with ../configure --with-build-config=bootstrap-lto --enable-checking=release --disable-plugin leads to an ICE building stage feedback libstdc++. This is already with an optimized cc1plus, so it may be a miscompile of cc1plus. 0x8fcd5a crash_signal ../../gcc/toplev.c:330 0x7789c83f ??? /build/glibc-vjB4T1/glibc-2.28/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 0x11fcf44 vec::release() ../../gcc/vec.h:1813 0x11fcf2e auto_vec::~auto_vec() ../../gcc/vec.h:1542 0x11fcf2e speculative_call_summary::~speculative_call_summary() ../../gcc/ipa-profile.c:178 0x11fcf2e object_allocator::remove(speculative_call_summary*) ../../gcc/alloc-pool.h:522 0x11fcf2e call_summary_base::release(speculative_call_summary*) ../../gcc/symbol-summary.h:625 0xd03fbe call_summary::~call_summary() ../../gcc/symbol-summary.h:771 0x11e106f ipa_profile_call_summaries::~ipa_profile_call_summaries() ../../gcc/ipa-profile.c:192 0x11e106f ipa_profile_call_summaries::~ipa_profile_call_summaries() ../../gcc/ipa-profile.c:192 0x11e0cff ipa_profile ../../gcc/ipa-profile.c:1031 0x11e0cff execute ../../gcc/ipa-profile.c:1070
[Bug middle-end/97858] New: [11 regression] Bogus warnings about va_list during profiledbootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97858 Bug ID: 97858 Summary: [11 regression] Bogus warnings about va_list during profiledbootstrap Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- During profiledbootstrap we get the following warnings: ../libcpp/../../libcpp/mkdeps.c: In function ‘munge.constprop’: ../libcpp/../../libcpp/mkdeps.c:176:13: warning: ‘MEM[(struct *)&args].reg_save_area’ may be used uninitialized [-Wmaybe-uninitialized] 176 | str = va_arg (args, const char *); | ^ ../libcpp/../../libcpp/mkdeps.c:120:11: note: ‘MEM[(struct *)&args].reg_save_area’ was declared here 120 | va_list args; | ^ ../libcpp/../../libcpp/mkdeps.c:176:13: warning: ‘MEM[(struct *)&args].overflow_arg_area’ may be used uninitialized in this function [-Wmaybe-uninitialized] 176 | str = va_arg (args, const char *); | ^ ../libcpp/../../libcpp/mkdeps.c:120:11: note: ‘MEM[(struct *)&args].overflow_arg_area’ was declared here 120 | va_list args; | ^ ../libcpp/../../libcpp/mkdeps.c:176:13: warning: ‘MEM[(struct *)&args].gp_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized] 176 | str = va_arg (args, const char *); | ^ ../libcpp/../../libcpp/mkdeps.c:120:11: note: ‘MEM[(struct *)&args].gp_offset’ was declared here 120 | va_list args; | ^ This seems to be due to the conditional initialization of va_list: static const char * munge (const char *str, const char *trail = NULL, ...) { static unsigned alloc; static char *buf; unsigned dst = 0; va_list args; if (trail) va_start (args, trail); but it does not make much sense to me to warn about internals of the va_arg implementation in the first place. They are not user visible.
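The conditional va_start pattern can be reproduced in a few lines of C. This is a sketch under assumptions: `total_length` is a hypothetical function, not the mkdeps.c code, and it uses a NULL-terminated argument list instead of the C++ default argument seen in munge.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Sum the lengths of STR, TRAIL and any further strings up to a NULL
   terminator.  ARGS is started only when TRAIL is non-NULL, and
   va_arg is reached only on that same path.  */
size_t
total_length (const char *str, const char *trail, ...)
{
  va_list args;
  size_t len = strlen (str);

  if (trail)
    va_start (args, trail);

  /* When TRAIL is NULL the loop body and the va_arg step never run,
     so the unstarted ARGS is never read.  */
  for (const char *s = trail; s; s = va_arg (args, const char *))
    len += strlen (s);

  if (trail)
    va_end (args);

  return len;
}
```

Every read of `args` is guarded by the same condition as its initialization, but proving that requires correlating the two tests on `trail`. A maybe-uninitialized analysis that cannot do so would report the fields of the va_list as possibly uninitialized, much like the warnings quoted above.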