https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110586
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14 Regression] 10%         |[14 Regression] 10%
                   |fatigue2 regression on zen  |fatigue2 regression on zen
                   |since                       |since
                   |r14-2369-g3a61ca1b925653    |r14-2369-g3a61ca1b925653
                   |                            |(bad LRA&scheduling)

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Aha, sphinx3 is indeed the same patch. The patch corrects the profile here;
it is an LRA/scheduler interaction that causes the difference.

With older trunk I get:

 Performance counter stats for './b.out':

          28,536.75 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                138      page-faults:u             #    4.836 /sec
    134,747,380,473      cycles:u                  #    4.722 GHz                      (83.33%)
        714,193,718      stalled-cycles-frontend:u #    0.53% frontend cycles idle     (83.33%)
          3,510,378      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.33%)
    243,176,910,654      instructions:u            #    1.80  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.33%)
     13,541,807,472      branches:u                #  474.539 M/sec                    (83.33%)
         13,829,858      branch-misses:u           #    0.10% of all branches          (83.33%)

       28.537620889 seconds time elapsed

       28.536941000 seconds user
        0.000000000 seconds sys

and with current trunk:

 Performance counter stats for './a.out':

           31933.51 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                138      page-faults:u             #    4.321 /sec
       150448312691      cycles:u                  #    4.711 GHz                      (83.33%)
          760763745      stalled-cycles-frontend:u #    0.51% frontend cycles idle     (83.33%)
            1918238      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.33%)
       242823668283      instructions:u            #    1.61  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.34%)
        13541981288      branches:u                #  424.068 M/sec                    (83.34%)
           14583703      branch-misses:u           #    0.11% of all branches          (83.33%)

       31.933986770 seconds time elapsed

       31.933701000 seconds user
        0.000000000 seconds sys

So the instruction and branch counts are the same, but the code executes slower.
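The slowdown is visible directly in the counters. As a quick sanity check on the numbers quoted above (a trivial recomputation of perf's derived metrics, nothing more):

```python
# Recompute IPC from the raw counters in the two perf runs above.
cycles_before, insns_before = 134_747_380_473, 243_176_910_654  # ./b.out, older trunk
cycles_after,  insns_after  = 150_448_312_691, 242_823_668_283  # ./a.out, current trunk

ipc_before = insns_before / cycles_before  # perf reports 1.80 insn per cycle
ipc_after  = insns_after  / cycles_after   # perf reports 1.61 insn per cycle

# Same work (~243G instructions either way), ~12% more cycles after the patch.
print(f"IPC {ipc_before:.2f} -> {ipc_after:.2f}, "
      f"cycle increase {cycles_after / cycles_before - 1:.1%}")
```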
IPC goes down from 1.8 to 1.6. Perf thinks the difference is in
__perdida_m_MOD_generalized_hookes_law.constprop.0:

  27.45%  b.out  b.out      [.] MAIN__
  27.07%  a.out  a.out      [.] MAIN__
  21.72%  a.out  a.out      [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.
  16.60%  b.out  b.out      [.] __perdida_m_MOD_generalized_hookes_law.constprop.0.
   2.22%  a.out  a.out      [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.
   1.64%  b.out  b.out      [.] __perdida_m_MOD_generalized_hookes_law.constprop.1.
   1.55%  b.out  libc.so.6  [.] __memset_avx2_unaligned_erms
   1.54%  a.out  libc.so.6  [.] __memset_avx2_unaligned_erms
   0.06%  a.out  libm.so.6  [.] __sincos_fma
   0.04%  b.out  libm.so.6  [.] __sincos_fma

b.out is before the patch and a.out is after. The difference seems to be a
relocated load. Before the patch:

Percent│
       │    0000000000401860 <__perdida_m_MOD_generalized_hookes_
       │    __perdida_m_MOD_generalized_hookes_law.constprop.0.is
  0.10 │      push     %rbp
  0.02 │      mov      %r8,%rax
       │      vmovddup %xmm0,%xmm5
       │      mov      %rsp,%rbp
  1.22 │      push     %r15
  0.04 │      push     %r14
  0.03 │      push     %r13
  0.09 │      push     %r12
  0.05 │      push     %rbx
  0.03 │      not      %rax
  0.00 │      mov      %rdi,%rbx
       │      and      $0xffffffffffffffe0,%rsp
  1.11 │      mov      %rdx,%r12
       │      sub      $0x180,%rsp
  0.04 │      vmovapd  %xmm5,0x20(%rsp)     <---- this load
  0.04 │      mov      %rax,0x30(%rsp)
  0.02 │      test     %rsi,%rsi
       │    ↓ je       210
       │      mov      %rsi,%rax
       │      mov      %rsi,%r13
  1.16 │      lea      (%rsi,%rsi,1),%r10
  0.01 │      mov      %rsi,%r15
       │      shl      $0x4,%rax
  0.06 │      neg      %r13
       │      lea      (%r10,%rsi,1),%r14
  0.03 │      mov      %rax,0x18(%rsp)
  0.02 │      lea      0x0(,%rsi,8),%rax
       │      mov      %rax,0x10(%rsp)
  1.23 │66:   mov      $0x120,%edx
  0.01 │      xor      %esi,%esi
       │      lea      0x60(%rsp),%rdi
  0.07 │      vmovsd   %xmm1,0x38(%rsp)
  0.03 │      vmovsd   %xmm0,0x40(%rsp)
  0.12 │      mov      %r8,0x48(%rsp)
  0.05 │      mov      %rcx,0x50(%rsp)
  0.06 │      sub      %r12,%r13
  1.16 │      mov      %r10,0x58(%rsp)
  0.04 │    → call     memset@plt

and after the patch:

       │    0000000000401870 <__perdida_m_MOD_generalized_hookes_
       │    __perdida_m_MOD_generalized_hookes_law.constprop.0.is
  0.07 │      push     %rbp
  0.01 │      mov      %r8,%rax
       │      vmovddup %xmm0,%xmm3
       │      mov      %rsp,%rbp
  0.87 │      push     %r15
  0.04 │      push     %r14
  0.02 │      push     %r13
  0.07 │      push     %r12
  0.02 │      push     %rbx
  0.02 │      not      %rax
  0.00 │      mov      %rdi,%rbx
       │      and      $0xffffffffffffffe0,%rsp
  0.87 │      mov      %rdx,%r12
       │      sub      $0x180,%rsp
  0.04 │      mov      %rax,0x58(%rsp)
  0.03 │      test     %rsi,%rsi
       │      je       210
       │      mov      %rsi,%rax
  0.00 │      mov      %rsi,%r13
       │      lea      (%rsi,%rsi,1),%r10
  0.95 │      mov      %rsi,%r15
  0.01 │      shl      $0x4,%rax
       │      neg      %r13
  0.04 │      lea      (%r10,%rsi,1),%r14
  0.04 │      mov      %rax,0x18(%rsp)
  0.01 │      lea      0x0(,%rsi,8),%rax
       │      mov      %rax,0x10(%rsp)
  0.02 │60:   mov      $0x120,%edx
  0.89 │      xor      %esi,%esi
  0.01 │      lea      0x60(%rsp),%rdi
       │      vmovsd   %xmm1,0x20(%rsp)     <---- is now here
  0.08 │      vmovsd   %xmm0,0x28(%rsp)
  0.04 │      mov      %r8,0x30(%rsp)
  0.01 │      mov      %rcx,0x38(%rsp)
  0.05 │      sub      %r12,%r13
       │      mov      %r10,0x50(%rsp)
  1.04 │      vmovapd  %xmm3,0x40(%rsp)

And later, a somewhat different schedule. Before the patch:

  0.12 │      vmovsd       %xmm1,0x108(%rsp)
  1.22 │      vmovsd       %xmm0,0x70(%rsp)
  0.38 │      vmovapd      %xmm4,0xc0(%rsp)
  1.27 │      vmovsd       %xmm0,0xa0(%rsp)
  0.20 │      vmovsd       %xmm1,0x140(%rsp)
  2.41 │      vmovsd       %xmm1,0x178(%rsp)
  2.05 │      vbroadcastsd 0x10(%rcx,%rax,8),%ymm1
  0.10 │      vunpcklpd    %xmm0,%xmm2,%xmm3
       │      vmovsd       %xmm2,0xd0(%rsp)
  0.34 │      vmovapd      %xmm3,0x60(%rsp)
  2.25 │      vunpcklpd    %xmm2,%xmm0,%xmm3
       │      vbroadcastsd -0x8(%rcx,%rdx,8),%ymm2
  0.01 │      vmovapd      %xmm3,0x90(%rsp)
  0.28 │      vbroadcastsd (%rcx,%rdx,8),%ymm3
  0.01 │      vmulpd       0xc0(%rsp),%ymm3,%ymm3
 52.87 │      vmulpd       0xf0(%rsp),%ymm2,%ymm2
  0.06 │      vbroadcastsd (%rcx),%ymm0
       │      vfmadd132pd  0x90(%rsp),%ymm3,%ymm1
  1.77 │      vfmadd132pd  0x60(%rsp),%ymm2,%ymm0
  0.10 │      vmovddup     0x8(%rcx,%rax,8),%xmm2
       │      lea          0x0(%r13,%r12,2),%rax

After:

  0.28 │      vmovsd       %xmm1,0x108(%rsp)
  0.98 │      vmovsd       %xmm0,0x70(%rsp)
  0.04 │      vmovapd      %xmm3,0xc0(%rsp)
  0.99 │      vmovsd       %xmm0,0xa0(%rsp)
  0.26 │      vmovsd       %xmm1,0x140(%rsp)
  1.80 │      vmovsd       %xmm1,0x178(%rsp)
  0.91 │      vbroadcastsd (%rcx,%rdx,8),%ymm3
  0.08 │      vbroadcastsd 0x10(%rcx,%rax,8),%ymm1
  0.07 │      vunpcklpd    %xmm0,%xmm2,%xmm4
  0.02 │      vmovsd       %xmm2,0xd0(%rsp)
  0.93 │      vmulpd       0xc0(%rsp),%ymm3,%ymm3
 42.18 │      vmovapd      %xmm4,0x60(%rsp)
       │      vunpcklpd    %xmm2,%xmm0,%xmm4
       │      vbroadcastsd -0x8(%rcx,%rdx,8),%ymm2
       │      vmulpd       0xf0(%rsp),%ymm2,%ymm2
  0.09 │      vmovapd      %xmm4,0x90(%rsp)
       │      vbroadcastsd (%rcx),%ymm0
       │      vfmadd132pd  0x90(%rsp),%ymm3,%ymm1
 23.48 │      vfmadd132pd  0x60(%rsp),%ymm2,%ymm0
  0.77 │      vmovddup     0x8(%rcx,%rax,8),%xmm2
       │      lea          0x0(%r13,%r12,2),%rax

Perdida is loopless, with only 3 BBs in the optimized dump. With the old build
we get:

<bb 2> [local count: 25581901]:
  _60 = {ISRA.929_118(D), ISRA.929_118(D)};
  offset.162_6 = ~ISRA.928_112(D);
  if (ISRA.925_113(D) != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

<bb 3> [local count: 12790951]:
  _226 = -ISRA.925_113(D);
  _228 = ISRA.925_113(D) * 2;
  _230 = ISRA.925_113(D) * 3;
  _232 = ISRA.925_113(D) * 16;
  _234 = (sizetype) _232;
  _236 = ISRA.925_113(D) * 8;
  _238 = (sizetype) _236;

<bb 4> [local count: 51163802]:
  # iftmp.499_11 = PHI <ISRA.925_113(D)(3), 1(2)>
  # prephitmp_227 = PHI <_226(3), -1(2)>
  # prephitmp_229 = PHI <_228(3), 2(2)>
  # prephitmp_231 = PHI <_230(3), 3(2)>
  # prephitmp_235 = PHI <_234(3), 16(2)>
  # prephitmp_239 = PHI <_238(3), 8(2)>
  offset.166_13 = prephitmp_227 - ISRA.926_115(D);
  generalized_constitutive_tensor = {};
  _17 = .FMA (ISRA.930_119(D), 2.0e+0, ISRA.929_118(D));
  _157 = {ISRA.929_118(D), _17};
  _177 = {_17, ISRA.929_118(D)};

The count of bb 4 should be the same as the count of bb 2, but it is twice as
much. This originally comes from the vectorizer emitting a vectorized epilogue
that never iterates while still giving it a 50% chance of iteration.
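Since bb 4 post-dominates bb 2 here, its count must equal the sum of its two incoming edges, which is exactly bb 2's count. A sketch of that arithmetic (plain numbers from the dump above, not compiler code):

```python
# Profile-count arithmetic for the CFG in the dump above:
# bb 2 branches 50/50 to bb 3 and bb 4, and bb 3 falls through to bb 4.
bb2 = 25_581_901
edge_bb2_bb4 = bb2 // 2      # 50% edge bb 2 -> bb 4
bb3 = bb2 - edge_bb2_bb4     # the other 50%; matches the dump's 12790951
expected_bb4 = edge_bb2_bb4 + bb3

print(expected_bb4)          # equals bb 2's count, 25581901
print(2 * bb2)               # 51163802, the doubled count in the old dump
```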
After the patch this is corrected:

<bb 2> [local count: 25581901]:
  _60 = {ISRA.929_118(D), ISRA.929_118(D)};
  offset.162_6 = ~ISRA.928_112(D);
  if (ISRA.925_113(D) != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

<bb 3> [local count: 12790951]:
  _226 = -ISRA.925_113(D);
  _228 = ISRA.925_113(D) * 2;
  _230 = ISRA.925_113(D) * 3;
  _232 = ISRA.925_113(D) * 16;
  _234 = (sizetype) _232;
  _236 = ISRA.925_113(D) * 8;
  _238 = (sizetype) _236;

<bb 4> [local count: 25581901]:
  # iftmp.499_11 = PHI <ISRA.925_113(D)(3), 1(2)>
  # prephitmp_227 = PHI <_226(3), -1(2)>
  # prephitmp_229 = PHI <_228(3), 2(2)>
  # prephitmp_231 = PHI <_230(3), 3(2)>
  # prephitmp_235 = PHI <_234(3), 16(2)>
  # prephitmp_239 = PHI <_238(3), 8(2)>
  offset.166_13 = prephitmp_227 - ISRA.926_115(D);
  generalized_constitutive_tensor = {};
  _17 = .FMA (ISRA.930_119(D), 2.0e+0, ISRA.929_118(D));
  _157 = {ISRA.929_118(D), _17};
  _177 = {_17, ISRA.929_118(D)};
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor] = _177;
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor + 48B] = _157;
  MEM <vector(2) real(kind=8)> [(real(kind=8) *)&generalized_constitutive_tensor + 96B] = _60;

So it seems the RTL backend gets a worse schedule due to different memory
allocation. The memset is a bit unfortunate here, since it requires a lot of
spilling.
With -minline-all-stringops I get, before the patch:

 Performance counter stats for './b.out':

          27,928.16 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                138      page-faults:u             #    4.941 /sec
    133,992,554,723      cycles:u                  #    4.798 GHz                      (83.33%)
         17,113,198      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.33%)
         10,144,634      stalled-cycles-backend:u  #    0.01% backend cycles idle      (83.33%)
    205,237,551,965      instructions:u            #    1.53  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.33%)
      7,665,052,125      branches:u                #  274.456 M/sec                    (83.34%)
         13,596,346      branch-misses:u           #    0.18% of all branches          (83.34%)

       27.933007797 seconds time elapsed

       27.928356000 seconds user
        0.000000000 seconds sys

and after the patch:

           30791.26 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                138      page-faults:u             #    4.482 /sec
       148093969122      cycles:u                  #    4.810 GHz                      (83.33%)
           13660157      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (83.33%)
             411233      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.33%)
       204951193376      instructions:u            #    1.38  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.33%)
         7664856101      branches:u                #  248.930 M/sec                    (83.33%)
           12960525      branch-misses:u           #    0.17% of all branches          (83.34%)

       30.791579163 seconds time elapsed

       30.791441000 seconds user
        0.000000000 seconds sys

So this may be