[Bug c/50272] New: A case that PRE optimization hurts performance
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50272

           Bug #: 50272
          Summary: A case that PRE optimization hurts performance
   Classification: Unclassified
          Product: gcc
          Version: 4.7.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: c
       AssignedTo: unassig...@gcc.gnu.org
       ReportedBy: jiangning@arm.com

For the following simple test case, PRE optimization hoists the computation
(s != 1) into the default branch of the switch statement, which ultimately
causes very poor code generation. This problem occurs on both x86 and ARM,
and I believe it is a problem for other targets as well.

int f(char *t)
{
    int s = 0;

    while (*t && s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-')
                s = 1;
            break;
        }
        t++;
    }

    return s;
}

Taking x86 as an example, with option "-O2" you get 52 instructions, like
below:

00000000 <f>:
   0:	55                   	push   %ebp
   1:	31 c0                	xor    %eax,%eax
   3:	89 e5                	mov    %esp,%ebp
   5:	57                   	push   %edi
   6:	56                   	push   %esi
   7:	53                   	push   %ebx
   8:	8b 55 08             	mov    0x8(%ebp),%edx
   b:	0f b6 0a             	movzbl (%edx),%ecx
   e:	84 c9                	test   %cl,%cl
  10:	74 50                	je     62
  12:	83 c2 01             	add    $0x1,%edx
  15:	85 c0                	test   %eax,%eax
  17:	75 23                	jne    3c
  19:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi
  20:	0f b6 0a             	movzbl (%edx),%ecx
  23:	84 c9                	test   %cl,%cl
  25:	0f 95 c0             	setne  %al
  28:	89 c7                	mov    %eax,%edi
  2a:	b8 02 00 00 00       	mov    $0x2,%eax
  2f:	89 fb                	mov    %edi,%ebx
  31:	83 c2 01             	add    $0x1,%edx
  34:	84 db                	test   %bl,%bl
  36:	74 2a                	je     62
  38:	85 c0                	test   %eax,%eax
  3a:	74 e4                	je     20
  3c:	83 f8 02             	cmp    $0x2,%eax
  3f:	74 1f                	je     60
  41:	80 f9 2d             	cmp    $0x2d,%cl
  44:	74 22                	je     68
  46:	0f b6 0a             	movzbl (%edx),%ecx
  49:	83 f8 01             	cmp    $0x1,%eax
  4c:	0f 95 c3             	setne  %bl
  4f:	89 df                	mov    %ebx,%edi
  51:	84 c9                	test   %cl,%cl
  53:	0f 95 c3             	setne  %bl
  56:	89 de                	mov    %ebx,%esi
  58:	21 f7                	and    %esi,%edi
  5a:	eb d3                	jmp    2f
  5c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  60:	b0 01                	mov    $0x1,%al
  62:	5b                   	pop    %ebx
  63:	5e                   	pop    %esi
  64:	5f                   	pop    %edi
  65:	5d                   	pop    %ebp
  66:	c3                   	ret
  67:	90                   	nop
  68:	b8 01 00 00 00       	mov    $0x1,%eax
  6d:	5b                   	pop    %ebx
  6e:	5e                   	pop    %esi
  6f:	5f                   	pop    %edi
  70:	5d                   	pop    %ebp
  71:	c3                   	ret

But with command line option "-O2 -fno-tree-pre", only 12 instructions are
generated, and the code is very clean:

00000000 <f>:
   0:	55                   	push   %ebp
   1:	31 c0                	xor    %eax,%eax
   3:	89 e5                	mov    %esp,%ebp
   5:	8b 55 08             	mov    0x8(%ebp),%edx
   8:	80 3a 00             	cmpb   $0x0,(%edx)
   b:	74 0e                	je     1b
   d:	80 7a 01 00          	cmpb   $0x0,0x1(%edx)
  11:	b0 02                	mov    $0x2,%al
  13:	ba 01 00 00 00       	mov    $0x1,%edx
  18:	0f 45 c2             	cmovne %edx,%eax
  1b:	5d                   	pop    %ebp
  1c:	c3                   	ret
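To see where the extra instructions come from, here is a rough source-level
equivalent of the loop after PRE, reconstructed from the -O2 disassembly
above (the function name and the cont flag are invented for illustration;
the real transformation happens on GIMPLE, and details are simplified):

int f_after_pre(char *t)
{
    int s = 0;
    int cont = *t != 0;            /* hoisted "*t && s != 1"; s == 0, so s != 1 is true */

    while (cont) {
        switch (s) {
        case 0:
            s = 2;
            cont = t[1] != 0;      /* "s != 1" folds to true here */
            break;
        case 2:
            s = 1;
            cont = 0;              /* "s != 1" folds to false here */
            break;
        default:
            if (*t == '-')
                s = 1;
            cont = (t[1] != 0) & (s != 1);  /* both recomputed: the setne/setne/and */
            break;
        }
        t++;
    }
    return s;
}

The exit test is partially redundant along the case 0 and case 2 paths, so
PRE splits it per switch arm; only the default arm must recompute both
conditions, but carrying the flag across arms is what drags the extra
register shuffling into every iteration.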
[Bug c/50272] A case that PRE optimization hurts performance
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50272

--- Comment #1 from Jiangning Liu 2011-09-02 05:11:38 UTC ---
Richard gave some analysis at http://gcc.gnu.org/ml/gcc/2011-08/msg00037.html
[Bug rtl-optimization/38644] [4.4/4.5/4.6/4.7 Regression] Optimization flag -O1 -fschedule-insns2 causes wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644

--- Comment #56 from Jiangning Liu 2011-10-31 07:48:25 UTC ---
(In reply to comment #54)
> I tested with GCC 4.6.2 and the patch provided by Mikael Pettersson. It works
> for -march=armv4t and -march=armv5t, but not for -march=armv5te:

Sebastian,

Actually you may try this,

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index aed748c..8269c1a 100755
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -22273,6 +22273,8 @@ thumb1_expand_epilogue (void)
   gcc_assert (amount >= 0);
   if (amount)
     {
+      emit_insn (gen_blockage ());
+
       if (amount < 512)
 	emit_insn (gen_addsi3 (stack_pointer_rtx, stack_pointer_rtx,
 			       GEN_INT (amount)));

Thanks,
-Jiangning
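For context, gen_blockage () emits a volatile insn that acts as a full
scheduling barrier, so nothing may be moved across it. A sketch of the
reordering the barrier rules out (registers and offsets here are
illustrative only, not taken from an actual reproducer):

	@ epilogue as emitted:
	ldr	r0, [r7, #8]	@ reload a value from the live frame
	add	sp, sp, #68	@ then deallocate the frame

	@ what sched2 may produce without the barrier -- there is no
	@ register dependence between the two insns, so it can swap them:
	add	sp, sp, #68	@ frame deallocated first
	ldr	r0, [r7, #8]	@ now reads memory below SP, which an
				@ interrupt arriving in between can clobber

Placing the blockage before the stack adjustment pins every remaining frame
access above the SP add.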
[Bug middle-end/39976] [4.5/4.6/4.7 Regression] Big sixtrack degradation on powerpc 32/64 after revision r146817
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39976

--- Comment #44 from Jiangning Liu 2012-02-24 08:09:25 UTC ---
I'm not sure whether this kind of bug has been completely fixed, so I posted
a question to the mailing list at
http://gcc.gnu.org/ml/gcc/2012-02/msg00415.html .
[Bug tree-optimization/52424] dom prematurely pops entries from const_and_copies stack
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52424

--- Comment #1 from Jiangning Liu 2012-02-29 03:23:46 UTC ---
> I've attached a proposed fix. Jiangning, can you please apply this and see
> if your performance problem is resolved?

Bill,

Confirmed. Your patch works for my big case, and I do see the redundant
copies removed from the final binary code. Benchmark performance improves
accordingly as well, although there might still be other potential problems.

Thanks a lot for your quick patch. Are you going to check it in to trunk
soon for 4.7? It would also be better if you could add a test case.

-Jiangning
[Bug tree-optimization/52424] dom prematurely pops entries from const_and_copies stack
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52424

--- Comment #2 from Jiangning Liu 2012-02-29 03:42:17 UTC ---
> Jiangning Liu reported that the following C code has recently experienced
> degraded performance on trunk. (Jiangning, please fill in the
> Host/Target/Build fields for your configuration, or tell me what they are
> if you don't have access. The problem is not related to specific targets
> in any event.)

It seems I don't have access.

Host: arm*-*-*
Target: arm*-*-*
Build: arm*-*-*
[Bug testsuite/52563] FAIL: gcc.dg/tree-ssa/scev-[3,4].c scan-tree-dump-times optimized "&a" 1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52563

--- Comment #3 from Jiangning Liu 2012-03-13 08:11:40 UTC ---
First, I tried GCC 4.4.3 on x86-64, and it works for this test case, so this
is to some extent a GCC regression.

Second, I tried trunk, and I can reproduce the failure for this test case.
This means there are still some other bugs hidden after my scalar evolution
improvement for addresses of array elements.

Third, loop ivopts should be able to find that &a is the base object of a
selected IV candidate after the analysis in simple_iv. After ivopts, I see
the following dump:

Selected IV set:
candidate 5 (important)
 depends on 1
 var_before i_13
 var_after i_6
 original biv
 type int
 base k_2(D)
 step k_2(D)
candidate 6
 depends on 1
 var_before ivtmp.11_9
 var_after ivtmp.11_1
 incremented before exit test
 type unsigned long
 base (unsigned long) ((int *) &a + (sizetype) k_2(D) * 4)
 step (unsigned long) ((sizetype) k_2(D) * 4)
 base object (void *) &a

So &a is already identified as an IV base object, but somehow only one &a is
hoisted out of the loop, while the other one isn't.

<bb 2>:
  D.1728_15 = (sizetype) k_2(D);
  D.1729_16 = D.1728_15 * 4;
  D.1730_17 = (sizetype) k_2(D);
  D.1731_18 = D.1730_17 * 4;
  D.1732_19 = &a + D.1731_18;
  ivtmp.11_20 = (unsigned long) D.1732_19;

<bb 3>:
  # i_13 = PHI <k_2(D)(2), i_6(3)>
  # ivtmp.11_9 = PHI <ivtmp.11_20(2), ivtmp.11_1(3)>
  a_p.0_4 = (int *) ivtmp.11_9;
  MEM[(int *)&a][i_13] = 100;
  i_6 = i_13 + k_2(D);
  ivtmp.11_1 = ivtmp.11_9 + D.1729_16;
  if (i_6 <= 999)
    goto <bb 3>;
  else
    goto <bb 4>;

This statement:

  MEM[(int *)&a][i_13] = 100;

is expected to look like:

  D.4086_21 = (void *) ivtmp.11_9;
  MEM[base: D.4086_21, offset: 0B] = 100;

after loop ivopts.
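For reference, the loop behind this dump can be reconstructed from the GIMPLE
above (a sketch; the actual scev-3.c source in the testsuite may differ in
details):

int a[1000];
int *a_p;

void f(int k)
{
    int i;

    /* i advances by k, and &a[i] is the address that ivopts turns into
       the pointer IV ivtmp.11 above, stepping by k * 4 per iteration.  */
    for (i = k; i < 1000; i += k) {
        a_p = &a[i];
        *a_p = 100;
    }
}

Both the store to a_p and the MEM access should then be rewritten against the
single pointer IV, leaving exactly one "&a" in the optimized dump for the
scan-tree-dump-times check to count.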
[Bug rtl-optimization/38644] [4.3/4.4/4.5/4.6/4.7 Regression] Optimization flag -O1 -fschedule-insns2 causes wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644

Jiangning Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jiangning.liu at arm dot
                   |                            |com

--- Comment #35 from Jiangning Liu 2011-04-26 15:13:41 UTC ---
I verified that the two patches in #38644 (back end) and #30282 (middle end)
both work for the attached cases. Here are my two cents:

1) The concept of a red zone is to control whether instructions may write
memory below the current stack frame; it is only supported by the ABIs of
some particular ISAs, so it shouldn't be enabled by default in the middle
end for all targets. On this point, the middle end should be fixed to avoid
doing things that are unwanted in general for all targets.

2) The red zone is a concept orthogonal to the prologue/epilogue, so it is
not good to fix this issue in the back end's prologue/epilogue code. That
is, we shouldn't simply fix it in the back end by adding a barrier to
implicitly disable the red zone. Instead, hooks should be abstracted in the
middle end to support it in scheduling dependence analysis (middle-end
code), and a back end like x86-64 should then enable it through those hooks
itself. The key point is that the red zone should be a feature cleanly
supported in the middle end. Exposing this kind of thing to the back end
through hooks improves middle-end code quality and avoids introducing such
bugs into back ends.

This bug has a long history, and it is now, or has at some point been,
exposed on ARM, POWER and x86 (with some option combinations). Fixing it in
the middle end is not only a bug fix but a simple infrastructure
improvement. Given the long duration and the extensive impact on different
targets, I don't see a good reason not to fix it in mainline ASAP.
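To make the distinction in 1) concrete, here is a small illustration
(hypothetical function; my own example, not from the attached cases). On
x86-64 the ABI reserves a 128-byte red zone below %rsp that signal and
interrupt handlers must not touch, so a leaf function may keep locals there
without ever adjusting the stack pointer:

/* Hypothetical leaf function.  At -O2 on x86-64, "tmp" can live in the
   red zone below %rsp with no sub/add of the stack pointer at all.  On
   an ABI with no red zone, such as 32-bit ARM, anything below SP may be
   clobbered by an interrupt, so a frame access scheduled past the SP
   adjustment -- the subject of this PR -- touches unprotected memory.  */
int leaf(int x)
{
    int tmp[4] = { x, x + 1, x + 2, x + 3 };
    return tmp[x & 3];
}

This is why the property belongs to the target ABI: whether the region below
SP is safe is a fact the scheduler's dependence analysis has to be told,
which is exactly what a hook would do.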
[Bug rtl-optimization/38644] [4.4/4.5/4.6/4.7 Regression] Optimization flag -O1 -fschedule-insns2 causes wrong code
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644

--- Comment #41 from Jiangning Liu 2011-08-09 02:04:52 UTC ---
> Yes, this is from the libstdc++ sources (4.6.1 20110627,
> libstdc++-v3/libsupc++/new_opnt.cc). You need a non-EABI ARM variant of GCC
> since this bug manifestation will only show up in the SJLJ version.

I tried it, and my local patch works on this case. As you can see below, it
is fixed:

	add	r0, r0, #12
	bl	_Unwind_SjLj_Unregister
	ldr	r0, [r7, #8]
	mov	sp, r7
	add	sp, sp, #68	@ sp needed for prologue
	pop	{r2, r3, r4, r5}

Thanks,
-Jiangning