[Bug rtl-optimization/56124] Redundant reload for loading from memory
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124

bin.cheng changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

--- Comment #2 from bin.cheng 2013-04-18 09:42:48 UTC ---
Fixed by http://gcc.gnu.org/ml/gcc-cvs/2013-04/msg00399.html
[Bug target/54414] New: ARM:mis-compiled prologue/epilogue on cortex-m0 when optimizing with -Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54414

Bug #: 54414
Summary: ARM:mis-compiled prologue/epilogue on cortex-m0 when optimizing with -Os
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

For the case of pr45070.c as below:

/* PR45070 */
extern void abort(void);

struct packed_ushort {
    unsigned short ucs;
} __attribute__((packed));

struct source {
    int pos, length;
    int flag;
};

static void __attribute__((noinline)) fetch(struct source *p)
{
    p->length = 128;
}

static struct packed_ushort __attribute__((noinline)) next(struct source *p)
{
    struct packed_ushort rv;

    if (p->pos >= p->length) {
        if (p->flag) {
            p->flag = 0;
            fetch(p);
            return next(p);
        }
        p->flag = 1;
        rv.ucs = 0x;
        return rv;
    }
    rv.ucs = 0;
    return rv;
}

int main(void)
{
    struct source s;
    int i;

    s.pos = 0;
    s.length = 0;
    s.flag = 0;

    for (i = 0; i < 16; i++) {
        struct packed_ushort rv = next(&s);
        if ((i == 0 && rv.ucs != 0x)
            || (i > 0 && rv.ucs != 0))
            abort();
    }
    return 0;
}

Compile with the options below:

$ arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os pr45070.c -o pr45070.S

The generated assembly code for function next is like:

next:
        push {r0, r1, r2, r3, r4, lr}
        ldr r2, [r0]
        ldr r3, [r0, #4]
        mov r4, r0
        cmp r2, r3
        blt .L3
        ldr r2, [r0, #8]
        cmp r2, #0
        beq .L4
        mov r3, #0
        str r3, [r0, #8]
        add r0, r0, #4
        bl fetch.isra.0
        mov r0, r4
        bl next
        mov r3, sp
        sxth r0, r0
        strb r0, [r3]
        lsr r0, r0, #8
        strb r0, [r3, #1]
        mov r3, sp
        ldrh r2, [r3]
        b .L6
.L4:
        mov r3, #1
        str r3, [r0, #8]
        neg r2, r3
        b .L6
.L3:
        mov r2, #0
.L6:
        add r3, sp, #12
        strh r2, [r3]
        add r3, sp, #12
        ldrb r0, [r3, #1]
        ldrb r2, [r3]
        lsl r0, r0, #8
        orr r0, r2
        @ sp needed for prologue
        pop {r1, r2, r3, r4, pc}

The pc register is restored with the wrong value.
[Bug target/54414] ARM:mis-compiled prologue/epilogue on cortex-m0 when optimizing with -Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54414 --- Comment #1 from amker.cheng 2012-08-30 10:17:15 UTC --- I suspect that the call of arm_size_return_regs in function thumb1_extra_regs_pushed should also be covered as in http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00830.html
[Bug rtl-optimization/54133] regrename introduces additional dependencies
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #8 from amker.cheng 2012-09-25 07:45:02 UTC ---
I have spent some time investigating this bug and now I think I understand the issue. The problematic instruction patterns, which save/restore argument/return registers, are generated/kept on Thumb1 because the ARM back end defines the target hook TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P. The intention is to keep the live ranges of hardware registers short, so I think it is inappropriate to do the propagation before IRA.

I can only think of fixing this in the following ways:
1. Run an additional cprop_hardreg before register renaming. Of course this is not a clean solution.
2. The postreload pass supports simple CSE by using cselib, so we can do the transformation in postreload.

Currently CSELIB can't detect such cases. The root cause is:
1. Argument registers usually have no initialization; the return register is usually initialized by call_expr.
2. CSELIB uses the first element of the elt_list to define the mode in which the register was set; if the mode is unknown or the value is no longer valid in that mode, ELT will be NULL for the first element.
3. CSELIB creates a first NULL elt_list for argument registers in function "cselib_lookup_1", because such registers have no initialization.
4. CSELIB ignores return registers initialized by call_expr, as in function "cselib_hash_rtx", and then creates a first NULL elt_list for return registers.
5. In function "cselib_reg_set_mode", CSELIB checks whether the first element of elt_list is NULL, which results in argument/return registers not being CSEd.

But I am not sure whether CSELIB can be improved to address this issue.
[Bug target/54989] FAIL: gcc.dg/hoist-register-pressure.c scan-rtl-dump hoist "PRE/HOIST: end of bb .* copying expression" on darwin
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54989

bin.cheng changed:

           What    |Removed                     |Added
                 CC|                            |amker.cheng at gmail dot com

--- Comment #1 from bin.cheng 2012-10-20 05:40:08 UTC ---
The failure is caused by higher register pressure in the THEN branch of the case, though I am not sure why the register pressure is higher than on x86-linux. This can be fixed by simplifying the test case as below:

/* { dg-options "-Os -fdump-rtl-hoist" } */
/* { dg-final { scan-rtl-dump "PRE/HOIST: end of bb .* copying expression" "hoist" } } */

#define BUF 100
int a[BUF];

void com (int);
void bar (int);

int foo (int x, int y, int z)
{
  /* "x+y" won't be hoisted if "-fira-hoist-pressure" is disabled,
     because its rtx_cost is too small.  */
  if (z)
    {
      a[1] = a[0];
      a[2] = a[1];
      a[3] = a[2];
      a[4] = a[3];
      a[5] = a[4];
      a[6] = a[5];
      a[7] = a[6];
      com (x+y);
    }
  else
    {
      bar (x+y);
    }

  return 0;
}

I will send a patch fixing this.
[Bug other/55031] New: Documentation on RTL GCSE pass is outdated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55031 Bug #: 55031 Summary: Documentation on RTL GCSE pass is outdated Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: other AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com Quoting from GCCINT, section "9.5 RTL passes": "When optimizing for size, GCSE is done using Morel-Renvoise Partial Redundancy Elimination, with the exception that it does not try to move invariants out of loops—that is left to the loop optimization pass. If MR PRE GCSE is done, code hoisting (aka unification) is also done, as well as load motion." While the pass gate function is as below: static bool gate_rtl_pre (void) { return optimize > 0 && flag_gcse && !cfun->calls_setjmp && optimize_function_for_speed_p (cfun) && dbg_cnt (pre); } This conflicts with the documentation, which says Morel-Renvoise PRE will be used when optimizing for size. I think the document is outdated.
[Bug target/54989] FAIL: gcc.dg/hoist-register-pressure.c scan-rtl-dump hoist "PRE/HOIST: end of bb .* copying expression" on darwin
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54989 --- Comment #7 from bin.cheng 2012-10-31 08:45:37 UTC --- I think this is fixed and it's a bug in 4.8.0. Hi Jack, could you verify that it is fixed? Thanks very much.
[Bug rtl-optimization/57540] New: stack pointer related loop invariants after reload
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

Bug ID: 57540
Summary: stack pointer related loop invariants after reload
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: amker.cheng at gmail dot com

For the program below,

void foo ( unsigned char *len, int alphaSize, int maxLen )
{
  int i, j, k;
  unsigned char tooLong;
  int parent [ 258 * 2 ];

  parent[0] = -2;
  tooLong = 0;
  for (i = 1; i <= alphaSize; i++) {
    j = 0;
    k = i;
    while (parent[k] >= 0) { k = parent[k]; j++; }
    len[i-1] = j;
    if (j > maxLen) tooLong = 1;
  }
}

Compile with the command line,

arm-linux-gnueabihf-gcc -S -O2 -marm -mcpu=cortex-a15 -o foo.S -xc foo.E

The generated code is like,

        .cpu cortex-a15
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu vfpv3-d16
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file "foo.E"
        .text
        .align 2
        .global foo
        .type foo, %function
foo:
        @ args = 0, pretend = 0, frame = 2064
        @ frame_needed = 0, uses_anonymous_args = 0
        str lr, [sp, #-4]!
        sub sp, sp, #2064
        mvn r3, #1
        sub sp, sp, #4
        cmp r1, #0
        str r3, [sp]
        ble .L1
        mov ip, sp
        add r1, r0, r1
.L6:
        ldr r3, [ip, #4]!
        mov r2, #0
        cmp r3, #0
        blt .L3
.L5:
        add lr, sp, #2064       @ loop invariant
        add r2, r2, #1
        add r3, lr, r3, asl #2
        ldr r3, [r3, #-2064]
        cmp r3, #0
        bge .L5
        uxtb r2, r2
.L3:
        strb r2, [r0], #1
        cmp r0, r1
        bne .L6
.L1:
        add sp, sp, #2064
        add sp, sp, #4
        @ sp needed
        ldr pc, [sp], #4
        .size foo, .-foo
        .ident "GCC: (GNU) 4.9.0 20130524 (experimental)"
        .section .note.GNU-stack,"",%progbits

Apparently, the first instruction in basic block .L5 is invariant, but it is kept in the loop because it is generated by reload. I think this is a common issue.
[Bug rtl-optimization/57540] stack pointer related loop invariants after reload
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

--- Comment #1 from bin.cheng ---
The dump of loop_init is like,

   72: r178:SI=0
  106: L106:
   90: NOTE_INSN_BASIC_BLOCK 6
   91: r178:SI=r178:SI+0x1
   94: r190:SI=r177:SI<<0x2
      REG_DEAD r177:SI
   95: r191:SI=sfp:SI+r190:SI
      REG_DEAD r190:SI
   96: r192:SI=r191:SI-0x810
      REG_DEAD r191:SI
      REG_DEAD r189:SI
   97: r177:SI=[r192:SI]
      REG_DEAD r192:SI
   98: cc:CC=cmp(r177:SI,0)
   99: pc={(cc:CC>=0)?L104:pc}
      REG_DEAD cc:CC
      REG_BR_PROB 0x238c

Instructions 95/96 should be re-factored as below:

   95: r191:SI=sfp:SI-0x810
      REG_DEAD r190:SI
   96: r192:SI=r191:SI+r190:SI
      REG_DEAD r191:SI
      REG_DEAD r189:SI

Thus instruction 95 is loop invariant and can be hoisted. For the arm target, the loop can be simplified into:

        blt .L3
.L5:
        add r2, r2, #1
        ldr r3, [sp, r3, asl #2]
        cmp r3, #0
        bge .L5
        uxtb r2, r2
[Bug target/57540] stack pointer related loop invariants after reload
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

bin.cheng changed:

           What    |Removed                     |Added
          Component|rtl-optimization            |target

--- Comment #2 from bin.cheng ---
This only happens in ARM mode. For the gimple below,

  k_8 = parent[k_29];

in ARM mode GCC expands it into,

   81: r180:SI=0xf7f0
   82: zero_extract(r180:SI,0x10,0x10)=0x
   83: r181:SI=r165:SI<<0x2
   84: r182:SI=r105:SI+r181:SI
   85: r183:SI=r182:SI+r180:SI
   86: r165:SI=[r183:SI]

while in Thumb2 mode GCC expands it into,

   88: r185:SI=r105:SI
   89: r186:SI=r105:SI-0x810
   90: r171:SI=[r171:SI*0x4+r186:SI]

thus resulting in much better assembly code,

.L5:
        ldr r3, [sp, r3, lsl #2]
        adds r2, r2, #1
        cmp r3, #0
        bge .L5
        uxtb r2, r2
[Bug target/57540] stack pointer related loop invariants after reload for ARM mode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

--- Comment #3 from bin.cheng ---
I think this should be handled in expand. During expansion, GCC tries the "base + scaled_offset + offset" pattern, which is invalid for targets like ARM. At this point we still have a chance to refactor "base + offset" and force it into a register, thus generating "reg + scaled_offset". By doing this, 1) "base + offset" can be kept as a loop invariant; 2) the multiplication is done by the scaled address, saving another add instruction. I am testing a patch and will send it for review once it passes tests.
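
The reassociation proposed in this comment can be pictured with plain integer arithmetic. What follows is only a sketch under stated assumptions, not GCC code: addresses are modelled as long values, FRAME_OFFSET stands in for the 0x810 seen in the dumps, the element size is assumed to be 4 bytes, and all function names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical constant standing in for the frame offset (0x810 in the dumps). */
enum { FRAME_OFFSET = 0x810 };

/* "base + scaled_offset + offset" evaluated left to right:
   two dependent adds for every access. */
long addr_unrefactored(long base, long index)
{
    long tmp = base + index * 4;   /* base + scaled_offset */
    return tmp - FRAME_OFFSET;     /* ... + offset */
}

/* "base + offset" does not depend on the index, so it can be
   computed once, outside the loop. */
long addr_invariant_part(long base)
{
    return base - FRAME_OFFSET;
}

/* "reg + scaled_offset": a single add per access, which a scaled
   addressing mode can absorb into the load itself. */
long addr_refactored(long invariant, long index)
{
    return invariant + index * 4;
}

/* Both forms must name the same address for every index. */
void check_same_address(long base, long count)
{
    long invariant = addr_invariant_part(base);
    for (long i = 0; i < count; i++)
        assert(addr_unrefactored(base, i) == addr_refactored(invariant, i));
}
```

Calling check_same_address(0x2000, 516) walks every index of a 516-element array and confirms that the two groupings agree, so the refactoring only changes which part is loop invariant, not the address that is loaded.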
[Bug target/56102] Wrong rtx cost calculated for Thumb1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102

bin.cheng changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from bin.cheng ---
Yes, it's fixed by that checkin.
[Bug target/57540] stack pointer related loop invariants after reload for ARM mode
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57540

bin.cheng changed:

           What    |Removed                     |Added
          Component|middle-end                  |target

--- Comment #4 from bin.cheng ---
Sorry, according to http://gcc.gnu.org/ml/gcc-patches/2013-06/msg00932.html, it seems this should be fixed in the backend. I will fix this in arm_legitimize_address, so I am changing this entry to TARGET.
[Bug target/58423] New: [ARM]ICE with shrink-wrap-sibcall.c on a15/neon/hard
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58423

Bug ID: 58423
Summary: [ARM]ICE with shrink-wrap-sibcall.c on a15/neon/hard
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amker.cheng at gmail dot com

GCC ICEs with shrink-wrap-sibcall.c on a15 with the command line below:

./arm-none-eabi-gcc -O2 -marm -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=hard shrink-wrap-sibcall.c -S -o shrink-wrap-sibcall.S -fno-diagnostics-show-caret -fdiagnostics-color=never -O2 -g

The ICE message is:

shrink-wrap-sibcall.c: In function 'baz':
shrink-wrap-sibcall.c:26:1: internal compiler error: in maybe_record_trace_start, at dwarf2cfi.c:2218
0x82bfe41 maybe_record_trace_start
        ../../gcc/gcc/dwarf2cfi.c:2218
0x82c22f2 scan_trace
        ../../gcc/gcc/dwarf2cfi.c:2395
0x82c2a25 create_cfi_notes
        ../../gcc/gcc/dwarf2cfi.c:2549
0x82c2a25 execute_dwarf2_frame
        ../../gcc/gcc/dwarf2cfi.c:2904
0x82c2a25 execute
        ../../gcc/gcc/dwarf2cfi.c:3400
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

GCC is at revision r202599 and the ICE relates to a15/neon/hard-abi, no matter how it is configured for arm.
[Bug target/58424] New: [ARM]gcc.target/arm/pr42575.c failed on arm
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58424

Bug ID: 58424
Summary: [ARM]gcc.target/arm/pr42575.c failed on arm
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: amker.cheng at gmail dot com

gcc is at revision r202599 and is configured as:

../gcc/configure --build=i686-linux-gnu --host=i686-linux-gnu --target=arm-none-eabi --prefix=.../trunk-orig/target/ --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-sysroot=.../trunk-orig/target/arm-none-eabi --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --with-mode=thumb --with-arch=armv7-m --disable-multilib --enable-lto --enable-languages=c,c++,lto

The source code is:

/* { dg-options "-O2" } */
/* Make sure RA does good job allocating registers and avoids
   unnecessary moves. */
/* { dg-final { scan-assembler-not "mov" } } */

long long longfunc(long long x, long long y)
{
  return x * y;
}

The generated assembly is:

longfunc:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mul r3, r0, r3
        push {r4, r5}
        mla r1, r2, r1, r3
        umull r4, r5, r0, r2
        add r5, r5, r1
        mov r0, r4
        mov r1, r5
        pop {r4, r5}
        bx lr
        .size longfunc, .-longfunc

But I think the case would fail for other configurations too.
[Bug rtl-optimization/50663] New: conditional propagation missed in cprop.c pass
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50663

Bug #: 50663
Summary: conditional propagation missed in cprop.c pass
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

For the following test case:

extern int g;
int main(int a, int b)
{
    if (a == 1)
    {
        b = a;
    }
    g = b;
    return 0;
}

the relevant piece of the dump file for the cprop1 pass is like:

(insn 8 4 9 2 (set (reg:CC 24 cc)
        (compare:CC (reg/v:SI 135 [ a ])
            (const_int 1 [0x1]))) test.c:4 200 {*arm_cmpsi_insn}
     (nil))

(jump_insn 9 8 10 2 (set (pc)
        (if_then_else (ne (reg:CC 24 cc)
                (const_int 0 [0]))
            (label_ref 11)
            (pc))) test.c:4 212 {*arm_cond_branch}
     (expr_list:REG_DEAD (reg:CC 24 cc)
        (expr_list:REG_BR_PROB (const_int 6218 [0x184a])
            (nil)))
 -> 11)

(note 10 9 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 5 10 11 3 (set (reg/v:SI 136 [ b ])
        (reg/v:SI 135 [ a ])) test.c:4 696 {*thumb2_movsi_insn}
     (expr_list:REG_DEAD (reg/v:SI 135 [ a ])
        (expr_list:REG_EQUAL (const_int 1 [0x1])
            (nil))))

The r135 in insn_5 should be handled by conditional propagation, like:

(note 10 9 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 5 10 11 3 (set (reg/v:SI 136 [ b ])
        (const_int 1 [0x1])) test.c:4 709 {*thumb2_movsi_insn}
     (expr_list:REG_DEAD (reg/v:SI 135 [ a ])
        (expr_list:REG_EQUAL (const_int 1 [0x1])
            (nil))))

It seems cprop misses the conditional propagation for the branch basic block.

FYI, I compiled the test case with the command:

./arm-none-eabi-gcc -march=armv7-m -mthumb -O2 -S test.c -o test.S -da

The gcc is configured for arm-none-eabi and it's on trunk.
[Bug rtl-optimization/50663] conditional propagation missed in cprop.c pass
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50663

--- Comment #1 from amker.cheng 2011-10-08 10:25:04 UTC ---
Here is the cause: although the cprop.c pass collects the implicit_set information, it is recorded as local info of the basic block, and cprop only does global propagation. As a result, such conditional constant propagation opportunities are missed. The whole process in the cprop pass is like:

bb0: if (x) then bb1 else bb2 end

1. implicit_set from the preceding bb0 is tagged as local in bb1;
2. in compute_local_properties, the implicit_set is recorded in avloc[bb1];
3. in compute_cprop_available, the implicit_set is only recorded in avout[bb1], not in avin[bb1], where it should be;
4. in cprop_insn and find_avail_set, only info recorded in avin[bb1] is considered when trying to do propagation for bb1.

Well, I believe it is a small problem: since implicit_set is recorded in avout[bb1], the basic block bb1 is the only one that gets missed in propagation. I'm working on a patch and will send it for review later.
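
The four steps above can be mimicked with a toy availability model. This is a sketch only: the names avloc, avin and avout are borrowed from the comment, but the struct, the single-bit encoding of the implicit set, and the "on entry" variant are assumptions made for illustration and are not the cprop.c implementation or the eventual patch.

```c
#include <assert.h>

/* Toy model: one bit per recorded set; bit 0 is the implicit set
   implied by the branch condition in bb0. */
typedef unsigned int bitset;

enum { IMPLICIT_SET = 1u << 0 };

struct block_info {
    bitset avloc;  /* sets recorded as local to the block */
    bitset kill;   /* sets invalidated inside the block */
    bitset avin;   /* sets available on entry to the block */
    bitset avout;  /* sets available on exit from the block */
};

/* Usual availability transfer function: a set recorded in avloc
   reaches avout, but never avin of the same block. */
void transfer(struct block_info *bb)
{
    assert(bb != 0);
    bb->avout = bb->avloc | (bb->avin & ~bb->kill);
}

/* Behaviour as described in the comment: the implicit set is
   tagged as local to bb1. */
struct block_info model_local(void)
{
    struct block_info bb1 = {0, 0, 0, 0};
    bb1.avloc = IMPLICIT_SET;
    transfer(&bb1);
    return bb1;
}

/* Hypothetical alternative: treat the implicit set as available
   on entry to bb1. */
struct block_info model_on_entry(void)
{
    struct block_info bb1 = {0, 0, 0, 0};
    bb1.avin = IMPLICIT_SET;
    transfer(&bb1);
    return bb1;
}

/* Propagation inside a block only consults avin (step 4). */
int can_propagate_in_block(const struct block_info *bb, bitset set)
{
    return (bb->avin & set) != 0;
}
```

With model_local() the implicit set shows up in avout but can_propagate_in_block() rejects it for bb1 itself, which matches the observation that bb1 is the only block left out; with model_on_entry() both bb1 and its successors see it.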
[Bug rtl-optimization/44025] Multiple load 0 to register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025

--- Comment #4 from amker.cheng 2011-11-02 06:03:56 UTC ---
I noticed that for the attached reduced test case "reduced_test.c", the cse pass can eliminate such redundant load-constant instructions. But since cse works on extended basic blocks rather than globally, it can do nothing for the original case.

The questions are:
1. why pre does not do such optimization;
2. if pre does do the work, the live range of r0 is surely extended, which might harm register allocation.

I also found regcprop.c, which is a peephole pass that eliminates redundant register moves. It should be able to handle redundant constant load insns if we:
a) extend it in a value-numbering way, at least for these constant values;
b) extend it in a global data analysis way.

Such a change might also impact the scheduling pass, and I am not sure how much benefit there is for common code.
[Bug rtl-optimization/44025] Multiple load 0 to register
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025 --- Comment #5 from amker.cheng 2011-11-02 06:05:23 UTC --- Created attachment 25687 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25687 reduced test case which can be handled by cse pass
[Bug rtl-optimization/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804 --- Comment #6 from amker.cheng 2012-05-15 02:15:59 UTC --- No regression reported in trunk so far, I back ported it into 4.7 branch.
[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867 --- Comment #5 from amker.cheng 2012-06-18 02:03:21 UTC --- Should be fixed.
[Bug middle-end/53922] New: VRP: semantic conflict between range_includes_zero_p and value_inside_range
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

Bug #: 53922
Summary: VRP: semantic conflict between range_includes_zero_p and value_inside_range
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

In tree-vrp.c, function value_inside_range returns:
  1 if VAL is inside value range VR (VR->MIN <= VAL <= VR->MAX),
  0 if VAL is not inside VR,
  -2 if we cannot tell either way.

While function range_includes_zero_p does:
  return (value_inside_range (zero, vr) == 1);
which is bogus, because when value_inside_range returns -2, there is the possibility that the value range includes zero.

For example:

int x(int a)
{
    return a;
}
int y(int a) __attribute__ ((weak));

int (*scan_func)(int);
extern int g;
int g = 0;
int main()
{
    if (g)
        scan_func = x;
    else
        scan_func = y;

    if (scan_func)
        g = scan_func(10);

    return 0;
}

compiled with the command line:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m3 -Os -S test.c -o test.S -fdump-tree-all

The dump of the vrp2 pass is:

main ()
{
  int (*) (int) cstore.6;
  int g.2;
  int g.0;

:
  g.0_1 = g;
  if (g.0_1 != 0)
    goto ;
  else
    goto ;

:

:
  # cstore.6_9 = PHI
  scan_func = cstore.6_9;
  g.2_4 = cstore.6_9 (10);
  g = g.2_4;
  return 0;

}

Though the problem shows up with this case on the gcc 4.6 branch with the -Os option on arm, I think it exists in 4.7/4.8 too, just concealed by different gimple statements.

I will work out a patch for this.
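
The mismatch between the two return conventions can be reproduced in a few lines. The following is a toy sketch, not tree-vrp.c: struct toy_range, its "known" flag (standing in for a bound that cannot be compared, such as the address of a weak symbol), and the conservative variant are all assumptions made for illustration.

```c
#include <assert.h>

/* Toy stand-ins for the three documented return values;
   -2 means "cannot tell either way". */
enum { NOT_INSIDE = 0, INSIDE = 1, UNKNOWN = -2 };

/* Hypothetical range: known == 0 models bounds that cannot be
   compared at compile time. */
struct toy_range {
    int known;
    long min, max;
};

int toy_value_inside_range(long val, const struct toy_range *vr)
{
    assert(vr != 0);
    if (!vr->known)
        return UNKNOWN;
    return (vr->min <= val && val <= vr->max) ? INSIDE : NOT_INSIDE;
}

/* Same shape as the reported predicate: UNKNOWN is folded into
   "does not include zero". */
int includes_zero_reported(const struct toy_range *vr)
{
    return toy_value_inside_range(0, vr) == INSIDE;
}

/* Conservative variant (an assumption, not the actual fix):
   only a definite NOT_INSIDE rules zero out. */
int may_include_zero(const struct toy_range *vr)
{
    return toy_value_inside_range(0, vr) != NOT_INSIDE;
}
```

For a range whose bounds are unknown, includes_zero_reported() answers 0 while may_include_zero() answers 1; a caller that negates the first answer therefore concludes "non-zero" exactly in the case where nothing is known.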
[Bug middle-end/53922] VRP: semantic conflict between range_includes_zero_p and value_inside_range
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

--- Comment #2 from amker.cheng 2012-07-11 08:03:11 UTC ---
Yes, the dump before pass vrp2 is like:

main ()
{
  int (*) (int) cstore.6;
  int g.2;
  int g.0;

:
  g.0_1 = g;
  if (g.0_1 != 0)
    goto ;
  else
    goto ;

:

:
  # cstore.6_9 = PHI
  scan_func = cstore.6_9;
  if (cstore.6_9 != 0B)
    goto ;
  else
    goto ;

:
  g.2_4 = cstore.6_9 (10);
  g = g.2_4;

:
  return 0;

}

gcc parses "# cstore.6_9 = PHI " and asserts that cstore.6_9 is non-zero, then folds the predicate cstore.6_9 != 0B to 1, which is wrong, because the weak symbol y could be zero.
[Bug middle-end/53922] VRP: semantic conflict between range_includes_zero_p and value_inside_range
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53922

--- Comment #3 from amker.cheng 2012-07-11 10:12:24 UTC ---
vrp processes the PHI node " # cstore.6_9 = PHI " in the calling sequence:

vrp_visit_phi_node -> vrp_meet

When gcc gives up in function vrp_meet, it executes the following code to derive an anti-range against zero:

give_up:
  /* Failed to find an efficient meet. Before giving up and setting
     the result to VARYING, see if we can at least derive a useful
     anti-range. FIXME, all this nonsense about distinguishing
     anti-ranges from ranges is necessary because of the odd
     semantics of range_includes_zero_p and friends. */
  if (!symbolic_range_p (vr0)
      && ((vr0->type == VR_RANGE && !range_includes_zero_p (vr0))
          || (vr0->type == VR_ANTI_RANGE && range_includes_zero_p (vr0)))
      && !symbolic_range_p (vr1)
      && ((vr1->type == VR_RANGE && !range_includes_zero_p (vr1))
          || (vr1->type == VR_ANTI_RANGE && range_includes_zero_p (vr1))))
    {
      set_value_range_to_nonnull (vr0, TREE_TYPE (vr0->min));

      /* Since this meet operation did not result from the meeting of
         two equivalent names, VR0 cannot have any equivalences. */
      if (vr0->equiv)
        bitmap_clear (vr0->equiv);
    }

Here vr0 is for "x" in the source code, while vr1 is for "y" in the source code, which is a weak symbol. Function range_includes_zero_p checks whether vr1 includes zero by calling value_inside_range. value_inside_range works well by returning -2, because of the WEAK symbol. After that, range_includes_zero_p checks whether the return value of value_inside_range equals 1.

Finally, in vrp_meet, the condition "((vr1->type == VR_RANGE && !range_includes_zero_p (vr1))" holds, resulting in gcc asserting that cstore.6_9 is non-zero.

Am I missing something?
[Bug rtl-optimization/54133] New: regrename introduces additional dependencies
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133 Bug #: 54133 Summary: regrename introduces additional dependencies Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: rtl-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com With test program below: typedef struct { double X, Y; } Point; typedef struct { Point p1; Point c1; Point c2; Point p2; } Curve; double bar(double t, double p0, double p1, double p2, double p3); void foo( Curve *curve, int count ) { int n; int step; Point point; Curve c0; double t; for ( n = 0; n < count; ++n ) { c0 = curve[n]; for ( step = 0; step < (10); ++step ) { t = ((double)(step)) / (double)(10); point.X = bar( t, c0.p1.X, c0.c1.X, c0.c2.X, c0.p2.X ); point.Y = bar( t, c0.p1.Y, c0.c1.Y, c0.c2.Y, c0.p2.Y ); } } } Compiled with command line: arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os -frename-registers -S The dump before and after regrenaming are like: 1. before regrename: (insn 157 80 158 4 (set (reg:SI 4 r4 [180]) (reg:SI 0 r0)) ../office_pointio.E:29 187 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg:SI 0 r0) (nil))) (insn 158 157 147 4 (set (reg:SI 5 r5 [+4 ]) (reg:SI 1 r1 [+4 ])) ../office_pointio.E:29 187 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg:SI 1 r1 [+4 ]) (nil))) (insn 147 158 83 4 (set (reg:DF 2 r2) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 40 [0x28])) [6 %sfp+-56 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 83 147 148 4 (set (mem:DF (reg/f:SI 13 sp) [0 S8 A64]) (reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn} (expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 148 83 84 4 (set (reg:DF 2 r2) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 56 [0x38])) [6 %sfp+-40 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 84 148 149 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp) (const_int 8 [0x8])) [0 S8 A64]) (reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn} 
(expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 149 84 85 4 (set (reg:DF 2 r2) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 72 [0x48])) [6 %sfp+-24 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 85 149 159 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp) (const_int 16 [0x10])) [0 S8 A64]) (reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn} (expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 159 85 160 4 (set (reg:SI 0 r0) (reg:SI 4 r4 [180])) ../office_pointio.E:30 187 {*thumb1_movsi_insn} (nil)) (insn 160 159 87 4 (set (reg:SI 1 r1 [+4 ]) (reg:SI 5 r5 [+4 ])) ../office_pointio.E:30 187 {*thumb1_movsi_insn} (nil)) 2. after regrename: (insn 157 80 158 4 (set (reg:SI 4 r4 [180]) (reg:SI 0 r0)) ../office_pointio.E:29 187 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg:SI 0 r0) (nil))) (insn 158 157 147 4 (set (reg:SI 5 r5 [+4 ]) (reg:SI 1 r1 [+4 ])) ../office_pointio.E:29 187 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg:SI 1 r1 [+4 ]) (nil))) (insn 147 158 83 4 (set (reg:DF 0 r0) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 40 [0x28])) [6 %sfp+-56 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 83 147 148 4 (set (mem:DF (reg/f:SI 13 sp) [0 S8 A64]) (reg:DF 0 r0)) ../office_pointio.E:30 205 {*thumb_movdf_insn} (expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 148 83 84 4 (set (reg:DF 2 r2) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 56 [0x38])) [6 %sfp+-40 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 84 148 149 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp) (const_int 8 [0x8])) [0 S8 A64]) (reg:DF 2 r2)) ../office_pointio.E:30 205 {*thumb_movdf_insn} (expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 149 84 85 4 (set (reg:DF 1 r1) (mem/c:DF (plus:SI (reg/f:SI 13 sp) (const_int 72 [0x48])) [6 %sfp+-24 S8 A64])) ../office_pointio.E:30 205 {*thumb_movdf_insn} (nil)) (insn 85 149 159 4 (set (mem:DF (plus:SI (reg/f:SI 13 sp) (const_int 16 [0x10])) [0 S8 A64]) (reg:DF 1 r1)) ../office_pointio.E:30 205 
{*thumb_movdf_insn} (expr_list:REG_DEAD (reg:DF 2 r2) (nil))) (insn 159 85 160 4 (set (reg:SI 0 r0) (reg:SI 4 r4 [180])) ../office_pointio.E:30 187 {*thumb1_movsi_insn} (nil)) (insn 160 159 87 4 (set (reg:SI 1 r1 [+4 ]) (reg:SI 5 r5 [+4 ])) ../office_pointio.E:30 187 {*
[Bug target/52412] another unnecessary register move on arm
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52412

amker.cheng changed:

           What    |Removed                     |Added
                 CC|                            |amker.cheng at gmail dot com

--- Comment #2 from amker.cheng 2012-07-31 14:12:54 UTC ---
The register move insn is generated by the cse2 pass, and after that there is no cprop pass until IRA. The two allocnos for r6/r3 (the original pseudos) conflict with each other; though they contain the same value and are connected by a move insn, IRA cannot allocate the same hard register for them. Moreover, the case is compiled with -Os, where gcc does IRA in one single region, and the live range cannot be split either.
[Bug rtl-optimization/54133] regrename introduces additional dependencies
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #2 from amker.cheng 2012-08-01 07:49:51 UTC ---
I measured this kind of regression in the CSiBE benchmark on arm-none-eabi/cortex-m0 with -Os optimization. It turns out most of them are related to parameter/return register moves, like the reported case. The logic is:

STEP1: At the prologue or after a call_insn, gcc saves parameter (or return) registers in pseudos, then loads them from the pseudos when they are needed (like calling another function with the parameter). For example:

{
  rx <- r0
  ...
  ...
  r0 <- rx
  call another function
}

If the instructions between saving and using do not clobber the parameter register, the hard register can be propagated to remove one redundant move instruction.

STEP2: Copy propagation before IRA just ignores hard registers, so usually this can only be done in regcprop.c after IRA. BUT,

STEP3: Register renaming does not honor any propagation opportunities and may use r0 for renaming, which introduces additional dependencies. It's a common regression because regrename always selects the renaming register from 0 to FIRST_PSEUDO_REGISTER.

In an experiment, if I disable r0/r1 from renaming, most regressions observed in CSiBE are gone.

So how should this be fixed? Thanks.
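
The experiment of keeping r0/r1 out of renaming can be pictured with a toy selector. This is purely illustrative and is not regrename.c: TOY_NUM_HARD_REGS, the bit-mask encoding of free registers, and toy_pick_rename_reg are all hypothetical names chosen for the sketch.

```c
#include <assert.h>

/* Hypothetical register count; stands in for the real upper bound. */
enum { TOY_NUM_HARD_REGS = 8 };

/* Return the lowest-numbered register that is free (bit set in
   free_mask) and not excluded (bit clear in avoid_mask), or -1 if
   there is none.  Scanning upward from register 0 mirrors the
   selection order described in the comment. */
int toy_pick_rename_reg(unsigned free_mask, unsigned avoid_mask)
{
    /* Only the low TOY_NUM_HARD_REGS bits are meaningful. */
    assert((free_mask >> TOY_NUM_HARD_REGS) == 0);

    for (int r = 0; r < TOY_NUM_HARD_REGS; r++) {
        unsigned bit = 1u << r;
        if ((free_mask & bit) && !(avoid_mask & bit))
            return r;
    }
    return -1;
}
```

With r0..r3 free, toy_pick_rename_reg(0x0F, 0) picks r0, the argument register a later "r0 <- rx" copy would want to reuse; toy_pick_rename_reg(0x0F, 0x03) picks r2 instead. When only r0/r1 are free, the excluded variant returns -1, that is, no renaming happens at all.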
[Bug rtl-optimization/54133] regrename introduces additional dependencies
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #5 from amker.cheng 2012-08-01 13:48:50 UTC ---
Thanks for your patch. IMHO, I don't think the problem can be fixed in this way, because:

1.
   78 r177:DF=r0:DF
   80 [sp:SI]=r166:DF
   81 [sp:SI+0x8]=r168:DF
   82 [sp:SI+0x10]=r170:DF
   84 r2:DF=r164:DF
   85 r0:DF=call [`bar'] argc:0x18
      REG_DEAD: r2:DF
      REG_UNUSED: r0:DF
   86 [sp:SI]=r167:DF
   87 [sp:SI+0x8]=r169:DF
   88 [sp:SI+0x10]=r171:DF
   89 r0:DF=r177:DF
      REG_DEAD: r177:DF
   90 r2:DF=r165:DF
   91 r0:DF=call [`bar'] argc:0x18

The propagation actually increases register pressure from insn 78 to insn 85, since r177 and r0 are both alive now. Maybe IRA makes a better decision in this case by spilling r177, but I doubt that holds in the common case.

2. The reported case is somewhat special, with all related insns limited to one basic block. In other cases, like those described in comment 2, the saving of the hard register is in the prologue, so the propagation crosses basic blocks.

Anyway, one thing is clear: the problem is closely connected with parameter/return register moves.
[Bug rtl-optimization/54133] regrename introduces additional dependencies
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54133

--- Comment #7 from amker.cheng 2012-08-02 10:18:41 UTC ---
(In reply to comment #6)
> > In experiment, if I disable r0/r1 from renaming, most regressions observed in
> > CSiBE are gone.
> >
> > So how should this be fixed? Thanks.
>
> The choice of the renaming register can be parameterized at the class level,
> but I'm not sure this would work here. You could also try to add some
> additional heuristics for this choice, as it seems to be clearly
> counter-productive here.

My bad that I did not mention the details of the method of disabling r0/r1 from renaming. When compared to trunk (where regrename is disabled for -Os), the method fixes most of the regrename regressions, which is good. But it is so conservative that some renaming opportunities are missed. From the view of code size, the data show that this method has 700/440 bytes of benefit/regression against the current implementation of regrename. This means only 250 bytes of benefit overall. The data is collected from CSiBE on arm cortex-m0.

Given that the regressions may cross basic blocks, it's hard to fix them in regrename without missing renaming opportunities.

Is it possible to run the regcprop pass both before and after regrename?
[Bug target/51835] ARM EABI violation when passing arguments to helper floating functions like __aeabi_d2iz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51835

--- Comment #6 from amker.cheng 2012-02-06 05:51:25 UTC ---
(In reply to comment #5)
> (In reply to comment #2)
> > This is only applicable to the 4.6 branch and trunk since support for the
> > Cortex M4 wasn't added till 4.6.
> >
> > cheers
> > Ramana
>
> Maybe the Cortex M4 wasn't added until 4.6, but the other options are permitted
> by 4.5 and I can easily get 4.5 to produce wrong-looking code. With -O2
> -mfloat-abi=hard -mfpu=fpv4-sp-d16 -march=armv7-a -marm I see the following
> code generation difference between 4.5 and 4.6:
>
> @@ -22,8 +22,9 @@
>   @ frame_needed = 0, uses_anonymous_args = 0
>   stmfd sp!, {r3, lr}
>   bl __aeabi_f2d
> + fmrrd r0, r1, d0
>   bl __aeabi_d2iz
>   ldmfd sp!, {r3, pc}
>   .size func, .-func
> - .ident "GCC: (GNU) 4.5.4 20120126 (prerelease)"
> + .ident "GCC: (GNU) 4.6.3 20120203 (prerelease)"
>   .section .note.GNU-stack,"",%progbits
>
> Backporting r183734 from 4.6 to 4.5 makes 4.5 generate the same code as 4.6,
> i.e., with the fmrrd between the two calls.

Besides this patch, Julian Brown's patch r174803 is necessary too.

For now:
1. arguments for both __aeabi_f2d and __aeabi_d2iz are wrong in 4.5;
2. arguments for __aeabi_f2d are wrong in 4.6.

To solve this, we have to:
1. backport r183734 and r174803 to 4.5;
2. backport r174803 to 4.6.
[Bug tree-optimization/43491] Unnecessary temporary for global register variable
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491 --- Comment #8 from amker.cheng 2012-02-17 03:55:24 UTC --- (In reply to comment #7) > With tree hoisting we generate > > : > pretmp.5_19 = data_0; > pretmp.5_20 = data_3; > i_21 = pretmp.5_19 + pretmp.5_20; > if (data_3(D) != 0) > goto ; > else > goto ; > > : > > : > # v_1 = PHI > # i_2 = PHI > D.1719_14 = v_1 * i_21; > D.1718_15 = i_2 * D.1719_14; > return D.1718_15; > > instead of > > : > if (data_3(D) != 0) > goto ; > else > goto ; > > : > pretmp.5_19 = data_0; > pretmp.5_21 = data_3; > i_23 = pretmp.5_19 + pretmp.5_21; > goto ; > > : > data_0.0_4 = data_0; > data_3.1_5 = data_3; > i_6 = data_0.0_4 + data_3.1_5; > > : > # v_1 = PHI > # i_2 = PHI > # i_24 = PHI > D.1719_14 = v_1 * i_24; > D.1718_15 = i_2 * D.1719_14; > return D.1718_15; > > } > > I suppose that's good enough? See that PRE still inserts loads from > register variables, not sure if you'd want to disallow that as well. I think the reason why gcc inserts loads from global register variable is gcc treats loads/uses of such variable as memory references. If I am right, It seems a ssa issue, rather than PRE. As for the original bug, it is caused by loading const global register variable, then using the loaded ssa var across function calls(this step by pre), which introduces unnecessary register conflict. I guess the load itself won't hurt, but not sure whether hoisting will(as pre had done before). BTW, I did not get the hoisted code on trunk. Is it a patch your are working on? Thanks.
[Bug middle-end/37780] Conditional expression with __builtin_clz() should be optimized out
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37780 --- Comment #2 from amker.cheng 2012-03-20 07:58:09 UTC --- The special case could easily be detected when gimplifying, but I am not sure whether it can be done in the middle end at all, since the middle end should not depend on target information such as CLZ_DEFINED_VALUE_AT_ZERO, right?
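The test case of this PR is not quoted in the comment, so the following is only a sketch of the kind of guarded expression the bug title refers to; the function names are made up for illustration. `__builtin_clz` has an undefined result for 0, so portable code adds a conditional. On a target whose CLZ_DEFINED_VALUE_AT_ZERO already yields the guard's value, that conditional is redundant, which is exactly the target-dependent knowledge the comment is worried about.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in unsigned int on the host. */
enum { UINT_BITS = (int)(sizeof(unsigned int) * CHAR_BIT) };

/* Count leading zeros with an explicit guard for zero, because
   __builtin_clz (0) has an undefined result in GNU C.  */
static int clz_guarded(unsigned int x)
{
    return x ? __builtin_clz(x) : UINT_BITS;
}

/* Small usage example (hypothetical helper, not from the PR). */
static void clz_guarded_selfcheck(void)
{
    assert(clz_guarded(0u) == UINT_BITS);
    assert(clz_guarded(1u) == UINT_BITS - 1);
    assert(clz_guarded(~0u) == 0);
}
```

Whether the conditional can be folded away depends on what the target's count-leading-zeros instruction returns for zero, which is why the decision is hard to make in a target-independent pass.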
[Bug target/52804] New: IRA/RELOAD allocate wrong register on ARM for cortex-m0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804 Bug #: 52804 Summary: IRA/RELOAD allocate wrong register on ARM for cortex-m0 Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: major Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com For following code code: void foo(unsigned char ** i, char *** o, unsigned int row, int num); extern signed long tab[]; extern unsigned int w; void foo(unsigned char ** i, char *** o, unsigned int row, int num) { register int r, g, b; register signed long * t = tab; register char * pi; register char * o0; register char * o1; register unsigned int c; unsigned int n = w; while (--num >= 0) { pi = *i++; o0 = o[0][row]; o1 = o[1][row]; row++; for (c = 0; c < n; c++) { r = ((int) (pi[0])); g = ((int) (pi[1])); b = ((int) (pi[2])); pi += 3; o0[c] = (unsigned char) ((t[r] + t[g] + t[b])); o1[c] = (unsigned char) ((t[r] + t[g] + t[b])); } } } Compile it with following command: $ arm-none-eabi-gcc -S -mthumb -mcpu=cortex-m0 -O2 -o foo.S foo.c comparing ira/reload dump as following: /* dump of ira: (insn 82 81 83 3 (set (reg/f:SI 281 [ *o_15(D) ]) (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (expr_list:REG_EQUIV (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32]) (nil))) (insn 83 82 84 3 (set (reg/v/f:SI 198 [ o0 ]) (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ]) (reg:SI 273 [ D.4183 ])) [2 *D.4088_18+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg/f:SI 281 [ *o_15(D) ]) (nil))) (insn 84 83 85 3 (set (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ] [275]) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn} (expr_list:REG_EQUIV (mem/f:SI (plus:SI (reg/v/f:SI 315 
[orig:275 o ] [275]) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32]) (nil))) (insn 85 84 171 3 (set (reg/v/f:SI 201 [ o1 ]) (mem/f:SI (plus:SI (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (reg:SI 273 [ D.4183 ])) [2 *D.4091_23+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (expr_list:REG_DEAD (reg:SI 273 [ D.4183 ]) (nil dump of reload: (note 82 81 207 3 NOTE_INSN_DELETED) (insn 207 82 208 3 (set (reg:SI 6 r6) (reg/v/f:SI 9 r9 [orig:275 o ] [275])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (nil)) (insn 208 207 209 3 (set (reg:SI 6 r6) (mem/f:SI (reg:SI 6 r6) [2 *o_15(D)+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (nil)) (insn 209 208 210 3 (set (reg:SI 7 r7) (mem/f:SI (plus:SI (reg:SI 6 r6) (reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4088_18+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (nil)) (insn 210 209 84 3 (set (reg/v/f:SI 12 ip [orig:198 o0 ] [198]) (reg:SI 7 r7)) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:18 186 {*thumb1_movsi_insn} (nil)) (note 84 210 211 3 NOTE_INSN_DELETED) (insn 211 84 85 3 (set (reg:SI 0 r0) (mem/f:SI (plus:SI (reg:SI 6 r6) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn} (nil)) (insn 85 211 171 3 (set (reg/v/f:SI 7 r7 [orig:201 o1 ] [201]) (mem/f:SI (plus:SI (reg:SI 0 r0) (reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4091_23+0 S4 A32])) ./gccmpsm0/obj_lite/cjpeg/jccolor-case.E:19 186 {*thumb1_movsi_insn} (nil)) */ Obviously, r6 is corrupted in insn 208, while it is used in insn 211. 
Piece of the generated assembly code:

foo:
    push {r4, r5, r6, r7, lr}
    mov r5, r9
    mov r7, fp
    mov r6, sl
    mov r4, r8
    push {r4, r5, r6, r7}
    mov sl, r3
    lsl r2, r2, #2
    ldr r3, .L11
    sub r2, r2, r0
    sub sp, sp, #20
    mov r9, r1      *step1
    sub r2, r2, #4
    ldr r1, [r3]
    ldr r5, .L11+4
    mov fp, r0
    str r2, [sp, #12]
.L8:
    mov r6, sl
    sub r6, r6, #1
    mov sl, r6
    bmi .L10
.L7:
    mov r0, fp
    ldr r4, [sp, #12]
    add r0, r0, #4
    mov r6, r9      *step2
    mov fp, r0
    ldr r6, [r6]    *step3, r6 corrupted
    mov r3, r4
    add r3,
[Bug target/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804 --- Comment #1 from amker.cheng 2012-04-03 16:43:30 UTC --- For insns before ira: (insn 82 81 83 3 (set (reg/f:SI 281 [ *o_15(D) ]) (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32])) pr52804.c:18 186 {*thumb1_movsi_insn} (expr_list:REG_EQUIV (mem/f:SI (reg/v/f:SI 315 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32]) (nil))) (insn 83 82 84 3 (set (reg/v/f:SI 198 [ o0 ]) (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ]) (reg:SI 273 [ D.4183 ])) [2 *D.4088_18+0 S4 A32])) pr52804.c:18 186 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg/f:SI 281 [ *o_15(D) ]) (nil))) (insn 84 83 85 3 (set (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ] [275]) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32])) pr52804.c:19 186 {*thumb1_movsi_insn} (expr_list:REG_EQUIV (mem/f:SI (plus:SI (reg/v/f:SI 315 [orig:275 o ] [275]) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32]) (nil))) (insn 85 84 171 3 (set (reg/v/f:SI 201 [ o1 ]) (mem/f:SI (plus:SI (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (reg:SI 273 [ D.4183 ])) [2 *D.4091_23+0 S4 A32])) pr52804.c:19 186 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ]) (expr_list:REG_DEAD (reg:SI 273 [ D.4183 ]) (nil The registers allocated are: r315 -> r9 r281 -> mem r273 -> r3 r198 -> r12 r201 -> r7 The insns need reload are like: insn 82 (deleted) insn 84 (deleted) insn 83 insn 85 The corresponding dump info of reload pass is like: Reloads for insn # 83 Reload 0: reload_in (SI) = (reg/v/f:SI 9 r9 [orig:275 o ] [275]) BASE_REGS, RELOAD_FOR_INPADDR_ADDRESS (opnum = 1) reload_in_reg: (reg/v/f:SI 9 r9 [orig:275 o ] [275]) reload_reg_rtx: (reg:SI 6 r6) Reload 1: reload_in (SI) = (mem/f:SI (reg/v/f:SI 9 r9 [orig:275 o ] [275]) [2 *o_15(D)+0 S4 A32]) LO_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1), can't combine reload_in_reg: (reg/f:SI 281 [ *o_15(D) ]) reload_reg_rtx: (reg:SI 
6 r6)
Reload 2: LO_REGS, RELOAD_FOR_INPUT_ADDRESS (opnum = 1), can't combine, secondary_reload_p
  reload_reg_rtx: (reg:SI 7 r7)
Reload 3: reload_in (SI) = (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ]) (reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4088_18+0 S4 A32])
  CORE_REGS, RELOAD_FOR_INPUT (opnum = 1)
  reload_in_reg: (mem/f:SI (plus:SI (reg/f:SI 281 [ *o_15(D) ]) (reg:SI 3 r3 [orig:273 D.4183 ] [273])) [2 *D.4088_18+0 S4 A32])
  reload_reg_rtx: (reg/v/f:SI 12 ip [orig:198 o0 ] [198])
  secondary_in_reload = 2

Reloads for insn # 85
Reload 0: reload_in (SI) = (reg/v/f:SI 9 r9 [orig:275 o ] [275])
  BASE_REGS, RELOAD_FOR_OPADDR_ADDR (opnum = 1)
  reload_in_reg: (reg/v/f:SI 9 r9 [orig:275 o ] [275])
  reload_reg_rtx: (reg:SI 6 r6)
Reload 1: reload_in (SI) = (mem/f:SI (plus:SI (reg/v/f:SI 9 r9 [orig:275 o ] [275]) (const_int 4 [0x4])) [2 MEM[(char * * *)o_15(D) + 4B]+0 S4 A32])
  LO_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 1), can't combine
  reload_in_reg: (reg/f:SI 282 [ MEM[(char * * *)o_15(D) + 4B] ])
  reload_reg_rtx: (reg:SI 0 r0)

We can see that, after reload, the insn sequence for insn 83/85 should be like:

insn 83:
  r6 = r9
  r6 = [r6]
  r7 = [r6 + r3]
  r12 = r7
insn 85:
  r6 = r9
  r0 = [r6 + 4]
  r7 = [r0 + r3]

***BUT*** the problem is that RELOAD forms wrong inheritance information when reloading insn 83: it assumes that r9 has been reloaded into r6 and that r6 is still valid for inheritance when reloading insn 85. As a result insn 85 uses r6, which has already been corrupted.

After looking into reload, I think function reload_reg_reaches_end_p misses the following case:

rld[0]  in : r9    reg_rtx : r6  when_needed : RELOAD_FOR_INPADDR_ADDRESS
rld[1]  in : [r9]  reg_rtx : r6  when_needed : RELOAD_FOR_INPUT_ADDRESS

In this case, the call of "reload_reg_reaches_end_p(regno(=6), reloadnum(=0))" should return 0, rather than 1 as it does now, because the r6 used in rld[0] is clobbered by rld[1].
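To make the effect of the wrong inheritance concrete, here is a toy simulation in C of the two instruction sequences listed above. It is only an illustration: the memory layout, the word-indexed addressing (so "+ 4" bytes becomes "+ 1" word) and all names are assumptions, and nothing here is taken from reload itself.

```c
#include <assert.h>

enum { R0 = 0, R3 = 3, R6 = 6, R7 = 7, R9 = 9, R12 = 12, NREGS = 16 };

/* Toy word-indexed memory.  o lives at index 1: o[0] = 4 and o[1] = 8 are
   the "addresses" of two row tables; with row = 2 the expected results are
   toy_mem[4 + 2] = 111 for o0 and toy_mem[8 + 2] = 222 for o1.  */
static const unsigned toy_mem[12] = { [1] = 4, [2] = 8, [6] = 111, [10] = 222 };

/* Run the sequences for insn 83 and insn 85 and return the final r7 (o1).
   If inherit_r6 is nonzero, insn 85 skips "r6 = r9" and trusts the stale r6,
   which models the wrong inheritance described above.  */
static unsigned run_insns_83_85(int inherit_r6)
{
    unsigned reg[NREGS] = { 0 };
    reg[R9] = 1;                           /* o */
    reg[R3] = 2;                           /* row */

    /* insn 83 */
    reg[R6] = reg[R9];                     /* r6 = r9 */
    reg[R6] = toy_mem[reg[R6]];            /* r6 = [r6], r6 no longer holds r9 */
    reg[R7] = toy_mem[reg[R6] + reg[R3]];  /* r7 = [r6 + r3] */
    reg[R12] = reg[R7];                    /* r12 = r7 */

    /* insn 85 */
    if (!inherit_r6)
        reg[R6] = reg[R9];                 /* r6 = r9 */
    reg[R0] = toy_mem[reg[R6] + 1];        /* r0 = [r6 + 4 bytes] */
    reg[R7] = toy_mem[reg[R0] + reg[R3]];  /* r7 = [r0 + r3] */
    return reg[R7];
}

/* Usage example: only the sequence that reloads r6 again loads o1.  */
static void reload_selfcheck(void)
{
    assert(run_insns_83_85(0) == 222);
    assert(run_insns_83_85(1) != 222);
}
```

The second call reads through the clobbered r6 and ends up loading an unrelated word, which corresponds to the corruption marked as step3 in the generated assembly.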
[Bug target/52804] IRA/RELOAD allocate wrong register on ARM for cortex-m0
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52804 --- Comment #2 from amker.cheng 2012-04-16 09:00:08 UTC --- Any comments? Or could anyone help me confirm this issue? Thanks very much.
[Bug rtl-optimization/55190] [SH] ivopts causes loop setup bloat
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55190 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #3 from bin.cheng --- ARM can benefit from the doloop structure too, but it is implemented in a different way. The ARM backend defines a special addsi_compare pattern and lets the combine pass merge the decrement and comparison instructions, thus saving the comparison instruction. IVOPT can be improved to select two iv candidates for the example loop: an auto-increment one for the memory access and a decrementing one for the loop exit check. This is especially good for targets that support both doloop and auto-increment instructions, like ARM and SH. BUT most hand-written loops have an incrementing basic iv, so IVOPT depends on the previous pass, ivcanon, to rewrite it into a decrementing iv, like below:

for (i = 0; i < 100; i++) //loop body
-->
for (i = 100; i > 0; i--) //modified loop body

Unfortunately, the ivcanon pass only does this transformation for loops that iterate a constant number of times. It seems difficult for RTL loop passes to revert a decision made by IVOPT, so I think it should be done in GIMPLE IVOPT. I will give it a try. Thanks.
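As a source-level sketch of the two-candidate selection described above, the following C functions compute the same sum. The first uses a single incrementing index; the second uses a forward-walking pointer for the memory access and a counter that counts down to zero for the exit test. The function names and the summing loop body are illustrative assumptions, not the example loop from the PR.

```c
#include <assert.h>

/* Hand-written form: one incrementing basic iv drives both the memory
   access and the exit test.  */
static long sum_incrementing(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Form with two ivs: the pointer is a candidate for auto-increment
   addressing, and the down-counter only needs a compare against zero.  */
static long sum_two_ivs(const int *a, int n)
{
    long s = 0;
    const int *p = a;
    for (int left = n; left > 0; left--)
        s += *p++;
    return s;
}

/* Usage example: both forms agree, including for an empty range.  */
static void two_ivs_selfcheck(void)
{
    static const int data[4] = { 1, 2, 3, 4 };
    assert(sum_incrementing(data, 4) == sum_two_ivs(data, 4));
    assert(sum_two_ivs(data, 0) == 0);
}
```

Here the trip count `n` is not a compile-time constant, which is the situation the comment says ivcanon does not handle.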
[Bug rtl-optimization/50749] Auto-inc-dec does not find subsequent contiguous mem accesses
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50749 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #15 from bin.cheng --- There must be another scenario for the example. In this case the example:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

should be translated into something like:

//...
ldrb [rx]
ldrb [rx+1]
ldrb [rx+2]
add rx, rx, #3
//...

This way all loads are independent and can be issued in parallel on a superscalar machine. Actually, for targets like ARM which support post-increment by a constant (other than the size of the memory access), it can be further changed into:

//...
ldrb [rx], #3
ldrb [rx-2]
ldrb [rx-1]
//...

For now the auto-increment pass can't do this optimization. I once had a patch for this, but benchmarks showed the case is not common. This case is common especially after loop unrolling, and rtl passes deliberately break down the long dependence on RX, which I think is right.
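The difference between the two load sequences can also be written at the C level. In this sketch (function names are illustrative, and the unused parameter `c` of `test_0` is dropped), the first function keeps the chain of pointer updates, while the second addresses all three bytes off the unchanged base, so no load depends on a previous pointer increment.

```c
#include <assert.h>

/* Chained form: each load uses the pointer value produced by the
   previous post-increment.  */
static int test_0_chained(const char *p)
{
    int r = 0;
    r += *p++;
    r += *p++;
    r += *p++;
    return r;
}

/* Independent form: constant offsets from one base, matching the
   [rx], [rx+1], [rx+2] sequence.  */
static int test_0_independent(const char *p)
{
    return p[0] + p[1] + p[2];
}

/* Usage example: both forms return the same sum.  */
static void test_0_selfcheck(void)
{
    static const char bytes[3] = { 1, 2, 3 };
    assert(test_0_chained(bytes) == test_0_independent(bytes));
}
```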
[Bug tree-optimization/39200] ivopts slows down SciMark sparse matrix benchmark
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39200 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #1 from bin.cheng --- This is pretty old. I tried latest trunk with revision r205025.

gcc -O2 -march=pentium4 [-fomit-frame-pointer]

.L7:
    movl (%esi,%eax,4), %edx
    fldl (%edi,%edx,8)
    fmull (%ebx,%eax,8)
    faddp %st, %st(1)
    addl $1, %eax
    cmpl %ecx, %eax
    jne .L7

gcc -O2 -march=pentium4 [-fomit-frame-pointer] -fno-ivopts

.L7:
    movl (%esi,%eax,4), %edx
    fldl (%edi,%edx,8)
    fmull (%ebx,%eax,8)
    faddp %st, %st(1)
    addl $1, %eax
    cmpl %eax, %ecx
    jg .L7

Also works for default arch in my configuration. Should this be considered fixed?
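For readers without the benchmark at hand, the inner loop above appears to correspond to a sparse-row dot product of the following shape. This is inferred from the addressing modes in the assembly (a 4-byte indexed load feeding an 8-byte indexed load, multiplied by another 8-byte indexed load); the function and parameter names are assumptions and this is not the SciMark source.

```c
#include <assert.h>

/* Dot product of one sparse row with a dense vector:
   sum of x[col[i]] * val[i] for i in [begin, end).  */
static double sparse_row_dot(const double *val, const int *col,
                             const double *x, int begin, int end)
{
    double sum = 0.0;
    for (int i = begin; i < end; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Usage example with exactly representable values:
   x[1] * 1.0 + x[0] * 2.0 = 20.0 + 20.0.  */
static void sparse_row_dot_selfcheck(void)
{
    static const double val[2] = { 1.0, 2.0 };
    static const int col[2] = { 1, 0 };
    static const double x[2] = { 10.0, 20.0 };
    assert(sparse_row_dot(val, col, x, 0, 2) == 40.0);
}
```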
[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445 --- Comment #13 from bin.cheng --- Sorry for the trouble. I have reverted the patch and will investigate it.
[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445 --- Comment #14 from bin.cheng --- I found out the root cause of this ICE and will use the simplified code given in comment #9 as an example. The gimple dump before IVOPT is like:

:
:
# c_2 = PHI
__val_comp_iter (D.4949);
p2 = D.4950;
c_6 = c_2 + 4294967292;
_21 = MEM[(int *)c_2 + 4294967292B];
if (a_11(D) != 0) goto ; else goto ;
:
c_3 = c_2 + 4;
goto ;
:
goto ;
:
# c_23 = PHI
# _24 = PHI <_29(11), _22(10)>
:
# c_20 = PHI
# c_15 = PHI
# _26 = PHI <_24(6), _21(5)>
if (_26 != 0) goto ; else goto ;
:
D::m_fn1 (&MEM[(struct G *)&p2].MFI);
if (_13(D) != 0) goto ; else goto ;
:
goto ;
:
*c_20 = 0;
c_7 = c_15 + 4294967292;
_22 = *c_7;
goto ;
:
*c_20 = 0;
c_28 = c_15 + 4294967292;
_29 = *c_28;
goto ;

With the patch:
STEP1: # c_20 = PHI is recognized as an iv.
STEP2: Since # c_15 = PHI comes from merging conditional branches, it shouldn't be marked as a biv in mark_bivs.
STEP3: When mark_bivs handles "# c_20 = PHI ", it should know that this is a peeled iv and not mark either iv(c_20) or incr_iv(c_15) as bivs.

Unfortunately the patch does not do this. It should add logic in mark_bivs to skip peeled ivs, rather than assert later when adding candidates for bivs. The following patch should fix this problem:

@@ -1074,7 +1074,7 @@ find_bivs (struct ivopts_data *data)
 static void
 mark_bivs (struct ivopts_data *data)
 {
-  gimple phi;
+  gimple phi, def;
   tree var;
   struct iv *iv, *incr_iv;
   struct loop *loop = data->current_loop;
@@ -1090,6 +1090,13 @@ mark_bivs (struct ivopts_data *data)
 	continue;
 
       var = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+      def = SSA_NAME_DEF_STMT (var);
+      /* Don't mark iv peeled from other one as biv.  */
+      if (def
+	  && gimple_code (def) == GIMPLE_PHI
+	  && gimple_bb (def) == loop->header)
+	continue;
+
       incr_iv = get_iv (data, var);
       if (!incr_iv)
 	continue;

PS: with the fixed version of the patch, the example code can be optimized by recognizing more address ivs. I attached the generated assembly code for ARM Cortex-M3.
[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445 --- Comment #15 from bin.cheng --- Created attachment 31414 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31414&action=edit The generated assembly with/without patch for code in comment #9 on cortex-m3
[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445 --- Comment #16 from bin.cheng --- I fixed the reported problem and posted a new patch at http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01159.html Apologies that I missed java in the bootstrap for the previous patch. This version passes bootstrap and test for c,c++,lto,fortran,java,go,objc,obj_c++ on x86_64. I am not sure whether the java case is covered by bootstrap or by some other application. If it's in another application, could anyone help verify that the issue is addressed on apple-darwin? Thanks.
[Bug tree-optimization/59479] New: Inlining of static function bloats code size when Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59479 Bug ID: 59479 Summary: Inlining of static function bloats code size when Os Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: amker.cheng at gmail dot com Created attachment 31424 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31424&action=edit The preprocessed file for newlib/libc/stdio/findfp.c Hi, for attached preprocessed code from newlib/libc/stdio/findfp.c, GCC inlines static function `std' even when optimizing for Os. With command line: $ ./arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0 -c -xc findfp.E -o findfp.o The dumped symbols are like: 21: 000916 FUNCGLOBAL DEFAULT1 _cleanup_r ... 29: 0055 224 FUNCGLOBAL DEFAULT1 __sinit ... 41: 01d524 FUNCGLOBAL DEFAULT1 __fp_unlock_all With command line: $ ./arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0 -c -xc findfp.E -o findfp.o -fno-inline The dumped symbols are like: 9: 0018 0 NOTYPE LOCAL DEFAULT1 $t 10: 001972 FUNCLOCAL DEFAULT1 std.isra.0 ... 24: 000916 FUNCGLOBAL DEFAULT1 _cleanup_r ... 36: 009d80 FUNCGLOBAL DEFAULT1 __sinit This occurs on trunk and 4_8 branch.
[Bug tree-optimization/59445] [4.9 Regression] ICE in add_old_iv_candidates, at tree-ssa-loop-ivopts.c:2541
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445 --- Comment #18 from bin.cheng --- Hi Dominique d'Humieres, Thanks for verifying it.
[Bug middle-end/39838] [4.7/4.8/4.9 regression] unoptimal code for two simple loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39838 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #15 from bin.cheng --- The situation gets a little bit better on 4_9 trunk. The -Os assembly code on cortex-m0 (thumb1 as reported) is like:

test:
    push {r0, r1, r2, r4, r5, r6, r7, lr}
    mov r6, r0
    mov r4, #0
    str r2, [sp, #4]
.L2:
    ldr r2, [r6]
    cmp r4, r2
    bge .L7
    mov r5, #0
    lsl r7, r4, #2
    add r2, r7, #4      <move to before XXX
    str r2, [sp]        <spill
.L3:
    ldr r3, [sp, #4]
    cmp r5, r3
    bge .L8
    ldr r3, [r6, #4]
    ldr r2, [sp]        <spill
    ldr r0, [r3, r7]
    ldr r1, [r3, r2]    <XXX
    bl func
    add r5, r5, #1
    b .L3
.L8:
    add r4, r4, #1
    b .L2
.L7:
    @ sp needed
    pop {r0, r1, r2, r4, r5, r6, r7, pc}
    .size test, .-test

IVOPT chooses the original biv for all uses in the outer loop; the regression comes from the long live range of "r2" and the corresponding spill. Then I realized that GCC IVOPT computes an iv (for non-linear uses) at its original place, so we may be able to teach IVOPT to compute the iv just before it's used in order to shrink the live range of the iv. The patch I had at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html is similar to this, only it computes iv uses at the appropriate place for iv uses outside the loop. But this idea won't help this specific case because LIM will hoist all the computation to basic block .L2 after IVOPT.
[Bug middle-end/39838] [4.7/4.8/4.9 regression] unoptimal code for two simple loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39838 --- Comment #16 from bin.cheng --- For optimization level O2, the dump before IVOPT is like:

:
_21 = p_6(D)->count;
if (_21 > 0) goto ; else goto ;
:
:
# i_26 = PHI
if (count_8(D) > 0) goto ; else goto ;
:
pretmp_23 = (sizetype) i_26;
pretmp_32 = pretmp_23 + 1;
pretmp_33 = pretmp_32 * 4;
pretmp_34 = pretmp_23 * 4;
:
# j_27 = PHI
_9 = p_6(D)->data;
_13 = _9 + pretmp_33;
_14 = *_13;
_16 = _9 + pretmp_34;
_17 = *_16;
func (_17, _14);
j_19 = j_27 + 1;
if (count_8(D) > j_19) goto ; else goto ;
:
goto ;
:
:
i_20 = i_26 + 1;
_7 = p_6(D)->count;
if (_7 > i_20) goto ; else goto ;
:
goto ;
:
return;

There might be two issues that block IVOPT from choosing the biv(i) for pretmp_33 and pretmp_34:

1) On some targets (like ARM), "(i << 2) + 4" can be done in one instruction. If the cost is the same as a simple shift or plus, then the overall cost of biv(i) is lower than that of the two candidate iv sets. GCC doesn't do such a check in get_computation_cost_at for now.

2) There is a CSE opportunity between the computations of pretmp_33 and pretmp_34. Following the dump, where pretmp_34 is i * 4 and pretmp_33 is (i + 1) * 4, they can be computed as below:

pretmp_34 = i << 2
pretmp_33 = pretmp_34 + 4

but GCC IVOPT is insensitive to such CSE opportunities between different iv uses. I guess this isn't easy, because unless the two uses are very close in the code (like this one), such CSE may avail to nothing.

These kinds of tweaks on cost are tricky (and most probably have no overall benefit) because the cost IVOPT computes from RTL is far from precise enough for such fine-grained tuning.

Another point, as Zdenek pointed out, is that IVOPT doesn't know that pretmp_33/pretmp_34 are going to be used in memory accesses, which means some of the address computation can be embedded in an appropriate addressing mode. In other words, the computation of pretmp_33/pretmp_34 shouldn't be honored when computing the overall cost and choosing the iv candidate set.
Since "_9 + pretmp_33/pretmp_34" is not affine iv, the only way to handle this issue is to lower both memory accesses before IVOPT, into some code like below: : pretmp_23 = (sizetype) i_26; pretmp_32 = pretmp_23 + 1; : # j_27 = PHI _9 = p_6(D)->data; _14 = MEM[_9 + pretmp_32 << 2]; _17 = MEM[_9 + pretmp_23 << 2]; func (_17, _14); j_19 = j_27 + 1; if (count_8(D) > j_19) goto ; else goto ; With this code, the iv uses are biv(i), pretmp_23(i_26) and pretmp_32(i_26+1), and IVOPT won't even add the annoying candidate.
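The CSE opportunity mentioned in point 2) of this comment rests on a simple identity. Going by the dump, pretmp_34 is i * 4 and pretmp_33 is (i + 1) * 4, so the second offset can reuse the first instead of repeating the add and the multiply. A minimal C sketch, with function names chosen for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* pretmp_34 in the dump: byte offset of element i.  */
static size_t offset_cur(size_t i)
{
    return i << 2;
}

/* pretmp_33 as the dump computes it: (i + 1) * 4, independently.  */
static size_t offset_next_separate(size_t i)
{
    return (i + 1) * 4;
}

/* pretmp_33 computed by reusing pretmp_34, i.e. the CSE form.  */
static size_t offset_next_shared(size_t i)
{
    return offset_cur(i) + 4;
}

/* Usage example: the two ways of computing the next offset agree.  */
static void offsets_selfcheck(void)
{
    for (size_t i = 0; i < 1000; i++)
        assert(offset_next_separate(i) == offset_next_shared(i));
}
```

On a target with a shifted-register addressing mode neither offset may need a separate instruction at all, which is the point of lowering the accesses to MEM[_9 + pretmp_32 << 2] and MEM[_9 + pretmp_23 << 2].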
[Bug tree-optimization/59479] Inlining of static function bloats code size when Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59479 --- Comment #2 from bin.cheng --- I will investigate it later. Just to clarify: the function is called three times by the caller, so inlining it would usually increase code size. BTW, could you explain a little about "2nd-order effect"? I am not familiar with the concept. Thanks in advance.
[Bug tree-optimization/52272] [4.7/4.8/4.9 regression] Performance regression of 410.bwaves on x86.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #21 from bin.cheng --- Hi Richard, I looked into PR50955 for which the mentioned commit causing this PR is applied: Commit 2012-02-06 Richard Guenther PR tree-optimization/50955 * tree-ssa-loop-ivopts.c (get_computation_cost_at): Artificially raise cost of expressions that replace an address with an expression based on a different pointer. I noticed that the offending non-linear use in PR50955 is actually from memory reference. If I understand the issue correct, the whole alias issue is introduced by rewriting iv use with one base_object through candidate with another incompatible base_object, and it is related to memory reference. An genuine non-linear iv use (the pointer never de-referenced, like in this PR) won't have this issue. So I come up this idea to relax the condition: - if (address_p) + if (address_p + || (use->iv->base_object + && cand->iv->base_object + && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object)) + && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object { /* Do not try to express address of an object with computation based on address of a different object. This may cause problems in rtl to non-linear uses which truly occurred in memory reference, something like: - if (address_p) + if (address_p + || (use->in_mem_ref_p + && use->iv->base_object + && cand->iv->base_object + && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object)) + && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object { /* Do not try to express address of an object with computation based on address of a different object. This may cause problems in rtl The flag in_mem_ref_p can be set for appropriate uses when finding interesting address uses. With this change, this PR should be resolved while not violating PR50955. I am not very much into 50955, so how does this sound? I can send a patch for review if the idea is in right direction. 
BTW, I cannot reproduce 50955 with the reported revision of GCC. The store isn't deleted by pass_cd_dce, though it is re-written just as the PR reported. So maybe I just misunderstood something. Any words? Thanks, bin
[Bug tree-optimization/50955] [4.7 Regression] IVopts incorrectly rewrite the address of a global memory access into a local form.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50955 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #17 from bin.cheng --- Hi Richard, I am having difficulty in understanding cases if this PR. For the reported case with two loops: for( y=0; y<4; y++, pDst += dstStep ) { for( x=y+1; x<4; x++ ) { s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2; pDst[x] = (unsigned char)s; } pDst[y] = p3; } The dump for statement 'pDst[y] = p3;' before IVOPT is like: : Invalid sum of incoming frequencies 1667, should be 278 y.2_64 = (sizetype) y_89; D.6421_65 = pDst_88 + y.2_64; *D.6421_65 = p3_37; pDst_69 = pDst_88 + pretmp.21_118; ivtmp.35_116 = ivtmp.35_87 - 1; if (ivtmp.35_116 != 0) goto ; else goto ; IVOPT chooses candidate 15: candidate 15 depends on 3 var_before ivtmp.154 var_after ivtmp.154 incremented before exit test type unsigned int base (unsigned int) pDst_39(D) - (unsigned int) &p1 step (unsigned int) (pretmp.21_118 + 1) for use 1: use 1 address in statement *D.6421_65 = p3_37; at position *D.6421_65 type unsigned char * base pDst_39(D) step pretmp.21_118 + 1 base object (void *) pDst_39(D) related candidates After rewriting, the dump is like: : Invalid sum of incoming frequencies 1667, should be 278 MEM[symbol: p1, index: ivtmp.154_200, offset: 0B] = p3_37; pDst_69 = pDst_88 + pretmp.21_118; ivtmp.149_218 = ivtmp.149_249 - 1; ivtmp.154_190 = ivtmp.154_200 + D.6617_250; if (x_40 != 4) goto ; else goto ; Eventually, the storing to TMR[p1,ivtmp,0] is considered local and deleted. BUT, for your reduced case: p3 = (unsigned char)(((signed int)p1[1] + (signed int)p2[1] + (signed int)p1[0] +(signed int)p1[0] + 2 ) >> 2 ); for( x=y+1; x<4; x++ ) { s = ( p1[x-y-1] + p1[x-y] + p1[x-y] + p1[x-y+1] + 2 ) >> 2; pDst[x] = (unsigned char)s; } pDst[y] = p3; It is about the the TMR in below dump (before IVOPT): : # vect_pp1.30_166 = PHI # vect_pp1.37_176 = PHI # vect_pp1.46_194 = PHI # vect_p.60_223 = PHI # ivtmp.64_225 = PHI ... 
MEM[(unsigned char *)vect_p.60_223] = vect_var_.58_219; vect_pp1.30_167 = vect_pp1.30_166 + 8; vect_pp1.37_177 = vect_pp1.37_176 + 8; vect_pp1.46_195 = vect_pp1.46_194 + 8; vect_p.60_224 = vect_p.60_223 + 8; ivtmp.64_226 = ivtmp.64_225 + 1; if (ivtmp.64_226 < bnd.27_128) goto ; else goto ; Your patch prevents IVOPT from choosing cand 4: candidate 4 (important) var_before ivtmp.110 var_after ivtmp.110 incremented before exit test type unsigned int base (unsigned int) (&p1 + 8) step 8 base object (void *) &p1 for use 3: use 3 generic in statement vect_p.60_223 = PHI at position type vector(8) unsigned char * base batmp.61_221 + 1 step 8 base object (void *) batmp.61_221 is a biv related candidates To prevent IVOPT from rewriting into: : # ivtmp.107_150 = PHI # ivtmp.110_241 = PHI D.6585_133 = (unsigned int) batmp.61_221; p1.131_277 = (unsigned int) &p1; D.6587_278 = D.6585_133 - p1.131_277; D.6588_279 = D.6587_278 + ivtmp.110_241; D.6589_280 = D.6588_279 + 4294967289; D.6590_281 = (vector(8) unsigned char *) D.6589_280; vect_p.60_223 = D.6590_281; ... MEM[(unsigned char *)vect_p.60_223] = vect_var_.58_219; ivtmp.107_256 = ivtmp.107_150 + 1; ivtmp.110_146 = ivtmp.110_241 + 8; if (ivtmp.107_256 < bnd.27_128) goto ; else goto ; Thus prevents IVOPT from generating candidate 15 in outer loop. (Expressing use 3 by cand 4 itself is good, right?) --- But, It seems because the check: if (address_p) { /* Do not try to express address of an object with computation based on address of a different object. This may cause problems in rtl level alias analysis (that does not expect this to be happening, as this is illegal in C), and would be unlikely to be useful anyway. */ if (use->iv->base_object && cand->iv->base_object && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0)) return infinite_cost; failed because cand(15)->iv->base_object == NULL. 
For the reported case, it's not about an iv use appearing in memory reference while not marked as address_p, and can be fixed by revise the existing check condition, is it true? PS, sorry for replying to a fixed PR, I found it's kind of impossible to fix PR52272 without fully understanding this one.
[Bug tree-optimization/50955] [4.7 Regression] IVopts incorrectly rewrite the address of a global memory access into a local form.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50955 --- Comment #19 from bin.cheng ---
> >
> >not about an iv use appearing in memory reference while not marked as
> >address_p, and can be fixed by revise the existing check condition, is
> >it true?
>
> No, even expressing an address this way is broken as for example dependence
> analysis via scev can get confused about the actual base object.

Agreed, only I think it's not scev's responsibility, since scev only cares about the base value initialized for the loop being analyzed, rather than the BASE object.

>
> IIRC previously we already avoided the mem-use case and I had to generalize it
> to also avoid addresses.

Not all. For the reported case, the use and cand are like:

use 3
  generic
  in statement vect_p.70_247 = PHI
  at position
  type vector(8) unsigned char *
  base batmp.71_245 + 1
  step 8
  base object (void *) batmp.71_245
  is a biv
  related candidates

candidate 15
  depends on 3
  var_before ivtmp.154
  var_after ivtmp.154
  incremented before exit test
  type unsigned int
  base (unsigned int) pDst_39(D) - (unsigned int) &p1
  step (unsigned int) (pretmp.21_118 + 1)

The check:

  if (address_p
      || (use->iv->base_object
          && cand->iv->base_object
          && POINTER_TYPE_P (TREE_TYPE (use->iv->base_object))
          && POINTER_TYPE_P (TREE_TYPE (cand->iv->base_object))))
    {
      /* Do not try to express address of an object with computation based
         on address of a different object. This may cause problems in rtl
         level alias analysis (that does not expect this to be happening,
         as this is illegal in C), and would be unlikely to be useful
         anyway. */
      if (use->iv->base_object
          && cand->iv->base_object
          && !operand_equal_p (use->iv->base_object, cand->iv->base_object, 0))
        return infinite_cost;
    }

still evaluates to false because:

  use->iv->base_object != NULL && cand->iv->base_object == NULL
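The reason the guard quoted in this comment does not fire can be shown with a stripped-down model of the condition. This is only a sketch: `struct toy_iv` and the function below are hypothetical stand-ins, and pointer inequality takes the place of `!operand_equal_p (...)`; none of this is the IVOPT code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the base_object field of struct iv.  */
struct toy_iv {
    const void *base_object;   /* NULL when no base object is recorded */
};

/* Model of the inner condition: reject (return 1, standing in for
   "return infinite_cost") only when both base objects are known and
   they differ.  */
static int rejected_by_base_object_check(const struct toy_iv *use,
                                         const struct toy_iv *cand)
{
    return use->base_object != NULL
           && cand->base_object != NULL
           && use->base_object != cand->base_object;
}

/* Usage example covering the three interesting combinations.  */
static void base_object_selfcheck(void)
{
    int obj_a = 0, obj_b = 0;
    struct toy_iv use = { &obj_a };
    struct toy_iv other = { &obj_b };
    struct toy_iv same = { &obj_a };
    struct toy_iv unknown = { NULL };

    assert(rejected_by_base_object_check(&use, &other) == 1);
    assert(rejected_by_base_object_check(&use, &same) == 0);
    /* The situation described for use 3 and candidate 15: the candidate
       has no base object, so the mismatch goes unnoticed.  */
    assert(rejected_by_base_object_check(&use, &unknown) == 0);
}
```

A candidate whose base is a difference of two addresses carries no single base object, so a check of this shape cannot tell that it mixes two objects.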
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #5 from bin.cheng --- I will have a look. Thanks.
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 --- Comment #6 from bin.cheng --- Hi, sorry, I don't have an m68k environment to do the bootstrap. Could anyone help dump "-fdump-tree-all-details -fdump-rtl-all-slim" with and without the patch for me? Otherwise I will have to revert the patch and hold it for the future. Hi Jakub, should I revert the patch for now? Thanks.
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 --- Comment #8 from bin.cheng ---
(In reply to Andreas Schwab from comment #1)
> Between r205951 and r205984.

(In reply to H.J. Lu from comment #7)
> (In reply to bin.cheng from comment #6)
> > Hi,
> > Sorry I don't have m68k environment to do the bootstrap, could anyone help
> > dump "-fdump-tree-all-details -fdump-rtl-all-slim" with and without the
> > patch for me? Otherwise I have to revert the patch and hold it for future.
> >
>
> Can't you use cross compiler on preprocessed input to debug it?

The bare-metal tool does not seem to handle the preprocessed file correctly, so I am trying to build cross Linux tools. Unfortunately, cross-ng only supports uclinux for m68k, and since I am not familiar with m68k-linux, I am having difficulty setting one up for now.
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 --- Comment #9 from bin.cheng --- It turns out my bare-metal cross tool works after deleting all the "# xxx file" lines from the preprocessed file, but why do these lines matter?
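For anyone wanting to repeat that experiment, the deletion step can be sketched as a small filter. This is an assumption-laden sketch, not the tool used above: `is_linemarker` and `strip_linemarkers` are hypothetical helpers, and a line is treated as a linemarker when it starts with '#', optional blanks and a digit (the `# linenum "file" flags` shape that cpp emits).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if LINE looks like a cpp linemarker, i.e.
   '#', optional blanks, then a decimal line number; 0 otherwise.  */
int is_linemarker(const char *line)
{
    if (*line != '#')
        return 0;
    line++;
    while (*line == ' ' || *line == '\t')
        line++;
    return isdigit((unsigned char)*line) != 0;
}

/* Copy IN to OUT, dropping linemarker lines.  Returns how many lines
   were dropped.  Lines longer than the buffer are handled in chunks.  */
long strip_linemarkers(FILE *in, FILE *out)
{
    char buf[4096];
    long dropped = 0;
    int at_line_start = 1;  /* next chunk begins a new line */
    int dropping = 0;       /* current line is a linemarker */

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);
        if (at_line_start) {
            dropping = is_linemarker(buf);
            if (dropping)
                dropped++;
        }
        if (!dropping)
            fputs(buf, out);
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }
    return dropped;
}
```

Calling `strip_linemarkers(stdin, stdout)` from a `main` turns this into a stdin-to-stdout filter. Note that directives such as `#pragma` are kept, since they do not start with a digit after the '#'.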
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 --- Comment #10 from bin.cheng ---
The offending loop before IVOPT looks like:

  :
  # var_index_1889 = PHI <1(924), var_index_983(923)>
  # var_index.250_1269 = PHI <1(924), var_index.250_1959(923)>
  if (var_index.250_1269 < _1237)
    goto ;
  else
    goto ;

  :
  loopi_952 = MEM[(const struct vec *)pretmp_2270].m_vecdata[var_index.250_1269];
  _947 = loopi_952->num;
  if (_947 == pretmp_2268)
    goto ;
  else
    goto ;

  :
  var_index_983 = var_index_1889 + 1;
  var_index.250_1959 = (unsigned int) var_index_983;
  goto ;

  :
  goto ;

The patch recognizes that var_index.250_1269 is an iv with {1, 1}_loop, and thus rewrites the loop into:

  :
  # var_index_1889 = PHI <1(924), var_index_983(923)>
  # ivtmp.1067_1968 = PHI
  var_index.250_1269 = (unsigned int) var_index_1889;
  if (var_index_1889 != _958)
    goto ;
  else
    goto ;

  :
  _111 = (void *) ivtmp.1067_1968;
  loopi_952 = MEM[base: _111, offset: 0B];
  ivtmp.1067_884 = ivtmp.1067_1968 + 4;
  _947 = loopi_952->num;
  if (_947 == pretmp_2268)
    goto ;
  else
    goto ;

  :
  var_index_983 = var_index_1889 + 1;
  goto ;

  :
  _1542 = pretmp_2270 + 12;
  ivtmp.1067_696 = (unsigned int) _1542;
  _958 = (int) _1237;
  goto ;

The transformation looks good and takes advantage of the post-increment addressing mode for the memory access "MEM[base: _111, offset: 0B]". The loop is expanded into rtl like:

 4438: L4438:
 1814: NOTE_INSN_BASIC_BLOCK 352
 1815: r626:SI=r817:SI
 1816: cc0=cmp(r817:SI,r492:SI)
 1817: pc={(cc0==0)?L4244:pc}
      REG_BR_PROB 900
 1818: NOTE_INSN_BASIC_BLOCK 353
 1819: r490:SI=[r829:SI]
 1820: r829:SI=r829:SI+0x4
 1821: cc0=cmp([r490:SI],r864:SI)
 1822: pc={(cc0!=0)?L4435:pc}
 ...
 4435: L4435:
 4436: NOTE_INSN_BASIC_BLOCK 952
 4437: r817:SI=r817:SI+0x1
 4439: pc=L4438
 4440: barrier
 4441: L4441:
 4442: NOTE_INSN_BASIC_BLOCK 953
 4443: r829:SI=r865:SI+0xc
 : r492:SI=r621:SI
 44: r817:SI=0x1
 4445: pc=L4438

Then instructions 1819/1820 are combined by the auto-inc-dec pass into:

 1819: r490:SI=[r829:SI++]
      REG_INC r829:SI
 1821: cc0=cmp([r490:SI],r864:SI)
      REG_DEAD r490:SI
 1822: pc={(cc0!=0)?L4435:pc}
      REG_BR_PROB 9550

The problem comes from reload, which puts both r490 and r829 into %a0 (reg 8?) and generates the code below:

 1819: %a0:SI=[%a0:SI++]
      REG_INC %a0:SI
 1821: cc0=cmp([%a0:SI],%d2:SI)
 1822: pc={(cc0!=0)?L4435:pc}
      REG_BR_PROB 9550

Insn 1819 is now bogus and causes the assertion failure in cselib.

In IRA, there are dumps like:

  Popping a1119(r829,l0: a921(r829,l17)) -- assign reg 8
  Popping a1122(r,l0: a924(r,l17)) -- assign reg 8
  Popping a1120(r494,l0: a922(r494,l17)) -- assign reg 9
  Popping a1147(r1054,l0: a1006(r1054,l15)) -- assign reg 8
  Popping a1157(r490,l0: a1124(r490,l17: a959(r490,l18))) -- assign reg 2

But in reload, there are dumps:

Reloads for insn # 1819
Reload 0: reload_in (SI) = (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
  reload_out (SI) = (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
  ADDR_REGS, RELOAD_FOR_OPERAND_ADDRESS (opnum = 1), inc by 4
  reload_in_reg: (post_inc:SI (reg:SI 829 [ ivtmp.1067 ]))
  reload_reg_rtx: (reg:SI 8 %a0)
Reload 1: reload_out (SI) = (reg/v/f:SI 490 [ loopi ])
  GENERAL_REGS, RELOAD_FOR_OUTPUT (opnum = 0), optional
  reload_out_reg: (reg/v/f:SI 490 [ loopi ])
Reload 2: reload_in (SI) = (mem/f:SI (post_inc:SI (reg:SI 829 [ ivtmp.1067 ])) [4 MEM[base: _111, offset: 0B]+0 S4 A16])
  GENERAL_REGS, RELOAD_FOR_INPUT (opnum = 1), optional
  reload_in_reg: (mem/f:SI (post_inc:SI (reg:SI 829 [ ivtmp.1067 ])) [4 MEM[base: _111, offset: 0B]+0 S4 A16])

So I am not sure whether there is a bug in reload for m68k, or whether ivopt is doing something tricky and wrong.

Thanks,
bin
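A source-level sketch of the two loop shapes may help when reading the dumps. This is not the code from the bootstrap failure: `struct loop_info`, `find_indexed` and `find_stepped` are hypothetical, and only the shape (index starting at 1, a load through a stepping pointer, a compare on `->num`) is taken from the GIMPLE above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the element type behind m_vecdata[]. */
struct loop_info {
    int num;
};

/* Indexed form, like the loop before IVOPT: the address is recomputed
   from the index on every iteration.  */
int find_indexed(struct loop_info *const *vec, int len, int wanted)
{
    for (int i = 1; i < len; i++) {
        const struct loop_info *loopi = vec[i];
        if (loopi->num == wanted)
            return i;
    }
    return -1;
}

/* Pointer-stepped form, like the loop after IVOPT: "loopi = *p++" is the
   load that a post-increment addressing mode can implement.  The loaded
   value and the stepping pointer are both live afterwards, so they need
   distinct registers.  */
int find_stepped(struct loop_info *const *vec, int len, int wanted)
{
    if (len < 2)
        return -1;
    struct loop_info *const *p = vec + 1;
    for (int i = 1; i < len; i++) {
        const struct loop_info *loopi = *p++;
        if (loopi->num == wanted)
            return i;
    }
    return -1;
}
```

Both functions return the same result for the same input. The rewritten insn 1819 above corresponds to the `*p++` load with the destination and the pointer forced into the same hard register.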
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 bin.cheng changed: What|Removed |Added CC||bernds at codesourcery dot com, ||uweigand at de dot ibm.com --- Comment #11 from bin.cheng --- Adding the reload maintainers for suggestions.
[Bug bootstrap/59536] [4.9 regression] internal compiler error: in cselib_record_set, at cselib.c:2376 breaks m68k-linux bootstrap
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59536 --- Comment #13 from bin.cheng ---
(In reply to Andreas Schwab from comment #12)
> -fno-auto-inc-dec avoids the crash. Dup of #52306?

It looks like it, AFAICT. Only this time it's blocking bootstrap :(
[Bug c++/59555] New: bogus error: template with C linkage with preprocessed c++ file
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59555
Bug ID: 59555
Summary: bogus error: template with C linkage with preprocessed c++ file
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: amker.cheng at gmail dot com

Created attachment 31478 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31478&action=edit
preprocessed c++ file

For the attached preprocessed file, arm-none-eabi-g++ and m68k-unknown-elf-g++ give the error messages below with either "-xc++" or "-xc++-cpp-output":

In file included from /daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:40:0,
                 from /daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:39,
                 from /daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
                 from ../../gcc/gcc/system.h:647,
                 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:63:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:66:3: error: template specialization with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/memoryfwd.h:70:3: error: template with C linkage
In file included from /daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:39:0,
                 from /daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
                 from ../../gcc/gcc/system.h:647,
                 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:52:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:55:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:59:3: error: template specialization with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/stringfwd.h:65:3: error: template specialization with C linkage
In file included from /daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:40:0,
                 from /daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
                 from ../../gcc/gcc/system.h:647,
                 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:111:3: error: template with C linkage
In file included from /daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:40:0,
                 from /daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25,
                 from ../../gcc/gcc/system.h:647,
                 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:214:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/bits/postypes.h:219:3: error: template with C linkage
In file included from /daten/cross/m68k-linux/m68k-linux/sys-root/usr/include/gmp.h:25:0,
                 from ../../gcc/gcc/system.h:647,
                 from ../../gcc/gcc/tree-loop-distribution.c:45:
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:76:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:79:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:82:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:85:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:88:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:91:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:95:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:99:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:103:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:107:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:110:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:113:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:116:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:119:3: error: template with C linkage
/daten/cross/m68k-linux/gcc-4.8/m68k-linux/include/c++/4.8.2/iosfwd:122:3: error: template with
[Bug middle-end/52306] [4.8/4.9 regression] ICE in cselib_record_set, at cselib.c:2158
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52306 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #27 from bin.cheng --- (In reply to Andreas Schwab from comment #26) > What does that mean, it's too late? We are in stage 3 now, enabling LRA needs non-trivial work, so it's very likely we can't make it work in time.
[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #3 from bin.cheng --- I will look into it.
[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519 --- Comment #4 from bin.cheng ---
First clue: with the patch, b_lsm.11_13 is recognized as chrec {1, +, 1}_2, so the loop can now be vectorized.

  :
  :
  # b.4_30 = PHI
  # prephitmp_28 = PHI
  # b_lsm.11_13 = PHI
  # ivtmp_46 = PHI
  c.1_9 = prephitmp_28 | 1;
  b.4_12 = b.4_30 + 1;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
    goto ;
  else
    goto ;

The problem arises in a call stack like:

  vect_do_peeling_for_loop_bound
    slpeel_tree_peel_loop_to_edge
      slpeel_update_phi_nodes_for_guard1

for the phi node:

  # b_lsm.11_13 = PHI

It looks like loop peeling has difficulty coping with a peeled phi node.
[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519 --- Comment #5 from bin.cheng ---
For the offending loop:

  :
  :
  # b.4_30 = PHI
  # prephitmp_28 = PHI
  # b_lsm.11_13 = PHI
  # ivtmp_46 = PHI
  c.1_9 = prephitmp_28 | 1;
  b.4_12 = b.4_30 + 1;
  ivtmp_45 = ivtmp_46 - 1;
  if (ivtmp_45 != 0)
    goto ;
  else
    goto ;

Now SCEV recognizes b_lsm.11_13 as {1,1}_2, and the vectorizer considers the loop vectorizable. The problem comes in function slpeel_update_phi_nodes_for_guard1 for the phi node:

  # b_lsm.11_13 = PHI

It is special because its loop_arg, b.4_12, has already been handled for a previous phi node and has a non-null current definition, resulting in an assertion failure at the line:

  gcc_assert (get_current_def (current_new_name) == NULL_TREE);

It seems the loop manipulation utility for vectorization can't cope with this kind of PEELED phi node. We can get more loops vectorized if we handle this issue in vectorization. For example, the more complicated example reported can then be vectorized successfully. But I think the case is a little difficult to handle, because the PEELED phi node may come before the phi node from which it is peeled (b.4_30, in this case), just like:

  :
  :
  # b_lsm.11_13 = PHI
[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519 --- Comment #7 from bin.cheng ---
(In reply to Jakub Jelinek from comment #6)
> Created attachment 31562 [details]
> gcc49-pr59519.patch
>
> I wonder if this isn't just a checking issue, the two PHI nodes created in
> *new_exit_bb have the same argument, so I think it is just fine if the two
> PHI results are used interchangeably, later optimization passes should
> hopefully coalesce them into a single IV.

I tested a similar patch before. It passed x86_64 bootstrap and the normal regression test, but some Ada cases (and one Go case) failed when I ran the regression test with the "-O3" option. The failures look like noise to me, though I am not sure about that. What are your test results?

One potential drawback is that it introduces additional PHIs/copies of different ssa names, which makes the generated code somewhat ugly and hard to read. As you pointed out, later passes should be able to coalesce them (although I am not sure about that, especially after seeing ssa names not get coalesced in some more regular cases).

Thanks.
[Bug tree-optimization/59519] [4.9 Regression] ICE on valid code at -O3 on x86_64-linux-gnu in slpeel_update_phi_nodes_for_guard1, at tree-vect-loop-manip.c:486
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59519 --- Comment #10 from bin.cheng ---
(In reply to Jakub Jelinek from comment #9)
> BTW, the patch can hardly regress anything, it only affects cases that ICEd
> before the patch.

Hmm, I am worried about whether vectorization handled peeled phi nodes correctly in every scenario before, because I barely know the implementation. That's why I asked for your suggestions in the first place.

Thanks.
[Bug rtl-optimization/43491] Unnecessary temporary for global register variable
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491 amker.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot ||com --- Comment #2 from amker.cheng 2011-11-23 05:50:51 UTC ---
I noticed that pass 097t.copyprop4 propagates reg.0_12 to statement Y in the following dump:

  :
  reg.0_12 = reg;
  D.4705_13 = MEM[(unsigned int *)reg.0_12 + 8B];   <-statement Z
  if (D.4705_13 != 0)
    goto ;
  else
    goto ;

  :
  :
  c ();
  reg.0_1 = reg.0_12;   <-statement X
  D.4705_3 = MEM[(unsigned int *)reg.0_1 + 8B];   <-statement Y
  if (D.4705_3 != 0)
    goto ;
  else
    goto ;

  :
  goto ;

  :
  return;

turning statements X and Y into:

  reg.0_1 = reg.0_12;   <-statement X
  D.4705_3 = MEM[(unsigned int *)reg.0_12 + 8B];   <-statement Y

So, should it propagate reg directly? Could this be done on ssa?

I also found that:
1) there are similar cases of redundant copies or constant loads, for example http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44025
2) some of these cases are generated after expanding into rtl;
3) redundant copies might be handled in IRA, but redundant constant loads might be more difficult.

How about extending the regcprop.c pass into a global pass?
[Bug rtl-optimization/43491] Unnecessary temporary for global register variable
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491 --- Comment #3 from amker.cheng 2011-11-24 09:24:37 UTC ---
(In reply to comment #1)
>
> I'm thinking that this is perfectly normal thing to do, and that the redundant
> move is meant to disappear in a later pass. My guess is that IRA is choosing
> not to assign the pseudo to r4, but I do not know why at the moment.

As the dump in 191r.sched1 shows:

(insn 5 7 6 2 (set (reg/f:SI 135 [ reg.0 ]) (reg/v:SI 4 r4 [ reg ])) pr43491.c:16 709 {*thumb2_movsi_insn} (expr_list:REG_DEAD (reg/v:SI 4 r4 [ reg ]) (nil)))
(insn 6 5 8 2 (set (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 + 8B] ]) (mem:SI (plus:SI (reg/f:SI 135 [ reg.0 ]) (const_int 8 [0x8])) [2 MEM[(unsigned int *)reg.0_12 + 8B]+0 S4 A32])) pr43491.c:16 709 {*thumb2_movsi_insn} (nil))
(jump_insn 8 6 49 2 (parallel [ (set (pc) (if_then_else (eq (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 + 8B] ]) (const_int 0 [0])) (label_ref:SI 22) (pc))) (clobber (reg:CC 24 cc)) ]) pr43491.c:16 747 {*thumb2_cbz} (expr_list:REG_DEAD (reg:SI 137 [ MEM[(unsigned int *)reg.0_12 + 8B] ]) (expr_list:REG_UNUSED (reg:CC 24 cc) (expr_list:REG_BR_PROB (const_int 900 [0x384]) (nil -> 22)
(code_label 49 8 48 3 4 "" [1 uses])
(note 48 49 16 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(note 16 48 14 3 NOTE_INSN_DELETED)
(call_insn 14 16 15 3 (parallel [ (call (mem:SI (symbol_ref:SI ("c") [flags 0x41] ) [0 c S4 A32]) (const_int 0 [0])) (use (const_int 0 [0])) (clobber (reg:SI 14 lr)) ]) pr43491.c:17 247 {*call_symbol} (nil) (nil))
(insn 15 14 17 3 (set (reg:SI 138 [ MEM[(unsigned int *)reg.0_12 + 8B] ]) (mem:SI (plus:SI (reg/f:SI 135 [ reg.0 ]) (const_int 8 [0x8])) [2 MEM[(unsigned int *)reg.0_12 + 8B]+0 S4 A32])) pr43491.c:16 709 {*thumb2_movsi_insn} (nil))

Since reg is manually declared in r4, function globalize_reg sets r4 in fixed_reg_set/call_used_reg_set/call_fixed_reg_set. IRA then adds r4 to allocno(r135)'s conflict_hard_regs. That's why IRA does not assign the pseudo (r135) to r4. I guess that is natural unless we can make IRA aware of the constant register.
[Bug rtl-optimization/43491] Unnecessary temporary for global register variable
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43491 --- Comment #4 from amker.cheng 2011-12-21 03:44:03 UTC --- This bug is even worse on mips. The cause is that ssa-pre eliminates the global register variable when it is the RHS of a single assignment statement, while the following passes do not handle the const/register attributes of the variable. It can be handled in tree-ssa-pre.c without hurting true redundancy elimination on global register variables. So could somebody change the tag from rtl-optimization to tree-optimization?
[Bug target/51835] New: ARM EABI violation when passing arguments to helper floating functions like __aeabi_d2iz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51835
Bug #: 51835
Summary: ARM EABI violation when passing arguments to helper floating functions like __aeabi_d2iz
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

For the following program:

int func(float f)
{
  double d = (double)f;
  return (int)d;
}

compile it with the following command:

$ arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -S test.c -o test.S

The generated assembly code is:

fun:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r3, lr}
	fmrs	r0, s0
	bl	__aeabi_f2d
	fmdrr	d0, r0, r1
	bl	__aeabi_d2iz
	pop	{r3, pc}
	.size	fun, .-fun

The argument of __aeabi_d2iz is passed in an fp register, while the ARM RTABI document says that such functions should use the soft-float ABI, even when -mfloat-abi=hard is specified. The problem exists at least on trunk and the 4.6 branch. I am working on a patch and will send it for review later.
[Bug middle-end/51867] New: GCC generates inconsistent code for same sources calling builtin calls, like sqrtf
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867
Bug #: 51867
Summary: GCC generates inconsistent code for same sources calling builtin calls, like sqrtf
Classification: Unclassified
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: trivial
Priority: P3
Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

Compile the following program:

#include <math.h>
int a(float x)
{
  return sqrtf(x);
}
int b(float x)
{
  return sqrtf(x);
}

with the command:

arm-none-eabi-gcc -mthumb -mhard-float -mfpu=fpv4-sp-d16 -mcpu=cortex-m4 -O0 -S a.c -o a.S

The generated assembly code looks like:

a:
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{r7, lr}
	sub	sp, sp, #8
	add	r7, sp, #0
	fsts	s0, [r7, #4]
	flds	s15, [r7, #4]
	fsqrts	s15, s15
	fcmps	s15, s15
	fmstat
	beq	.L2
	flds	s0, [r7, #4]
	bl	sqrtf
	fcpys	s15, s0
.L2:
	ftosizs	s15, s15
	fmrs	r3, s15	@ int
	mov	r0, r3
	add	r7, r7, #8
	mov	sp, r7
	pop	{r7, pc}
	.size	a, .-a
	.align	2
	.global	b
	.thumb
	.thumb_func
	.type	b, %function
b:
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{r7, lr}
	sub	sp, sp, #8
	add	r7, sp, #0
	fsts	s0, [r7, #4]
	flds	s0, [r7, #4]
	bl	sqrtf
	fcpys	s15, s0
	ftosizs	s15, s15
	fmrs	r3, s15	@ int
	mov	r0, r3
	add	r7, r7, #8
	mov	sp, r7
	pop	{r7, pc}
	.size	b, .-b

The problem exists on trunk and is triggered only at -O0. It occurs on the x86 target too.
[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867 --- Comment #1 from amker.cheng 2012-01-16 10:15:59 UTC ---
The cause is in function expand_builtin, where gcc checks the following conditions:

  /* When not optimizing, generate calls to library functions for a certain
     set of builtins.  */
  if (!optimize
      && !called_as_built_in (fndecl)
      && DECL_ASSEMBLER_NAME_SET_P (fndecl)
      && fcode != BUILT_IN_ALLOCA
      && fcode != BUILT_IN_ALLOCA_WITH_ALIGN
      && fcode != BUILT_IN_FREE)
    return expand_call (exp, target, ignore);

The control flow is:
1. DECL_ASSEMBLER_NAME_SET_P (fndecl) is false the first time, when compiling a;
2. It is then set by the following code when expanding the sqrtf call in function a;
3. When compiling function b, gcc checks DECL_ASSEMBLER_NAME_SET_P (fndecl) again, and this time it is true.
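The three steps above can be modelled with a few lines of C. This is a deliberately simplified toy, not GCC code: `struct fndecl_model`, `expand_call_site` and the enum are hypothetical, and the `called_as_built_in` and `fcode` tests from the real condition are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one decl property that matters here. */
struct fndecl_model {
    bool assembler_name_set;  /* models DECL_ASSEMBLER_NAME_SET_P */
};

enum expansion {
    EXPAND_AS_BUILTIN,  /* inline sequence, as in function a */
    EXPAND_AS_LIBCALL   /* plain call to sqrtf, as in function b */
};

/* Toy version of the -O0 decision for one call site.  The flag is read
   first and only set afterwards, as a side effect of expansion, so the
   first call site sees a different state from every later one.  */
enum expansion expand_call_site(struct fndecl_model *decl, bool optimize)
{
    enum expansion result = EXPAND_AS_BUILTIN;

    if (!optimize && decl->assembler_name_set)
        result = EXPAND_AS_LIBCALL;

    decl->assembler_name_set = true;
    return result;
}
```

Calling `expand_call_site` twice on the same decl with `optimize` false gives `EXPAND_AS_BUILTIN` and then `EXPAND_AS_LIBCALL`, mirroring the different code generated for a and b.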
[Bug middle-end/51867] GCC generates inconsistent code for same sources calling builtin calls, like sqrtf
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51867 --- Comment #3 from amker.cheng 2012-01-17 10:35:14 UTC ---
Test case c-c++-common/dfp/signbit-2.c depends on this check. The case looks like:

/* { dg-options "-O0" } */

/* Check that the compiler uses builtins for signbit; if not the link
   will fail because library functions are in libm.  */

#include "dfp-dbg.h"

volatile _Decimal32 sd = 2.3df;
volatile _Decimal64 dd = -4.5dd;
volatile _Decimal128 tf = 5.3dl;
volatile float f = 1.2f;
volatile double d = -7.8;
volatile long double ld = 3.4L;

EXTERN int signbitf (float);
EXTERN int signbit (double);
EXTERN int signbitl (long double);
EXTERN int signbitd32 (_Decimal32);
EXTERN int signbitd64 (_Decimal64);
EXTERN int signbitd128 (_Decimal128);

int
main ()
{
  if (signbitf (f) != 0) FAILURE
  if (signbit (d) == 0) FAILURE
  if (signbitl (ld) != 0) FAILURE
  if (signbitd32 (sd) != 0) FAILURE
  if (signbitd64 (dd) == 0) FAILURE
  if (signbitd128 (tf) != 0) FAILURE

  FINISH
}

It is compiled without optimization and will fail if no builtin_* functions are used. I am not sure whether this is intended.
[Bug tree-optimization/88932] [8/9 Regression] ICE: verify_ssa failed (Error: definition in block 29 does not dominate use in block 25)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88932 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #4 from bin.cheng ---
(In reply to Jakub Jelinek from comment #3)
> This has been approved for trunk, are you going to commit it?

Thanks for the reminder; I will commit it tomorrow. I also need approval for the 8 branch.
[Bug tree-optimization/82965] [8 regression][armeb] gcc.dg/vect/pr79347.c starts failing after r254379
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82965 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #10 from bin.cheng --- A proposed patch is at https://gcc.gnu.org/ml/gcc-patches/2018-01/msg02419.html
[Bug tree-optimization/28364] poor optimization choices when iterating over a std::string (probably not c++-specific)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28364 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #31 from bin.cheng --- This is a really old issue! I will also check status of this issue on trunk.
[Bug tree-optimization/49498] [4.7/4.8 Regression]: gcc.dg/uninit-pred-8_b.c bogus warning line 20
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49498 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot ||com --- Comment #17 from bin.cheng 2012-11-20 07:08:18 UTC ---
Hi, I spent some time analyzing this bug and I think I understand the problem now. Consider the dump file below, from trunk/cris-elf when compiling the attached k.c:

;; Function foo (foo, funcdef_no=0, decl_uid=1323, cgraph_uid=0)

;; 1 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 5 6 7 8 9 10 11
;; 2 succs { 10 3 }
;; 3 succs { 11 4 }
;; 4 succs { 11 }
;; 5 succs { 6 7 }
;; 6 succs { 9 }
;; 7 succs { 6 8 }
;; 8 succs { 9 }
;; 9 succs { 1 }
;; 10 succs { 5 6 }
;; 11 succs { 5 8 }
foo (int n, int l, int m, int r)
{
  int v;
  int g.1;
  int g.0;

  :
  if (n_4(D) <= 9) goto ; else goto ;

  :
  if (m_5(D) > 100) goto ; else goto ;

  :
  goto ;

  :
  # v_14 = PHI
  g.0_9 = g;
  g.1_10 = g.0_9 + 1;
  g = g.1_10;
  if (n_4(D) <= 9) goto ; else goto ;

  :
  # v_17 = PHI
  blah (v_17);
  goto ;

  :
  if (m_5(D) > 100) goto ; else goto ;

  :
  :
  return 0;

  :
  if (m_5(D) != 0) goto ; else goto ;

  :
  # v_13 = PHI
  if (m_5(D) != 0) goto ; else goto ;
}

There are two flaws in tree-ssa-uninit.c that this bug reveals.

1. GCC tries to find def_chains from cd_root (the closest dominating bb of phi_bb) to phi_bb, but only finds use_predicates from phi_bb to use_bb. In the general case with a canonical CFG this is fine, but in a non-canonical CFG it is possible for def_chains to contain an ancestor basic block of phi_bb that has a branch which never reaches phi_bb, like basic block 10 reported in this PR. In this scenario the corresponding condition should not be counted in def_chains (edge<10, 5> in this case). There are two methods to fix this:
   a) find use predicates from dom(phi_bb), rather than phi_bb, in non-canonical CFGs.
   b) prune branch conditions in def_chains that are irrelevant to this use/def.
   Method a) is simpler, but the problem is that it results in more dep_chains, which might exceed the limit MAX_NUM_CHAINS. As for method b), I don't yet have any idea how to implement it.

2. When calling is_use_properly_guarded in find_uninit_use, GCC finds predicates from the source basic block if the use_stmt is a phi node. This results in a missing condition at the end of each def_chain. Unlike the first issue, this can be fixed easily.
[Bug tree-optimization/55424] New: [4.8 Regression]gcc.dg/uninit-pred-8_b.c bogus warning line 23 on ARM/Cortex-M0/-Os
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55424
Bug #: 55424
Summary: [4.8 Regression]gcc.dg/uninit-pred-8_b.c bogus warning line 23 on ARM/Cortex-M0/-Os
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

The test case requires optimization level "-O2" and it passes on ARM/cortex-m0 with "-O2", but the failure with "-Os" does reveal a potential bug in tree-ssa-uninit.c.

Test command line:

arm-none-eabi-gcc ./uninit-pred-8_b.c -fno-diagnostics-show-caret -Wuninitialized -fno-tree-dominator-opts -S -mthumb -mcpu=cortex-m0 -Os -o uninit-pred-8_b.s

The warning info:

.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c: In function 'foo':
.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c:23:11: warning: 'v' may be used uninitialized in this function [-Wmaybe-uninitialized]
.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c: In function 'foo_2':
.../trunk-orig/gcc/gcc/testsuite/gcc.dg/uninit-pred-8_b.c:42:11: warning: 'v' may be used uninitialized in this function [-Wmaybe-uninitialized]

This failure occurs after r193687 was checked in. That patch prefers to generate branches on ARM/cortex-m0.

After investigating the tree dump of tree-ssa-uninit.c, I think the following. tree-ssa-uninit.c computes the control dependent chains for the uses/defs of a variable and checks whether each use is guarded by a def. It has an upper bound on the number of control dependent chains (MAX_NUM_CHAINS==8) and simply falls back to issuing a false warning if the number of chains exceeds MAX_NUM_CHAINS. In our scenario, the number of chains exceeds MAX_NUM_CHAINS because we now prefer short-circuit branches, resulting in the false warning.

These false warnings cannot be fully removed as long as MAX_NUM_CHAINS exists, but the situation can be improved in the following way: tree-ssa-uninit.c currently computes many invalid control dependent chains, and these should be pruned.

I have already implemented a quick fix and it works for our scenario. I am not sure whether it should be fixed in this way, so please comment if you have an opinion.

Thanks

Dump of tree-ssa-uninit.c:

;; Function foo (foo, funcdef_no=0, decl_uid=4065, cgraph_uid=0)

Use in stmt v_24 = PHI
is guarded by :
 (.NOT.) if (m_6(D) != 0)

Operand defs of phi v_1 = PHI
is guarded by :
 (.NOT.) if (n_5(D) <= 9)
(.AND.)
 (.NOT.) if (m_6(D) > 100)
(.AND.)
 if (r_7(D) <= 19)
(.OR.)
 if (n_5(D) <= 9)
(.OR.)
 (.NOT.) if (n_5(D) <= 9)
(.AND.)
 (.NOT.) if (m_6(D) > 100)
(.AND.)
 (.NOT.) if (r_7(D) <= 19)
(.AND.)
 if (l_8(D) != 0)

foo (int n, int l, int m, int r)
{
  int v;
  int g.1;
  int g.0;

  :
  if (n_5(D) <= 9) goto ; else goto ;

  :
  if (m_6(D) > 100) goto ; else goto ;

  :
  if (r_7(D) <= 19) goto ; else goto ;

  :
  if (l_8(D) != 0) goto ; else goto ;

  :
  :
  # v_1 = PHI
  if (m_6(D) != 0) goto ; else goto ;

  :
  # v_25 = PHI
  g.0_11 = g;
  g.1_12 = g.0_11 + 1;
  g = g.1_12;
  goto ;

  :
  bar ();

  :
  # v_24 = PHI
  if (n_5(D) <= 9) goto ; else goto ;

  :
  if (m_6(D) > 100) goto ; else goto ;

  :
  if (r_7(D) <= 19) goto ; else goto ;

  :
  if (m_6(D) > 100) goto ; else goto ;

  :
  blah (v_24);
  if (n_5(D) <= 9) goto ; else goto ;

  :
  blah (v_24);
  goto ;

  :
  if (r_7(D) <= 9) goto ; else goto ;

  :
  return 0;

  :
  # v_22 = PHI
  goto ;
}
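The effect of a fixed cap on the number of chains can be illustrated with a toy path counter. This is only a sketch under loud assumptions: it is not the algorithm in tree-ssa-uninit.c, the graph type and the functions `chains_fit` and `build_diamonds` are hypothetical, and a "chain" is modelled simply as a path between two nodes of a small DAG. Only the cap value 8 is taken from the report.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NUM_CHAINS 8   /* cap value quoted in the report */
#define MAX_NODES 32       /* toy limit for this sketch only */

struct dag {
    int n;
    bool edge[MAX_NODES][MAX_NODES];
};

/* Count paths from FROM to TO, giving up once more than MAX_NUM_CHAINS
   paths have been seen.  FOUND is the count accumulated so far.  */
static int count_paths(const struct dag *g, int from, int to, int found)
{
    if (from == to)
        return found + 1;
    for (int next = 0; next < g->n && found <= MAX_NUM_CHAINS; next++)
        if (g->edge[from][next])
            found = count_paths(g, next, to, found);
    return found;
}

/* True if every chain fits under the cap.  False models the case where
   the analysis gives up and a conservative warning is issued.  */
bool chains_fit(const struct dag *g, int from, int to)
{
    return count_paths(g, from, to, 0) <= MAX_NUM_CHAINS;
}

/* Build K diamonds in sequence (K <= 10): node 3*i branches to 3*i+1 and
   3*i+2, which both join at 3*i+3.  There are 2^K paths from 0 to 3*K.  */
void build_diamonds(struct dag *g, int k)
{
    memset(g, 0, sizeof *g);
    g->n = 3 * k + 1;
    for (int i = 0; i < k; i++) {
        int top = 3 * i;
        g->edge[top][top + 1] = true;
        g->edge[top][top + 2] = true;
        g->edge[top + 1][top + 3] = true;
        g->edge[top + 2][top + 3] = true;
    }
}
```

Three consecutive two-way branches give 8 paths and still fit; a fourth gives 16 and does not. That is the sense in which preferring short-circuit branches can push an otherwise analyzable function over the limit.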
[Bug tree-optimization/49498] [4.7/4.8 Regression]: gcc.dg/uninit-pred-8_b.c bogus warning line 20
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49498 --- Comment #19 from bin.cheng 2012-11-21 13:24:02 UTC --- (In reply to comment #18) > *** Bug 55424 has been marked as a duplicate of this bug. *** Just for the record. If the analysis I gave in Comment #17 is right, this PR reveals another bug in tree-ssa-uninit.c, apart from the limitation of MAX_NUM_CHAINS, while PR55424 is only about MAX_NUM_CHAINS.
[Bug rtl-optimization/54910] ARM: Missed optimization of very simple ctz function
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54910 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot ||com --- Comment #2 from bin.cheng 2012-11-29 02:17:37 UTC --- This is fixed by replacing the ldr of the constant with movw/movt. Unfortunately, the problem still exists on Thumb1/Cortex-M0, since there are no movw/movt instructions there.
[Bug tree-optimization/55906] New: suboptimal code generated for post-inc on Thumb1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55906
Bug #: 55906
Summary: suboptimal code generated for post-inc on Thumb1
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: amker.ch...@gmail.com

For the program below:

int
ffs(int word)
{
  int i;

  if (!word)
    return 0;

  i = 0;
  for (;;)
    {
      if (((1 << i++) & word) != 0)
        return i;
    }
}

the dump of 164t.optimized looks like:

ffs (int word)
{
  int i;
  int _6;
  int _7;

  :
  if (word_3(D) == 0)
    goto ;
  else
    goto ;

  :
  :
  # i_1 = PHI <0(3), i_5(5)>
  i_5 = i_1 + 1;
  _6 = word_3(D) >> i_1;
  _7 = _6 & 1;
  if (_7 != 0)
    goto ;
  else
    goto ;

  :
  goto ;

  :
  # i_2 = PHI <0(2), i_5(4)>
  return i_2;
}

GCC increments i before i_1 is used, causing i_5 and i_1 to be placed in different partitions, as the expanded rtl shows:

    2: r115:SI=r0:SI
    3: NOTE_INSN_FUNCTION_BEG
    9: pc={(r115:SI==0)?L33:pc}
      REG_BR_PROB 0xf3c
   10: NOTE_INSN_BASIC_BLOCK 4
    4: r110:SI=0
   18: L18:
   11: NOTE_INSN_BASIC_BLOCK 5
   12: r111:SI=r110:SI+0x1    <- i_5/i_1 in different pseudos
   13: r116:SI=r115:SI>>r110:SI
   14: r118:SI=0x1
   15: r117:SI=r116:SI&r118:SI
      REG_EQUAL r116:SI&0x1
   16: pc={(r117:SI!=0)?L21:pc}
      REG_BR_PROB 0x384
   17: NOTE_INSN_BASIC_BLOCK 6
    5: r110:SI=r111:SI
   19: pc=L18
   20: barrier
   33: L33:
   32: NOTE_INSN_BASIC_BLOCK 7
    6: r111:SI=0
   21: L21:
   22: NOTE_INSN_BASIC_BLOCK 8
   23: r114:SI=r111:SI
   27: r0:SI=r114:SI
   30: use r0:SI

Finally, suboptimal code is generated:

ffs:
	mov	r3, #0
	push	{r4, lr}
	cmp	r0, r3
	beq	.L2
	mov	r2, r3
	mov	r1, #1
.L3:
	mov	r4, r0
	asr	r4, r4, r2
	add	r3, r2, #1
	tst	r4, r1
	bne	.L2
	mov	r2, r3
	b	.L3
.L2:
	mov	r0, r3
	@ sp needed
	pop	{r4, pc}

while GCC 4.6 generates better code:

ffs:
	push	{lr}
	sub	r3, r0, #0
	beq	.L2
	mov	r3, #0
	mov	r2, #1
.L3:
	mov	r1, r0
	asr	r1, r1, r3
	add	r3, r3, #1
	tst	r1, r2
	beq	.L3
.L2:
	mov	r0, r3
	@ sp needed for prologue
	pop	{pc}

The command line is:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os -S ffs.c -o ffs.S

The same problem exists when optimizing with "-O2".
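For sanity-checking either instruction sequence against expected results, a reference version of the loop can be handy. The sketch below is a variant, not the reported source: `ffs_loop` is a hypothetical name (chosen to avoid clashing with `ffs` from <strings.h>), and the shift is done on an unsigned value, which is my change so that testing bit 31 does not shift a signed 1 into the sign bit.

```c
/* Returns the 1-based position of the lowest set bit of WORD,
   or 0 when WORD is 0.  */
int ffs_loop(int word)
{
    unsigned int bits = (unsigned int)word;
    int i = 0;

    if (bits == 0)
        return 0;

    for (;;) {
        /* Test bit i, then advance i, as the i++ in the report does:
           the value returned is the incremented i.  */
        if (((1u << i++) & bits) != 0)
            return i;
    }
}
```

For example, `ffs_loop(8)` returns 4 and `ffs_loop(12)` returns 3.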
[Bug target/56058] New: GCC arm-none-eabi build failure
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56058 Bug #: 56058 Summary: GCC arm-none-eabi build failure Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: major Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com I configured the gcc with: ../gcc/configure --build=i686-linux-gnu --host=i686-linux-gnu --target=arm-none-eabi --prefix=... --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-lto --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-sysroot=... --with-gmp=... --with-mpfr=... --with-mpc=... --with-ppl=... --with-cloog=... --with-libelf=... --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --enable-languages=c,c++ And it failed with message: build/gengtype \ -S ../../gcc/gcc -I gtyp-input.list -w tmp-gtype.state /bin/sh ../../gcc/gcc/../move-if-change tmp-gtype.state gtype.state build/gengtype \ -r gtype.state echo timestamp > s-gtype build/genattrtab ../../gcc/gcc/config/arm/arm.md insn-conditions.md \ -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c genattrtab: unknown value `alu' for `type' attribute make[1]: *** [s-attrtab] Error 1 make[1]: Leaving directory `/home/binche01/work/gcc-patches/arm-none-eabi/trunk-scan_one_insn/build/gcc' make: *** [all-gcc] Error 2 It works if I revert r195295
[Bug target/56102] New: Wrong rtx cost calculated for Thumb1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102 Bug #: 56102 Summary: Wrong rtx cost calculated for Thumb1 Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com

For the program below:

double g = 1.0;
double func(int a, double d)
{
  if (a > 0)
    return 0.0 + g;
  else
    return 2.0 + d;
}

compiling with:

./arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os test.c -S -o test.S

The assembly code is:

    .cpu cortex-m0
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 1
    .eabi_attribute 30, 4
    .eabi_attribute 34, 0
    .eabi_attribute 18, 4
    .code 16
    .file "main.c"
    .global __aeabi_dadd
    .text
    .align 1
    .global func
    .code 16
    .thumb_func
    .type func, %function
func:
    push  {r3, lr}
    cmp   r0, #0
    ble   .L2
    ldr   r3, .L6+16
    ldr   r0, [r3]
    ldr   r1, [r3, #4]
    ldr   r3, .L6+4
    ldr   r2, .L6
    b     .L4
.L2:
    mov   r0, r2
    mov   r1, r3
    ldr   r2, .L6+8
    ldr   r3, .L6+12
.L4:
    bl    __aeabi_dadd
    @ sp needed
    pop   {r3, pc}
.L7:
    .align 3
.L6:
    .word 0
    .word 0
    .word 0
    .word 1073741824
    .word .LANCHOR0
    .size func, .-func
    .global g
    .data
    .align 3
    .set .LANCHOR0,. + 0
    .type g, %object
    .size g, 8
g:
    .word 0
    .word 1072693248
    .ident "GCC: (GNU) 4.8.0 20130122 (experimental)"

The problem is that the double-word constant isn't split by GCC, which results in larger code.
[Bug target/56102] Wrong rtx cost calculated for Thumb1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102 --- Comment #1 from bin.cheng 2013-01-25 03:46:59 UTC --- I have investigated this issue. GCC uses function init_lower_subreg to initialize costs of MOVE insn with different mode, then uses this information to decompose multi-word pseudo registers into individual registers. The problem is ARM backend returns wrong rtx cost for SET insn with multi-word mode. Specifically, if you define LOG_COSTS in lower-subreg.c, GCC will dump rtx costs when compiling with: arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os/-O2 The dump is: Size costs == SI move: from zero cost 4, from reg cost 4 DI move: original cost 4, split cost 4 * 2 TI move: original cost 4, split cost 4 * 4 EI move: original cost 4, split cost 4 * 6 OI move: original cost 4, split cost 4 * 8 CI move: original cost 4, split cost 4 * 12 XI move: original cost 4, split cost 4 * 16 DQ move: original cost 4, split cost 4 * 2 TQ move: original cost 4, split cost 4 * 4 UDQ move: original cost 4, split cost 4 * 2 UTQ move: original cost 4, split cost 4 * 4 DA move: original cost 4, split cost 4 * 2 TA move: original cost 4, split cost 4 * 4 UDA move: original cost 4, split cost 4 * 2 UTA move: original cost 4, split cost 4 * 4 DF move: original cost 4, split cost 4 * 2 XF move: original cost 4, split cost 4 * 3 DD move: original cost 4, split cost 4 * 2 TD move: original cost 4, split cost 4 * 4 CSI move: original cost 4, split cost 4 * 2 CDI move: original cost 4, split cost 4 * 4 CTI move: original cost 4, split cost 4 * 8 CEI move: original cost 4, split cost 4 * 12 COI move: original cost 4, split cost 4 * 16 CCI move: original cost 4, split cost 4 * 24 CXI move: original cost 4, split cost 4 * 32 SC move: original cost 4, split cost 4 * 2 DC move: original cost 4, split cost 4 * 4 XC move: original cost 4, split cost 4 * 6 V8QI move: original cost 4, split cost 4 * 2 V4HI move: original cost 4, split cost 4 * 2 V2SI move: original cost 4, split cost 4 * 2 V16QI move: 
original cost 4, split cost 4 * 4 V8HI move: original cost 4, split cost 4 * 4 V4SI move: original cost 4, split cost 4 * 4 V2DI move: original cost 4, split cost 4 * 4 V4HF move: original cost 4, split cost 4 * 2 V2SF move: original cost 4, split cost 4 * 2 V8HF move: original cost 4, split cost 4 * 4 V4SF move: original cost 4, split cost 4 * 4 V2DF move: original cost 4, split cost 4 * 4 Speed costs === SI move: from zero cost 4, from reg cost 4 DI move: original cost 4, split cost 4 * 2 TI move: original cost 4, split cost 4 * 4 EI move: original cost 4, split cost 4 * 6 OI move: original cost 4, split cost 4 * 8 CI move: original cost 4, split cost 4 * 12 XI move: original cost 4, split cost 4 * 16 DQ move: original cost 4, split cost 4 * 2 TQ move: original cost 4, split cost 4 * 4 UDQ move: original cost 4, split cost 4 * 2 UTQ move: original cost 4, split cost 4 * 4 DA move: original cost 4, split cost 4 * 2 TA move: original cost 4, split cost 4 * 4 UDA move: original cost 4, split cost 4 * 2 UTA move: original cost 4, split cost 4 * 4 DF move: original cost 4, split cost 4 * 2 XF move: original cost 4, split cost 4 * 3 DD move: original cost 4, split cost 4 * 2 TD move: original cost 4, split cost 4 * 4 CSI move: original cost 4, split cost 4 * 2 CDI move: original cost 4, split cost 4 * 4 CTI move: original cost 4, split cost 4 * 8 CEI move: original cost 4, split cost 4 * 12 COI move: original cost 4, split cost 4 * 16 CCI move: original cost 4, split cost 4 * 24 CXI move: original cost 4, split cost 4 * 32 SC move: original cost 4, split cost 4 * 2 DC move: original cost 4, split cost 4 * 4 XC move: original cost 4, split cost 4 * 6 V8QI move: original cost 4, split cost 4 * 2 V4HI move: original cost 4, split cost 4 * 2 V2SI move: original cost 4, split cost 4 * 2 V16QI move: original cost 4, split cost 4 * 4 V8HI move: original cost 4, split cost 4 * 4 V4SI move: original cost 4, split cost 4 * 4 V2DI move: original cost 4, split cost 4 * 4 V4HF 
move: original cost 4, split cost 4 * 2 V2SF move: original cost 4, split cost 4 * 2 V8HF move: original cost 4, split cost 4 * 4 V4SF move: original cost 4, split cost 4 * 4 V2DF move: original cost 4, split cost 4 * 4 The original MOVE insn with multi-word mode has a lower cost than the split insns, which prevents GCC from splitting. The root cause is that thumb1_rtx_costs/thumb1_size_rtx_costs does not handle SET/ASHIFT/ASHIFTRT/LSHIFTRT/ROTATERT patterns with multi-word mode, as rtx_cost does. I am working on this and will send a patch.
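To make the effect of these numbers concrete, here is a minimal model of the decision. It assumes the pass splits a multi-word move when its reported cost is at least the summed cost of the word-sized moves; the exact comparison used in lower-subreg.c may differ, and should_split_move is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the split decision, not GCC code.
   original_cost: cost the backend reports for the whole multi-word move.
   word_cost:     cost of one word-sized move.
   nwords:        number of words in the mode (the "* N" in the dump).
   Assumption: split when the whole move is no cheaper than the pieces.  */
bool should_split_move(int original_cost, int word_cost, int nwords)
{
    assert(nwords > 1 && word_cost > 0);
    return original_cost >= word_cost * nwords;
}
```

With the dumped values for DI (original cost 4, split cost 4 * 2) the model returns false, so the move stays whole. If the backend reported 8 for the DI move, it would return true.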
[Bug target/56102] Wrong rtx cost calculated for Thumb1
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56102 --- Comment #2 from bin.cheng 2013-01-25 07:25:34 UTC --- Created attachment 29270 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29270 correct test case The previous test case is not appropriate, because GCC won't split even with a correct thumb1_rtx_cost. The right test case is attached here.
[Bug rtl-optimization/56124] New: Redundant reload for loading from memory
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124 Bug #: 56124 Summary: Redundant reload for loading from memory Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com

For the test case below:

typedef __builtin_va_list __gnuc_va_list;
typedef __gnuc_va_list va_list;
struct _reent { int _stdout; };
struct _reent *_impure_ptr;
int bar (struct _reent *, int, const char *, va_list);
int foo(const char *fmt , ...)
{
  int ret;
  va_list ap;
  struct _reent *ptr = _impure_ptr;
  __builtin_va_start(ap,fmt);
  ret = bar (ptr, ((ptr)->_stdout), fmt, ap);
  __builtin_va_end(ap);
  return ret;
}

The dump of the reload pass is:

    1: NOTE_INSN_DELETED
    4: NOTE_INSN_BASIC_BLOCK 2
   28: r3:SI=sp:SI+0x10
      REG_EQUAL sp:SI+0x10
    2: r2:SI=[r3:SI++]
      REG_INC r3:SI
      REG_EQUIV [afp:SI]
   31: [sp:SI+0x10]=r2:SI
    3: NOTE_INSN_FUNCTION_BEG
    6: r2:SI=[`*.LC0']
      REG_EQUIV `_impure_ptr'
    7: r0:SI=[r2:SI]
    9: [sp:SI+0x4]=r3:SI
   10: r1:SI=[r0:SI]
   14: r2:SI=[sp:SI+0x10]
   16: r0:SI=call [`bar'] argc:0
   25: use r0:SI
   29: NOTE_INSN_DELETED

which could be:

    1: NOTE_INSN_DELETED
    4: NOTE_INSN_BASIC_BLOCK 2
   28: r3:SI=sp:SI+0x10
      REG_EQUAL sp:SI+0x10
    2: r2:SI=[r3:SI++]
      REG_INC r3:SI
      REG_EQUIV [afp:SI]
    3: NOTE_INSN_FUNCTION_BEG
    6: r1:SI=[`*.LC0']
      REG_EQUIV `_impure_ptr'
    7: r0:SI=[r1:SI]
    9: [sp:SI+0x4]=r3:SI
   10: r1:SI=[r0:SI]
   16: r0:SI=call [`bar'] argc:0
   25: use r0:SI
   29: NOTE_INSN_DELETED

It is obvious that insns 31 and 14 are generated/kept by a redundant reload.

The command line is:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os ...
[Bug rtl-optimization/56124] Redundant reload for loading from memory
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56124 --- Comment #1 from bin.cheng 2013-01-28 02:43:10 UTC ---
The root cause is in the ira:scan_one_insn function. It decreases the memory cost of a pseudo that is the target of a load from memory:

  if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
      && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
      && ((MEM_P (XEXP (note, 0)))
          || (CONSTANT_P (XEXP (note, 0))
              && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                                XEXP (note, 0))
              && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
      && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set))))
    {
      enum reg_class cl = GENERAL_REGS;
      rtx reg = SET_DEST (set);
      int num = COST_INDEX (REGNO (reg));

      COSTS (costs, num)->mem_cost
        -= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
      record_address_regs (GET_MODE (SET_SRC (set)),
                           MEM_ADDR_SPACE (SET_SRC (set)),
                           XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
                           frequency * 2);
      counted_mem = true;
    }

The problem is that if the source memory rtx (as in insn 2) has a side effect, the original load insn won't be eliminated, which causes the redundant reload. A patch will be sent for review.
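One possible shape for a guard, sketched as a toy model and not as the actual patch: only apply the memory-cost discount when the source memory reference has no side effects. The struct and function names below are hypothetical stand-ins for the rtx predicates used in the condition above.

```c
#include <stdbool.h>

/* Toy stand-in for the properties the condition inspects; this is not
   GCC's rtx representation.  */
struct load_insn {
    bool dest_is_reg;             /* REG_P (SET_DEST (set)) */
    bool src_is_mem;              /* MEM_P (SET_SRC (set)) */
    bool has_reg_equiv_mem_note;  /* REG_EQUIV note that is a MEM */
    bool src_has_side_effects;    /* e.g. the post-increment in insn 2 */
};

/* Returns true when discounting the pseudo's memory cost looks safe
   under the assumption stated above: a load whose address has a side
   effect cannot simply be dropped, so it gets no discount.  */
bool may_discount_mem_cost(const struct load_insn *insn)
{
    return insn->dest_is_reg
        && insn->src_is_mem
        && insn->has_reg_equiv_mem_note
        && !insn->src_has_side_effects;
}
```

In this model, a plain load with a REG_EQUIV memory note still gets the discount, while a load like insn 2 (r2:SI=[r3:SI++]) does not.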
[Bug tree-optimization/56139] New: unmodified static data could go in .rodata, not .data
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56139 Bug #: 56139 Summary: unmodified static data could go in .rodata, not .data Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: tree-optimization AssignedTo: unassig...@gcc.gnu.org ReportedBy: amker.ch...@gmail.com

For the program below:

static int x[] = {1, 2, 3, 4};
void bar (int x);
int func(int i)
{
  int * const p = (int * const)&x;
  bar(p[i]);
  return 0;
}

built with:

arm-none-eabi-gcc -mthumb -mcpu=cortex-m0 -Os ...

The generated assembly code is:

    .text
    .align 1
    .global func
    .code 16
    .thumb_func
    .type func, %function
func:
    push  {r3, lr}
    ldr   r3, .L2
    lsl   r0, r0, #2
    ldr   r0, [r0, r3]
    bl    bar
    @ sp needed for prologue
    mov   r0, #0
    pop   {r3, pc}
.L3:
    .align 2
.L2:
    .word .LANCHOR0
    .size func, .-func
    .data
    .align 2
    .set .LANCHOR0,. + 0
    .type x, %object
    .size x, 16
x:
    .word 1
    .word 2
    .word 3
    .word 4

while GCC 4.6 puts x in .rodata.
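As a source-level workaround, separate from the optimization this report asks for, declaring the array const normally lets the compiler emit it in a read-only section without having to prove that it is never written. A minimal sketch follows; bar is given a trivial body and the last_seen variable is added purely so that the example is self-contained.

```c
/* Declared const up front, so the compiler does not need to discover
   that the data is never modified.  */
static const int x[] = {1, 2, 3, 4};

/* Illustrative stub: in the report bar() is only declared.  It records
   its argument here so the call has an observable effect.  */
static int last_seen;

void bar(int v)
{
    last_seen = v;
}

int func(int i)
{
    const int *const p = x;
    bar(p[i]);
    return 0;
}
```

Checking the section of x in the assembly output (built with -S as in the report) shows whether the target places it in .rodata.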
[Bug target/53090] suboptimal ivopt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53090 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #2 from bin.cheng ---
I tried the simple case: GCC doesn't work as expected on x86_64, but x86 is fine. I think there are several issues in IVOPT causing this. The first issue is that IVOPT is too conservative when representing an iv_use with an iv_cand whose type has smaller precision. Consider the use/cand below:

use 0
  address
  in statement t_mp_11 = *_10;
  at position *_10
  type int *
  base perm_9(D) + 4
  step 4
  base object (void *) perm_9(D)
  related candidates
candidate 5 (important)
  var_before ivtmp.8
  var_after ivtmp.8
  incremented before exit test
  type unsigned int
  base 1
  step 1
candidate 6 (important)
  original biv
  type int
  base 1
  step 1

Use 0 is in type "int *", which has precision 64 on x86_64; cand is in type "int", which has precision 32 on x86_64. In function get_computation_cost_at, there is the code below:

  if (TYPE_PRECISION (utype) > TYPE_PRECISION (ctype))
    {
      /* We do not have a precision to express the values of use.  */
      return infinite_cost;
    }

But this is too conservative, because the loop runs "(j-i)/2" times, which can be expressed by the candidate even though the candidate has a smaller type than the iv_use. We should add some code here that checks the loop niters against the candidate's coverage. For example, the generated assembly then changes into:

.L14:
    movl    (%rdx), %edi
    movslq  %eax, %rcx
    addl    $1, %eax
    movl    (%r15,%rcx,4), %esi
    subq    $4, %rdx
    movl    %edi, (%r15,%rcx,4)
    movl    %r8d, %ecx
    subl    %eax, %ecx
    movl    %esi, 4(%rdx)
    cmpl    %ecx, %eax
    jl      .L14

Now the original candidate is chosen as rcs for the original induction variable "i". Unfortunately, there are other issues that prevent IVOPT from choosing the right candidate for the original induction variable "j". I will keep looking into it to see what's going on.
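The kind of coverage check meant here can be sketched in isolation. The helper below is hypothetical and works on plain unsigned integers and not on trees: it asks whether a candidate of a given precision, starting at base and advancing by step, can run for niters iterations without wrapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of tree-ssa-loop-ivopts.c.
   cand_bits: precision of the candidate's (unsigned) type, 1..64.
   base/step: initial value and per-iteration increment of the candidate.
   niters:    number of iterations the loop is known to run.
   Returns true when base + niters * step still fits in cand_bits.  */
bool cand_covers_niters(unsigned cand_bits, uint64_t base,
                        uint64_t step, uint64_t niters)
{
    assert(cand_bits >= 1 && cand_bits <= 64);

    uint64_t max = (cand_bits == 64)
        ? UINT64_MAX
        : ((UINT64_C(1) << cand_bits) - 1);

    if (base > max)
        return false;
    if (step == 0)
        return true;

    /* Dividing avoids overflow in niters * step.  */
    return niters <= (max - base) / step;
}
```

For the 32-bit candidates in the dump above (base 1, step 1), any iteration count that fits in the original "int" induction variable passes this test, which is why rejecting them purely on TYPE_PRECISION looks too strict.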
[Bug tree-optimization/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 --- Comment #10 from bin.cheng --- Patch sent at http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00857.html , but it needs to wait for stage 1. I will xfail the test for now.
[Bug tree-optimization/60363] [4.9/4.10 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 --- Comment #15 from bin.cheng --- Should be fixed now.
[Bug target/61367] New: Annoying rtx cost information in middle end dumps on arm/aarch64 targets
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61367 Bug ID: 61367 Summary: Annoying rtx cost information in middle end dumps on arm/aarch64 targets Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: amker.cheng at gmail dot com Created attachment 32877 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32877&action=edit zipped dump files.

Given a simple program like:

#define LEN (32000)
__attribute__((aligned(16))) float a[LEN],b[LEN];
int s174 (int M)
{
  for (int i = 0; i < M; i++)
    {
      a[i+M] = a[i] + b[i];
    }
  return 0;
}

Build with -O2/-O3 and the -fdump-tree-all-details -fdump-rtl-all-details options. The middle-end dump files then contain a lot of rtx cost information, which clutters the actual dump information. The dump files of ivopt/cse2 are attached to show this annoying problem.
[Bug target/61411] [NEON] ICE in reload_cse_simplify_operands, at postreload.c:411
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61411 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com, ||mshawcroft at gcc dot gnu.org, ||vmakarov at gcc dot gnu.org --- Comment #1 from bin.cheng --- The patch can fix the issue, but the question is why GCC/LRA generated the register-indexed ([reg+reg]) addressing mode for V8HImode in the first place, since without this patch the address expression is illegal and shouldn't be generated. I haven't looked into LRA's code and am not sure whether this patch is just covering up the problem. I have also added Marcus and Vlad to the CC list.
[Bug target/61411] [NEON] ICE in reload_cse_simplify_operands, at postreload.c:411
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61411 --- Comment #3 from bin.cheng --- Then I think it's a latent bug in LRA. It should consult the back-end about legitimized address expressions.
[Bug tree-optimization/60280] New: gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280 Bug ID: 60280 Summary: gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure. Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: amker.cheng at gmail dot com gcc.target/arm/ivopts-2.c is like: /* { dg-do assemble } */ /* { dg-options "-Os -fdump-tree-ivopts -save-temps" } */ extern void foo2 (short*); void tr4 (short array[], int n) { int x; if (n > 0) for (x = 0; x < n; x++) foo2 (&array[x]); } /* { dg-final { scan-tree-dump-times "PHI
[Bug tree-optimization/60280] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280 --- Comment #1 from bin.cheng --- It's caused by patch at (revision r198333): http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01530.html After patching, forwarder basic block 6 in below dump didn't get removed: tr4 (short int * array, int n) { int x; unsigned int x.0; unsigned int _7; short int * _9; : if (n_4(D) > 0) goto ; else goto ; : : # x_14 = PHI x.0_6 = (unsigned int) x_14; _7 = x.0_6 * 2; _9 = array_8(D) + _7; foo2 (_9); x_11 = x_14 + 1; if (x_11 < n_4(D)) goto ; else goto ; : return; : goto ; } After expanding, pre-header is filled with pre-loop initialization instructions and the problem turns into a cfglayout problem: 5: NOTE_INSN_BASIC_BLOCK 2 2: r115:SI=r0:SI REG_DEAD r0:SI 3: NOTE_INSN_DELETED 4: NOTE_INSN_FUNCTION_BEG 7: {cc:CC=cmp(r1:SI,0);r116:SI=r1:SI;} REG_DEAD r1:SI 8: pc={(cc:CC>0)?L24:pc} REG_DEAD cc:CC REG_BR_PROB 0x1f98 ;; succ: 4 ;; 5 29: L29: 13: NOTE_INSN_BASIC_BLOCK 3 14: r0:SI=r110:SI 15: call [`foo2'] argc:0 REG_DEAD r0:SI 16: r110:SI=r110:SI+0x2 18: cc:CC=cmp(r110:SI,r114:SI) 19: pc={(cc:CC!=0)?L29:pc} REG_DEAD cc:CC REG_BR_PROB 0x2333 ;; succ: 3 ;; 5 24: L24: 25: NOTE_INSN_BASIC_BLOCK 4 26: r110:SI=r115:SI REG_DEAD r115:SI 27: NOTE_INSN_DELETED 28: r114:SI=r116:SI*0x2+r110:SI REG_DEAD r116:SI ;; succ: 3 32: L32: 33: NOTE_INSN_BASIC_BLOCK 5 ;; succ: EXIT After outof_cfglayout, a jump (in bb3) to exit block is introduced: 5: NOTE_INSN_BASIC_BLOCK 2 3: NOTE_INSN_DELETED 4: NOTE_INSN_FUNCTION_BEG 7: {cc:CC=cmp(r1:SI,0);r1:SI=r1:SI;} 8: pc={(cc:CC>0)?L24:pc} REG_BR_PROB 0x1f98 ;; succ: 6 ;; 3 55: NOTE_INSN_BASIC_BLOCK 3 56: pc=L32 ;; succ: 7 29: L29: 13: NOTE_INSN_BASIC_BLOCK 4 14: r0:SI=r4:SI 15: call [`foo2'] argc:0 16: r4:SI=r4:SI+0x2 18: cc:CC=cmp(r4:SI,r5:SI) 19: pc={(cc:CC!=0)?L29:pc} REG_BR_PROB 0x2333 ;; succ: 4 ;; 5 58: NOTE_INSN_BASIC_BLOCK 5 59: pc=L32 ;; succ: 7 24: L24: 25: NOTE_INSN_BASIC_BLOCK 6 26: r4:SI=r0:SI 27: NOTE_INSN_DELETED 28: r5:SI=r1:SI*0x2+r4:SI 61: pc=L29 ;; 
succ: 4 32: L32: 33: NOTE_INSN_BASIC_BLOCK 7 ;; succ: EXIT Ideally, basic block reordering could fix this, but before that, pass pro_and_epilogue threads the jump in bb3 to a direct return instruction, and bb reordering can do nothing any more. So: 1) Unless we can teach passes before pro_and_epilogue to do some bb reordering work, it's inappropriate to fix it on RTL. 2) It's natural to fix it on GIMPLE, but that is disruptive because the CFG code is shared by all GIMPLE (and even RTL) optimizers. Yet this method makes more sense than 1). I am trying to work out a less intrusive patch for stage 4.
[Bug tree-optimization/60280] [4.9 Regression] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280 --- Comment #3 from bin.cheng --- I think 4_8 is OK for this case. At least it doesn't have http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01530.html committed, if I am right.
[Bug tree-optimization/60280] [4.9 Regression] gcc.target/arm/ivopts.c and gcc.target/arm/ivopts-2.c failed caused by preserving loop structure.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60280 bin.cheng changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #6 from bin.cheng --- Patch applied. Fixed I think.
[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 bin.cheng changed: What|Removed |Added CC||amker.cheng at gmail dot com --- Comment #2 from bin.cheng --- Created attachment 32315 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32315&action=edit tar of dump files.
[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 --- Comment #3 from bin.cheng --- After patching 208165, there are two more jump threading opportunities for the dom1 pass. Jump threading is working correctly; the interesting question is why there were no such opportunities before patching. I attached the related dump files with and without the patch. It seems the dumps before the vrp1 pass are pretty similar, while after vrp1 the dump with the patch shows the two additional jump threading opportunities. In other words, without the patch they are somehow already fixed (not introduced) in pass vrp1. For now I can just change ssa-dom-thread-4.c to handle the two jump threadings, or should I first look into vrp to find the difference?
[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 --- Comment #4 from bin.cheng --- Although it may be irrelevant, I found that the loop's latch doesn't get updated after removing the forwarder latch basic block. The previous patch only catches function remove_forwarder_block, but remove_forwarder_block_with_phi should be handled too. I will send a patch picking this up.
[Bug regression/60363] [4.9 Regression]: logical_op_short_circuit, gcc.dg/tree-ssa/ssa-dom-thread-4.c scan-tree-dump-times dom1 "Threaded" 4
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60363 --- Comment #5 from bin.cheng --- Vrp1 generates below code: : if (b_elt_11(D) != 0B) goto ; else goto ; : # kill_elt_10 = PHI goto ; : kill_elt_14 = kill_elt_2->next; : # kill_elt_2 = PHI if (kill_elt_2 != 0B) goto ; else goto ; : _12 = kill_elt_2->indx; _13 = b_elt_11(D)->indx; if (_12 < _13) goto ; else goto ; ... : goto ; : # kill_elt_41 = PHI <0B(6)> if (b_elt_11(D) != 0B) goto ; else goto ; The whole bb 19 is unnecessary since we know "b_elt_11(D) != 0" holds.