https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-11-05
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
You can follow the cost model with -fdump-tree-cunroll-details:

Estimating sizes for loop 1
 BB: 3, after_exit: 0
  size:   0 _1 = (long unsigned int) i_12;
   Induction variable computation will be folded away.
  size:   1 _2 = _1 * 4;
   Induction variable computation will be folded away.
  size:   1 _3 = 4026531840B + _2;
   Induction variable computation will be folded away.
  size:   1 *_3 ={v} 3;
  size:   1 i_8 = i_12 + 1;
   Induction variable computation will be folded away.
  size:   1 ivtmp_4 = ivtmp_10 - 1;
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_4 != 0)
   Exit condition will be eliminated in peeled copies.
   Exit condition will be eliminated in last copy.
   Constant conditional.
 BB: 5, after_exit: 1
size: 7-6, last_iteration: 7-6
  Loop size: 7
  Estimated size after unrolling: 7

Basically it assumes the store has cost 1 and the jump we elide has cost 2
(size-wise that is probably not a good estimate for a compare against zero;
otherwise the 2 covers the compare and the jump).  All this costing is of
course simplified.
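For reference, a minimal C loop of the shape being costed above might look
like the following sketch, reconstructed from the GIMPLE in the dump rather
than taken from the testcase attached to the PR; the function name and the
11-iteration count are guesses based on the symbol in the disassembly
further below:

void
do_stuff_11iter (void)
{
  /* 0xf0000000 is the 4026531840B base address seen in the GIMPLE above;
     the volatile qualifier corresponds to the *_3 ={v} 3 store.  */
  volatile unsigned int *p = (volatile unsigned int *) 0xf0000000;
  for (unsigned int i = 0; i < 11; i++)
    p[i] = 3;
}

Compiling something like this with -Os -fdump-tree-cunroll-details should
produce a cunroll dump of the above shape.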
Anyways, the real issue is that we apply

     Loop body is likely going to simplify further, this is difficult
     to guess, we just decrease the result by 1/3.  */
  ...
     unr_insns = unr_insns * 2 / 3;

even when optimizing for size (after all, it is only a guess).  So a more
conservative estimate would be to do the following, which in turn might
regress more complex cases where followup CSE would eliminate the unrolled
stores, for example by eliding a temporary local array.  So it's all about
heuristics ...

With the patch below the cutoff happens at 7.  On x86_64 the looping
function is 32 bytes while the unrolled variant is 54 bytes.  I think a
better target for optimizing would be the RTL side; on x86_64 we end up
with

000000000000001a <do_stuff_11iter>:
  1a:   b8 00 00 00 f0          mov    $0xf0000000,%eax
  1f:   c7 00 03 00 00 00       movl   $0x3,(%rax)
  25:   c7 40 04 03 00 00 00    movl   $0x3,0x4(%rax)
  2c:   c7 40 08 03 00 00 00    movl   $0x3,0x8(%rax)
  33:   c7 40 0c 03 00 00 00    movl   $0x3,0xc(%rax)
  3a:   c7 40 10 03 00 00 00    movl   $0x3,0x10(%rax)
  41:   c7 40 14 03 00 00 00    movl   $0x3,0x14(%rax)
  48:   c7 40 18 03 00 00 00    movl   $0x3,0x18(%rax)
  4f:   c3                      retq

where loading 0x3 into a register first would shrink it to 31 bytes:

000000000000001a <do_stuff_11iter>:
  1a:   b8 00 00 00 f0          mov    $0xf0000000,%eax
  1f:   bb 03 00 00 00          mov    $0x3,%ebx
  24:   89 18                   mov    %ebx,(%rax)
  26:   89 58 04                mov    %ebx,0x4(%rax)
  29:   89 58 08                mov    %ebx,0x8(%rax)
  2c:   89 58 0c                mov    %ebx,0xc(%rax)
  2f:   89 58 10                mov    %ebx,0x10(%rax)
  32:   89 58 14                mov    %ebx,0x14(%rax)
  35:   89 58 18                mov    %ebx,0x18(%rax)
  38:   c3                      retq

and I guess using rep stosl might be even smaller (at the expense of speed,
of course).  I'm sure arc can store to a register address as well.

In the end I'm not really proposing the patch below, but getting rid of
applying the simple 2/3 factor might be something to consider in general.

diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index c2953059fb9..9d5d4eede9c 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -395,11 +395,12 @@ tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel,
    peeling.
 
    Loop body is likely going to simplify further, this is difficult
-   to guess, we just decrease the result by 1/3.  */
+   to guess, we just decrease the result by 1/3 when not optimizing
+   for size.  */
 
 static unsigned HOST_WIDE_INT
 estimated_unrolled_size (struct loop_size *size,
-			 unsigned HOST_WIDE_INT nunroll)
+			 unsigned HOST_WIDE_INT nunroll, bool size_p)
 {
   HOST_WIDE_INT unr_insns = ((nunroll)
   			     * (HOST_WIDE_INT) (size->overall
@@ -408,7 +409,8 @@ estimated_unrolled_size (struct loop_size *size,
     unr_insns = 0;
   unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
 
-  unr_insns = unr_insns * 2 / 3;
+  if (!size_p)
+    unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
     unr_insns = 1;
 
@@ -791,7 +793,8 @@ try_unroll_loop_completely (struct loop *loop,
 
       unsigned HOST_WIDE_INT ninsns = size.overall;
       unsigned HOST_WIDE_INT unr_insns
-	= estimated_unrolled_size (&size, n_unroll);
+	= estimated_unrolled_size (&size, n_unroll,
+				   optimize_loop_for_size_p (loop));
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
 	  fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
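For what it's worth, here is a small stand-alone sketch (my own
reconstruction for illustration, not GCC code) that just redoes the
estimated_unrolled_size arithmetic with the numbers from the dump above
(body size 7 with 6 insns eliminated by peeling, same for the last
iteration), assuming the dumped loop iterates 11 times so n_unroll is 10,
and assuming that at -Os try_unroll_loop_completely roughly rejects
complete unrolling once the estimate exceeds the loop size of 7:

#include <stdio.h>

/* Mirror of the arithmetic in estimated_unrolled_size, including the
   size_p parameter added by the patch; body/last-iteration sizes are the
   dump's 7-6.  */
static long
estimate (long n_unroll, int size_p)
{
  long unr_insns = n_unroll * (7 - 6);  /* unrolled body copies */
  unr_insns += 7 - 6;                   /* last iteration */
  if (!size_p)
    unr_insns = unr_insns * 2 / 3;      /* "likely to simplify" guess */
  return unr_insns;
}

int
main (void)
{
  /* Unpatched: 11 * 2 / 3 == 7, which does not exceed the loop size of 7,
     so the 11-iteration loop is completely unrolled even at -Os.  */
  printf ("unpatched: %ld\n", estimate (10, 0));
  /* Patched at -Os: 11 > 7, so unrolling is rejected ...  */
  printf ("patched, 11 iterations: %ld\n", estimate (10, 1));
  /* ... and 6 * 1 + 1 == 7 is the largest estimate still accepted, which
     is where the "cutoff happens at 7" above comes from.  */
  printf ("patched, 7 iterations: %ld\n", estimate (6, 1));
  return 0;
}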