https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-11-05
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
You can follow the cost model with -fdump-tree-cunroll-details:

Estimating sizes for loop 1
 BB: 3, after_exit: 0
  size:   0 _1 = (long unsigned int) i_12;
   Induction variable computation will be folded away.
  size:   1 _2 = _1 * 4;
   Induction variable computation will be folded away.
  size:   1 _3 = 4026531840B + _2;
   Induction variable computation will be folded away.
  size:   1 *_3 ={v} 3;
  size:   1 i_8 = i_12 + 1;
   Induction variable computation will be folded away.
  size:   1 ivtmp_4 = ivtmp_10 - 1;
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_4 != 0)
   Exit condition will be eliminated in peeled copies.
   Exit condition will be eliminated in last copy.
   Constant conditional.
 BB: 5, after_exit: 1
size: 7-6, last_iteration: 7-6
  Loop size: 7
  Estimated size after unrolling: 7

Basically it assumes the store has cost 1 and the jump we elide has cost 2
(that's probably not a good size estimate for a compare against zero;
otherwise it stands for the compare plus the jump).  All this costing is
of course simplified.
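
For reference, the loop behind this dump is essentially a counted loop of
volatile 32-bit stores of 3 starting at address 0xf0000000 (that much is
visible in the IL above).  A sketch of such a testcase, where the function
name and the trip count are my guesses rather than taken from the PR:

#define REGS ((volatile unsigned int *) 0xf0000000)

void
do_stuff (void)
{
  /* Volatile stores, so the loop cannot be turned into a memset.  */
  for (unsigned int i = 0; i < 7; i++)
    REGS[i] = 3;
}

Compiling something like this with -Os -fdump-tree-cunroll-details gives
the per-statement accounting quoted above.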

Anyway, the real issue is that we apply

   Loop body is likely going to simplify further, this is difficult
   to guess, we just decrease the result by 1/3.  */
...
    unr_insns = unr_insns * 2 / 3;

even when optimizing for size (it is only a guess, after all).
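
To make that concrete, here is my own back-of-the-envelope sketch of the
estimate, plugging in the 7-6 numbers from the dump above (not GCC code,
just mirroring the arithmetic of estimated_unrolled_size):

#include <stdio.h>

/* size->overall = 7, eliminated_by_peeling = 6, and likewise for the
   last iteration, as in the "size: 7-6, last_iteration: 7-6" line.  */
static unsigned long
estimate (unsigned long n_unroll, int apply_two_thirds)
{
  unsigned long overall = 7, eliminated = 6;
  unsigned long last = 7, last_eliminated = 6;
  unsigned long unr = n_unroll * (overall - eliminated)
                      + (last - last_eliminated);
  if (apply_two_thirds)
    unr = unr * 2 / 3;
  return unr ? unr : 1;
}

int
main (void)
{
  for (unsigned long n = 4; n <= 12; n++)
    printf ("n_unroll=%lu  with 2/3: %lu  without: %lu\n",
            n, estimate (n, 1), estimate (n, 0));
  return 0;
}

With these numbers the raw estimate is n_unroll + 1, so the 2/3 factor lets
roughly 50% larger loops pass any given size threshold.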

So a more conservative estimate would be to do the following, which in
turn might regress some more complex cases where follow-up CSE would
eliminate the unrolled stores by eliding a temporary local array.
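
Something along these lines (my own illustration, not from the PR) is the
kind of case that guess tries to anticipate: after complete unrolling the
local array and all its stores can typically be optimized away entirely,
so the final size ends up far below the per-copy estimate.

int
sum_squares (void)
{
  int tmp[4], sum = 0;
  /* Looks like 4 stores plus 4 loads before unrolling...  */
  for (int i = 0; i < 4; i++)
    tmp[i] = i * i;
  for (int i = 0; i < 4; i++)
    sum += tmp[i];
  /* ...but once both loops are unrolled, FRE/DSE can elide tmp and
     fold the whole function to "return 14;".  */
  return sum;
}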

So it's all about heuristics ...

With the patch below the cutoff happens at 7.  On x86_64 the looping
function is 32 bytes while the unrolled variant is 54 bytes.
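
(The 54 and 31 byte figures can be read off the objdump output below: the
unrolled variant spans 0x1a up to and including the one-byte ret at 0x4f,
i.e. 0x50 - 0x1a = 0x36 = 54 bytes, and the register-based variant further
down ends with the ret at 0x38, i.e. 0x39 - 0x1a = 0x1f = 31 bytes.)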

I think a better target for optimizing would be the RTL side; on x86_64
we end up with

000000000000001a <do_stuff_11iter>:
  1a:   b8 00 00 00 f0          mov    $0xf0000000,%eax
  1f:   c7 00 03 00 00 00       movl   $0x3,(%rax)
  25:   c7 40 04 03 00 00 00    movl   $0x3,0x4(%rax)
  2c:   c7 40 08 03 00 00 00    movl   $0x3,0x8(%rax)
  33:   c7 40 0c 03 00 00 00    movl   $0x3,0xc(%rax)
  3a:   c7 40 10 03 00 00 00    movl   $0x3,0x10(%rax)
  41:   c7 40 14 03 00 00 00    movl   $0x3,0x14(%rax)
  48:   c7 40 18 03 00 00 00    movl   $0x3,0x18(%rax)
  4f:   c3                      retq   

where loading 0x3 to a register first would shrink it to 31 bytes:

000000000000001a <do_stuff_11iter>:
  1a:   b8 00 00 00 f0          mov    $0xf0000000,%eax
  1f:   bb 03 00 00 00          mov    $0x3,%ebx
  24:   89 18                   mov    %ebx,(%rax)
  26:   89 58 04                mov    %ebx,0x4(%rax)
  29:   89 58 08                mov    %ebx,0x8(%rax)
  2c:   89 58 0c                mov    %ebx,0xc(%rax)
  2f:   89 58 10                mov    %ebx,0x10(%rax)
  32:   89 58 14                mov    %ebx,0x14(%rax)
  35:   89 58 18                mov    %ebx,0x18(%rax)
  38:   c3                      retq   

and I guess using rep stosl might be even smaller (at the expense of
speed, of course).

I'm sure ARC can store to a register address as well.

In the end I'm not really proposing the patch below, but dropping the
simple 2/3 factor might be something to consider in general.

diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index c2953059fb9..9d5d4eede9c 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -395,11 +395,12 @@ tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel,
    peeling.

    Loop body is likely going to simplify further, this is difficult
-   to guess, we just decrease the result by 1/3.  */
+   to guess, we just decrease the result by 1/3 when not optimizing
+   for size.  */

 static unsigned HOST_WIDE_INT
 estimated_unrolled_size (struct loop_size *size,
-                        unsigned HOST_WIDE_INT nunroll)
+                        unsigned HOST_WIDE_INT nunroll, bool size_p)
 {
   HOST_WIDE_INT unr_insns = ((nunroll)
                             * (HOST_WIDE_INT) (size->overall
@@ -408,7 +409,8 @@ estimated_unrolled_size (struct loop_size *size,
     unr_insns = 0;
   unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;

-  unr_insns = unr_insns * 2 / 3;
+  if (!size_p)
+    unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
     unr_insns = 1;

@@ -791,7 +793,8 @@ try_unroll_loop_completely (struct loop *loop,

          unsigned HOST_WIDE_INT ninsns = size.overall;
          unsigned HOST_WIDE_INT unr_insns
-           = estimated_unrolled_size (&size, n_unroll);
+           = estimated_unrolled_size (&size, n_unroll,
+                                      optimize_loop_for_size_p (loop));
          if (dump_file && (dump_flags & TDF_DETAILS))
            {
              fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
