https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67783

            Bug ID: 67783
           Summary: quadratic time consumption in IPA inlining with -O1
                    and higher
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: ipa
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skvadrik at gmail dot com
  Target Milestone: ---

Created attachment 36423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36423&action=edit
1.c, real1.c

Consider the following code (1.c):

    static void f (const char * const * p) {}
    void g (const char * p)
    {
    loop:
        if (*p++) goto loop;
        if (*p++) goto loop;
        // ...
        if (*p++) goto loop;
        f (&p);
    }

Compilation time with -O1 grows quadratically in the number of repetitions of
statement 'if (*p++) goto loop;'. For example, 80 repetitions result in 11s:
    $ time gcc -c 1.c -O1

    real    0m11.134s
    user    0m11.097s
    sys     0m0.013s

96 repetitions result in 30s, 120 repetitions result in 1m20s and so on. See
file '1.c' in attach. Also see file 'real1.c' in attach for an example of
real-world program (autogenerated by re2c) from which '1.c' was cut down (it
exhibits quadratic behaviour with -O2 and higher).

I tried to analyze a bit. Here's what I found:

Time report:
    $ gcc -c 1.c -O1 -ftime-report

    Execution times (seconds)
     phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall    1066 kB (29%) ggc
     phase opt and generate  :  11.44 (100%) usr   0.00 ( 0%) sys  11.47 (100%)
wall    2385 kB (66%) ggc
     ipa inlining heuristics :  11.34 (99%) usr   0.00 ( 0%) sys  11.36 (99%)
wall      12 kB ( 0%) ggc
     df reaching defs        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     df live&initialized regs:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%)
wall       0 kB ( 0%) ggc
     register information    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     alias stmt walking      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     tree CFG cleanup        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     tree split crit edges   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%)
wall      37 kB ( 1%) ggc
     scev constant prop      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%)
wall      33 kB ( 1%) ggc
     tree SSA verifier       :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     tree STMT verifier      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall       0 kB ( 0%) ggc
     dominance computation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%)
wall       0 kB ( 0%) ggc
     loop init               :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%)
wall    1595 kB (44%) ggc
     branch prediction       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%)
wall       6 kB ( 0%) ggc
     initialize rtl          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%)
wall      12 kB ( 0%) ggc
     TOTAL                 :  11.44             0.00            11.48          
    3617 kB
    Extra diagnostic checks enabled; compiler may run slowly.
    Configure with --enable-checking=release to disable checks.

So, it must be IPA inlining.

First,`valgrind --tool=callgrind` showed that 99% of time was spent in
'estimate_function_body_sizes' from 'gcc/gcc/ipa-inline-analysis.c'.I further
tracked it down to loop on line 2759:
   
https://github.com/gcc-mirror/gcc/blob/master/gcc/ipa-inline-analysis.c#L2759
and inner loop on line 2791:
   
https://github.com/gcc-mirror/gcc/blob/master/gcc/ipa-inline-analysis.c#L2791
This nested loop causes quadratic time consumption.

Second, if I comment out 'f (&p);' in the end of '1.c', file compiles
instantly, but the number of iterations in nested loop remains quadratic. So
the problem is not in quadratic number of iteration, it's in the loop's body.
Note that the function 'f' is empty and its argument is immutable, so it cannot
change 'p' and influence control flow.

This is what `perf record`/`perf report` shows:

  14.38%  cc1      cc1                [.] find_var_scev_info
  13.58%  cc1      cc1                [.] analyze_scalar_evolution_1
  11.25%  cc1      cc1                [.] loop_preheader_edge
   8.56%  cc1      cc1                [.] instantiate_scev_r
   7.83%  cc1      cc1                [.] interpret_rhs_expr
   6.82%  cc1      cc1                [.] htab_find_slot_with_hash
   6.65%  cc1      cc1                [.] analyze_scalar_evolution
   5.63%  cc1      cc1                [.] useless_type_conversion_p
   5.62%  cc1      cc1                [.] chrec_convert_1
   3.06%  cc1      cc1                [.] compute_scalar_evolution_in_loop
   2.98%  cc1      cc1                [.]
chrec_contains_symbols_defined_in_loop
   1.90%  cc1      cc1                [.] instantiate_scev
   1.80%  cc1      cc1                [.] eq_idx_scev_info
   1.57%  cc1      cc1                [.] flow_loop_nested_p
   1.36%  cc1      cc1                [.] dominated_by_p
   1.16%  cc1      cc1                [.] is_gimple_min_invariant
   1.02%  cc1      cc1                [.] compute_overall_effect_of_inner_loop
   0.88%  cc1      cc1                [.] flow_bb_inside_loop_p
   0.83%  cc1      cc1                [.] extract_ops_from_tree_1
   0.51%  cc1      cc1                [.] superloop_at_depth

`valgrind --tool=callgrind` suggests that 'analyze_scalar_evolution' was called
~50000000 times.

Reply via email to