https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93404

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 47701
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47701&action=edit
inliner size stats parser

It took me a while to find a reasonable way to analyze code size regressions
caused by the inliner with -flto. The attached script parses the output of
objdump -dl and maps code size back to the original functions.
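
The attachment is the actual script; the core idea is roughly the following
sketch (a simplified approximation, not the attachment verbatim): remember the
last "function():" label printed by -l and add up the opcode bytes of every
disassembly line under it.

# Simplified sketch of parseobjdump.awk: per-function byte counts from
# objdump -dl output.  Label lines look like "gen_rtx_fmt_ee():" and
# disassembly lines like "  4036ae:<TAB>bf 18 00 00 00 <TAB>mov ...".
/^[_A-Za-z].*\(\):$/ { fn = $0; next }       # remember the current function
/^[ \t]*[0-9a-f]+:\t/ {                      # a disassembly line
        split($0, f, "\t")                   # f[2] holds the opcode bytes
        b = split(f[2], bytes, " ")          # count the hex byte tokens
        size[fn] += b
        total += b
}
END {
        for (fn in size)
                print size[fn] "\t" fn
        print total                          # total bytes accounted for
}

Its output is what the sort -n | tail pipelines below show.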

Here is an example of the analysis for SPEC2k6 GCC, which grows by 16% compared
to GCC 9 and is the main contributor to the SPECint2006 regression (0.42MB out
of 1.11MB).

Building binaries with and w/o auto-inlining:

$ /aux/hubicka/trunk-install/bin/gcc -O2 -g -flto=auto *.c -o gcc-inline -I ./
-DSPEC_CPU -DSPEC_CPU_LP64
$ /aux/hubicka/trunk-install/bin/gcc -O2 -g -flto=auto *.c -o gcc-noinline -I
./ -DSPEC_CPU -DSPEC_CPU_LP64 -fno-inline-functions
$ size -A gcc-inline | grep text
.text            2459681   4203552
$ size -A gcc-noinline | grep text
.text            2126369   4203552

(2459681 / 2126369 is about 1.157, i.e. a 15.7% regression, so indeed most of
the regression relative to GCC 9 is due to inlining)

$ objdump -dl gcc-inline | awk -f ~/parseobjdump.awk | sort -n | tail -15
12897   result_ready_cost():
14162   xmalloc():
14517   find_reloads.constprop.0():
14525   gen_rtx_fmt_e0():
14841   insn_extract():
15703   rtx_equal_p():
17105   yyparse_1.isra.0():
18033   recog_26():
22079   expand_expr():
22659   fold():
24046   recog_16():
26606   end_sequence():
31762   gen_rtx_fmt_e():
78479   gen_rtx_fmt_ee():
2511153

$ objdump -dl gcc-noinline | awk -f ~/parseobjdump.awk | sort -n | tail -15
10220   expand_binop():
11667   result_ready_cost():
11782   recog_30.constprop.0():
11939   split_2():
12230   try_combine():
13423   cse_insn():
14740   simplify_comparison():
14829   insn_extract():
14832   find_reloads.constprop.0():
16908   yyparse_1.isra.0():
18101   recog_26():
23255   expand_expr():
23450   fold():
24171   recog_16():
2167901


So the largest difference is the inlining of gen_rtx_fmt_ee. It is still the
same as in mainline:

     34 rtx
     35 gen_rtx_fmt_ee (code, mode, arg0, arg1)
     36      RTX_CODE code;
     37      enum machine_mode mode;
     38      rtx arg0;
     39      rtx arg1;
     40 {
     41   rtx rt;
     42   rt = ggc_alloc_rtx (2);
     43   memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));
     44 
     45   PUT_CODE (rt, code);
     46   PUT_MODE (rt, mode);
     47   XEXP (rt, 0) = arg0;
     48   XEXP (rt, 1) = arg1;
     49 
     50   return rt;
     51 }

so it is a pretty basic constructor that makes sense to inline.  The generated
code differs by context but typically looks something like this (which is a bit
odd):
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036ae:       bf 18 00 00 00          mov    $0x18,%edi
init_reload():
/aux/hubicka/403.gcc/src/reload1.c:500
  4036b3:       41 0f 95 c6             setne  %r14b
  4036b7:       49 89 c5                mov    %rax,%r13
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036ba:       e8 d1 64 0d 00          callq  4d9b90 <ggc_alloc>
init_reload():
/aux/hubicka/403.gcc/src/reload1.c:500
  4036bf:       41 83 c6 04             add    $0x4,%r14d
plus_constant_wide():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036c3:       be 04 00 00 00          mov    $0x4,%esi
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:42
  4036c8:       c7 40 03 00 00 00 00    movl   $0x0,0x3(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036cf:       48 89 c7                mov    %rax,%rdi
/aux/hubicka/403.gcc/src/genrtl.c:42
  4036d2:       c6 40 07 00             movb   $0x0,0x7(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:44
  4036d6:       b8 4b 00 00 00          mov    $0x4b,%eax
  4036db:       66 89 07                mov    %ax,(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:45
  4036de:       44 88 77 02             mov    %r14b,0x2(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:46
  4036e2:       4c 89 6f 08             mov    %r13,0x8(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:47
  4036e6:       4c 89 67 10             mov    %r12,0x10(%rdi)
So that is about 49 bytes of code, which does have a good chance of being
optimized together with the surrounding code.

Adding noinline to gen_rtx_fmt_ee indeed saves 53k of binary, which seems
reasonable given that the inlined code was reported to be 75k and offlining
leads to a call sequence which is 5 bytes itself plus collateral damage on the
surrounding code.
This is unlike our gen_* expanders, which just produce a lot of construction
code: there is not much to optimize there and they do not happen to be
performance critical enough for inlining to pay back. The inliner does not
understand that.
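
(By "adding noinline" I mean marking the function in the 403.gcc source, along
the lines of the sketch below; the exact edit may differ slightly.)

/* Sketch of the annotation in genrtl.c; body unchanged from the quote above. */
__attribute__ ((noinline)) rtx
gen_rtx_fmt_ee (code, mode, arg0, arg1)
     RTX_CODE code;
     enum machine_mode mode;
     rtx arg0;
     rtx arg1;
{
  /* ... as quoted above ... */
}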

The next two most copied functions:
26606   end_sequence():
31762   gen_rtx_fmt_e():
are clearly the same story.

Adding noinline to all of gen_rtx_fmt* and end_sequence reduces code size to
2340961, so about a 5% reduction, or 1/3 of the regression (118720 bytes out of
the 333312 byte difference). I suppose a lot of the remaining growth comes from
similar inlining decisions.

11808   incdec_operand():
11869   try_combine():
11937   split_2():
12849   cse_insn():
12897   result_ready_cost():
14229   xmalloc():
14601   find_reloads.constprop.0():
14841   insn_extract():
15696   rtx_equal_p():
17105   yyparse_1.isra.0():
18033   recog_26():
22035   expand_expr():
22659   fold():
24046   recog_16():
2389543

Inlining of rtx_equal_p seems to be the next case.

So most of the growth of GCC seems to be due to inlining "constructors". I can
see the following options:
 1) let it be
 2) introduce a parameter limiting auto-inlining of functions which are called
many times (i.e. would cause more than a given percentage growth of the whole
program), which may be hard to tune; a rough sketch of such a check follows
below
 3) restrict auto-inlining of non-leaf functions with -O2

I do not particularly fancy any of those, but 3 is probably the most
interesting.
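
To illustrate what 2) would mean, here is a rough standalone sketch of the kind
of check involved (hypothetical names and threshold, not actual inliner code):

/* Hypothetical illustration of option 2): stop auto-inlining a function once
   the estimated growth from inlining all of its calls exceeds a fraction of
   the whole program.  Names and the 1% threshold are made up.  */
#include <stdbool.h>

bool
auto_inline_ok_p (long callee_size, long call_overhead, long ncalls,
                  long program_size)
{
  long growth = ncalls * (callee_size - call_overhead);
  long limit = program_size / 100;   /* would become a --param style knob */
  return growth <= limit;
}

Plugging in the numbers above, the ~78k of gen_rtx_fmt_ee copies is about 3% of
the 2.5MB .text, so such a cap would have stopped it; the hard part is picking
a limit that does not also block the profitable cases.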

Comparing to clang:
$ size -A gcc-clang  | grep text
.text             3100881   4203600
20142   expand_expr():
21720   gen_rtx_fmt_e():
22181   recog_26():
22402   fold():
23286   extract_constrain_insn_cached():
24681   gen_rtx_CONST_INT():
24980   extract_insn_cached():
25915   gen_rtx_fmt_i0():
25975   start_sequence():
34935   recog_16():
35293   xmalloc():
40560   nonimmediate_operand():
56208   gen_rtx_fmt_ee():
99014   register_operand():
3167996

So clang additionally inlines the operand predicates (which are called many
times). However, it seems to do better on the gen_rtx_fmt_ee codegen:
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:43
  67c72f:       bf 18 00 00 00          mov    $0x18,%edi
  67c734:       e8 a7 1a eb ff          callq  52e1e0 <ggc_alloc>
  67c739:       49 89 c6                mov    %rax,%r14
/aux/hubicka/403.gcc/src/genrtl.c:44
  67c73c:       48 c7 00 00 00 00 00    movq   $0x0,(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:46
  67c743:       8d 85 4b 00 04 00       lea    0x4004b(%rbp),%eax
/aux/hubicka/403.gcc/src/genrtl.c:47
  67c749:       41 89 06                mov    %eax,(%r14)
/aux/hubicka/403.gcc/src/genrtl.c:48
  67c74c:       49 89 5e 08             mov    %rbx,0x8(%r14)
/aux/hubicka/403.gcc/src/genrtl.c:49
  67c750:       4d 89 7e 10             mov    %r15,0x10(%r14)
That is about 37 bytes, including a spill.
