https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93404
--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Created attachment 47701
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47701&action=edit
inliner size stats parser

It took me a while to find some reasonable way to analyze code size regressions caused by inliner with -flto. The attached script parses output of objdump -dl and maps code size back to original functions.

Here is an example of analysis for SPEC2k6 GCC which grows by 16% compared to GCC 9 and is the main contributor to the SPECint2006 regression (0.42MB out of 1.11MB).

Building binaries with and w/o auto-inlining:

$ /aux/hubicka/trunk-install/bin/gcc -O2 -g -flto=auto *.c -o gcc-inline -I ./ -DSPEC_CPU -DSPEC_CPU_LP64
$ /aux/hubicka/trunk-install/bin/gcc -O2 -g -flto=auto *.c -o gcc-noinline -I ./ -DSPEC_CPU -DSPEC_CPU_LP64 -fno-inline-functions
$ size -A gcc-inline | grep text
.text    2459681    4203552
$ size -A gcc-noinline | grep text
.text    2126369    4203552

(that is 15% regression, so indeed most of regression to gcc 9 is due to inlining)

$ objdump -dl gcc-inline | awk -f ~/parseobjdump.awk | sort -n | tail -15
12897 result_ready_cost():
14162 xmalloc():
14517 find_reloads.constprop.0():
14525 gen_rtx_fmt_e0():
14841 insn_extract():
15703 rtx_equal_p():
17105 yyparse_1.isra.0():
18033 recog_26():
22079 expand_expr():
22659 fold():
24046 recog_16():
26606 end_sequence():
31762 gen_rtx_fmt_e():
78479 gen_rtx_fmt_ee():
2511153
$ objdump -dl gcc-noinline | awk -f ~/parseobjdump.awk | sort -n | tail -15
10220 expand_binop():
11667 result_ready_cost():
11782 recog_30.constprop.0():
11939 split_2():
12230 try_combine():
13423 cse_insn():
14740 simplify_comparison():
14829 insn_extract():
14832 find_reloads.constprop.0():
16908 yyparse_1.isra.0():
18101 recog_26():
23255 expand_expr():
23450 fold():
24171 recog_16():
2167901

So the largest difference is inlining of gen_rtx_fmt_ee.
It is still the same as in mainline:

34 rtx
35 gen_rtx_fmt_ee (code, mode, arg0, arg1)
36      RTX_CODE code;
37      enum machine_mode mode;
38      rtx arg0;
39      rtx arg1;
40 {
41   rtx rt;
42   rt = ggc_alloc_rtx (2);
43   memset (rt, 0, sizeof (struct rtx_def) - sizeof (rtunion));
44
45   PUT_CODE (rt, code);
46   PUT_MODE (rt, mode);
47   XEXP (rt, 0) = arg0;
48   XEXP (rt, 1) = arg1;
49
50   return rt;
51 }

So it is a pretty basic ctor that makes sense to inline. Codegen differs depending on context, but typically it is something like this (which is a bit odd):

gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036ae:   bf 18 00 00 00          mov    $0x18,%edi
init_reload():
/aux/hubicka/403.gcc/src/reload1.c:500
  4036b3:   41 0f 95 c6             setne  %r14b
  4036b7:   49 89 c5                mov    %rax,%r13
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036ba:   e8 d1 64 0d 00          callq  4d9b90 <ggc_alloc>
init_reload():
/aux/hubicka/403.gcc/src/reload1.c:500
  4036bf:   41 83 c6 04             add    $0x4,%r14d
plus_constant_wide():
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036c3:   be 04 00 00 00          mov    $0x4,%esi
gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:42
  4036c8:   c7 40 03 00 00 00 00    movl   $0x0,0x3(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:41
  4036cf:   48 89 c7                mov    %rax,%rdi
/aux/hubicka/403.gcc/src/genrtl.c:42
  4036d2:   c6 40 07 00             movb   $0x0,0x7(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:44
  4036d6:   b8 4b 00 00 00          mov    $0x4b,%eax
  4036db:   66 89 07                mov    %ax,(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:45
  4036de:   44 88 77 02             mov    %r14b,0x2(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:46
  4036e2:   4c 89 6f 08             mov    %r13,0x8(%rdi)
/aux/hubicka/403.gcc/src/genrtl.c:47
  4036e6:   4c 89 67 10             mov    %r12,0x10(%rdi)

So about 49 bytes of code which does have a good chance to optimize with surrounding code. Adding noinline to gen_rtx_fmt_ee indeed saves 53k of binary, which seems reasonable given that the inlined code was reported to be 75k and offlining leads to a call sequence which is 5 bytes itself + collateral damage on surrounding code.
This is not the case for our gen_* expanders, which just produce a lot of construction: there is not much to optimize, and they do not happen to be performance critical enough for inlining to pay back. The inliner does not understand that.

The next two most copied functions:

26606 end_sequence():
31762 gen_rtx_fmt_e():

are clearly the same story. Adding noinline to all of gen_rtx_fmt* and end_sequence reduces code size to 2340961, so about 5% reduction or 1/3 of the regression. I suppose a lot of the remaining code growth comes down to similar kinds of inline decisions.

11808 incdec_operand():
11869 try_combine():
11937 split_2():
12849 cse_insn():
12897 result_ready_cost():
14229 xmalloc():
14601 find_reloads.constprop.0():
14841 insn_extract():
15696 rtx_equal_p():
17105 yyparse_1.isra.0():
18033 recog_26():
22035 expand_expr():
22659 fold():
24046 recog_16():
2389543

Inlining of rtx_equal_p seems to be the next case.

So most of the growth of GCC seems to be due to inlining "constructors". I can see the following options:

 1) let it be
 2) introduce a parameter limiting auto-inlining of functions which are called many times (i.e. cause more than a given percentage growth of the whole program), which may be hard to tune
 3) restrict auto-inlining of non-leaf functions with -O2

I do not particularly fancy any of those, but probably 3 is the most interesting.

Comparing to clang:

$ size -A gcc-clang | grep text
.text    3100881    4203600

20142 expand_expr():
21720 gen_rtx_fmt_e():
22181 recog_26():
22402 fold():
23286 extract_constrain_insn_cached():
24681 gen_rtx_CONST_INT():
24980 extract_insn_cached():
25915 gen_rtx_fmt_i0():
25975 start_sequence():
34935 recog_16():
35293 xmalloc():
40560 nonimmediate_operand():
56208 gen_rtx_fmt_ee():
99014 register_operand():
3167996

So clang additionally inlines operand predicates (called many times).
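As a rough cross-check of the percentages quoted so far, the .text sizes can be compared directly. This is only an illustrative sketch: the dictionary labels and the growth_percent helper are made up for this example, and it assumes that the 2340961 figure from the noinline experiment is a .text size like the others.

```python
# .text sizes in bytes as quoted in this comment (size -A output and the
# noinline experiment). The labels are made up for this sketch.
TEXT_BYTES = {
    "gcc-noinline": 2126369,
    "gcc-inline": 2459681,
    "gcc-inline + noinline ctors": 2340961,  # assumed to be .text as well
    "gcc-clang": 3100881,
}


def growth_percent(new, base):
    """Relative growth of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


base = TEXT_BYTES["gcc-noinline"]
for label, size in TEXT_BYTES.items():
    print(f"{label:30s} {size:8d}  {growth_percent(size, base):+6.1f}%")

# How much of the inlining growth the noinline annotations win back.
regression = TEXT_BYTES["gcc-inline"] - base
recovered = TEXT_BYTES["gcc-inline"] - TEXT_BYTES["gcc-inline + noinline ctors"]
print(f"recovered {recovered} of {regression} bytes "
      f"({100.0 * recovered / regression:.0f}% of the regression)")
```

With these inputs the inlining growth comes out a bit under 16% over gcc-noinline, the noinline annotations win back a little over a third of it, and the clang build is roughly 46% larger than gcc-noinline.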
Clang however seems better on gen_rtx_fmt_ee codegen:

gen_rtx_fmt_ee():
/aux/hubicka/403.gcc/src/genrtl.c:43
  67c72f:   bf 18 00 00 00          mov    $0x18,%edi
  67c734:   e8 a7 1a eb ff          callq  52e1e0 <ggc_alloc>
  67c739:   49 89 c6                mov    %rax,%r14
/aux/hubicka/403.gcc/src/genrtl.c:44
  67c73c:   48 c7 00 00 00 00 00    movq   $0x0,(%rax)
/aux/hubicka/403.gcc/src/genrtl.c:46
  67c743:   8d 85 4b 00 04 00       lea    0x4004b(%rbp),%eax
/aux/hubicka/403.gcc/src/genrtl.c:47
  67c749:   41 89 06                mov    %eax,(%r14)
/aux/hubicka/403.gcc/src/genrtl.c:48
  67c74c:   49 89 5e 08             mov    %rbx,0x8(%r14)
/aux/hubicka/403.gcc/src/genrtl.c:49
  67c750:   4d 89 7e 10             mov    %r15,0x10(%r14)

That is about 37 bytes including a spill.
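For readers without the attachment, the kind of attribution done on objdump -dl output can be approximated as below. This is not the attached parseobjdump.awk, only a minimal sketch of the same idea under stated assumptions: every "name():" marker line switches the current function, instruction lines have the tab-separated shape "address:<TAB>hex bytes<TAB>mnemonic", and bytes seen before the first marker are ignored. The names bytes_per_function and SAMPLE are hypothetical; the sample lines are taken from the GCC listing quoted earlier in this comment.

```python
import re
from collections import Counter

# "name():" marker lines that objdump -l prints when the source function changes.
FUNC_RE = re.compile(r"^(.+)\(\):$")
# Instruction lines, assumed to look like "  4036ae:<TAB>bf 18 00 00 00   <TAB>mov ...".
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\t([0-9a-f ]+)")


def bytes_per_function(lines):
    """Sum encoded instruction bytes under the most recent "name():" marker."""
    sizes = Counter()
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = FUNC_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = INSN_RE.match(line)
        if m and current is not None:
            # Each whitespace-separated token in the hex field is one byte.
            sizes[current] += len(m.group(1).split())
    return sizes


# First few lines of the gen_rtx_fmt_ee listing from this comment.
SAMPLE = [
    "gen_rtx_fmt_ee():",
    "/aux/hubicka/403.gcc/src/genrtl.c:41",
    "  4036ae:\tbf 18 00 00 00       \tmov    $0x18,%edi",
    "init_reload():",
    "/aux/hubicka/403.gcc/src/reload1.c:500",
    "  4036b3:\t41 0f 95 c6          \tsetne  %r14b",
    "  4036b7:\t49 89 c5             \tmov    %rax,%r13",
    "gen_rtx_fmt_ee():",
    "/aux/hubicka/403.gcc/src/genrtl.c:41",
    "  4036ba:\te8 d1 64 0d 00       \tcallq  4d9b90 <ggc_alloc>",
]

# Print in the same "size name():" shape as the listings in this comment.
for name, size in sorted(bytes_per_function(SAMPLE).items(), key=lambda kv: kv[1]):
    print(size, name + "():")
```

On a real binary the same function could be fed the lines of objdump -dl output, for example by passing sys.stdin instead of SAMPLE. Because bytes are charged to whichever function the debug line info names, the bytes of an inlined copy are charged to the inlined function rather than to its caller, which is what makes heavily copied functions stand out.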