On Mon, 6 May 2019, Martin Liška wrote:
> On 5/2/19 3:18 PM, Richard Biener wrote:
> > On Mon, 29 Apr 2019, Martin Liška wrote:
> >
> >> On 9/10/18 1:43 PM, Martin Liška wrote:
> >>> On 09/04/2018 05:07 PM, Martin Liška wrote:
> >>>> - in order to achieve real speed up we need to split also other
> >>>> generated (and also dwarf2out.c, i386.c, ..) files:
> >>>> here I'm most concerned about insn-recog.c, which can't be split the
> >>>> same way without ending up with a single huge SCC component.
> >>>
> >>> About the insn-recog.c file: all functions are static and using SCC one
> >>> ends up with all functions in one component. In order to split the
> >>> callgraph one needs to promote some functions to be extern and then
> >>> split would be possible. In order to do that we'll probably need to
> >>> teach splitter how to do partitioning based on minimal number of edges
> >>> to be removed.
> >>>
> >>> Should I take inspiration from lto_balanced_map, or is there some
> >>> simple algorithm I can start with?
> >>>
> >>> Martin
> >>>
> >>
> >> I'm adding Richard Sandiford here, as he wrote the majority of
> >> gcc/genrecog.c. As mentioned, I'm looking for a way to split the
> >> generated file, or to teach the generator to produce a reasonable split.
> >
> > Sometime earlier this year I did an experiment using
> > a compile with -flto -fno-fat-lto-objects
>
> -fno-fat-lto-objects is default, isn't it?

Where linker plugin support is detected, yes.

> > and a link
> > via -flto -r -flinker-output=rel into the object file. This cut
> > compile-time more than in half with less maintenance overhead.
>
> Can you please provide the exact command line for doing that?

gcc t.c -o t.o -flto=8 -r -flinker-output=nolto-rel

There's an annoying warning:

cc1plus: warning: command line option ‘-flinker-output=nolto-rel’ is valid for LTO but not for C++

which can be avoided by splitting the above into a compile and
a separate LTO "link" step. Using -Wl,-flinker-.... doesn't
work unfortunately (ld doesn't understand it).
With an installed GCC 9.1, compiling trunk gimple-match.c with -O2 -g
takes 58.7s, while with the LTO trick it takes 23.3s (combined
CPU time goes up to 96s). That was with -flto=8 on a CPU with
4 physical and 8 logical cores. Since the build uses -g, the timing
includes the debug copy dance as well.
> bloaty gimple-match.o -- gimple-match.o.nolto
   VM SIZE                                                       FILE SIZE
++++++++++++++  GROWING                                         ++++++++++++++
 [ = ]       0  .rela.debug_info                               +3.62Mi    +45%
 [ = ]       0  .rela.debug_ranges                              +161Ki   +1.8%
 [ = ]       0  .debug_str                                     +95.8Ki    +19%
 [ = ]       0  .rela.text                                     +77.6Ki    +10%
 [ = ]       0  .debug_ranges                                  +58.9Ki   +1.7%
 [ = ]       0  .symtab                                        +22.9Ki    +68%
 [ = ]       0  .debug_abbrev                                  +21.1Ki   +394%
 [ = ]       0  .strtab                                        +11.4Ki   +9.5%
 +8.1% +5.34Ki  .eh_frame                                      +5.34Ki   +8.1%
  +84% +4.09Ki  .rodata.str1.8                                 +4.09Ki    +84%
 [ = ]       0  .rela.text.unlikely                            +3.87Ki   +1.0%
 [ = ]       0  .rela.debug_aranges                            +3.68Ki   +872%
 [ = ]       0  .debug_aranges                                 +3.02Ki  +10e2%
  +42% +2.59Ki  .rodata.str1.1                                 +2.59Ki    +42%
 +0.2% +2.41Ki  [Other]                                        +2.45Ki   +0.2%
 [ = ]       0  .rela.debug_line                               +2.09Ki    +16%
 [ = ]       0  .rela.eh_frame                                 +1.17Ki   +4.3%
 [NEW] +1.09Ki  .rodata._Z7get_defPFP9tree_nodeS0_ES0_.str1.8  +1.09Ki   [NEW]
 [ = ]       0  .shstrtab                                         +784    +44%
 [ = ]       0  [ELF Headers]                                     +768    +16%
 [ = ]       0  .comment                                          +666  +37e2%
--------------  SHRINKING                                       --------------
 [ = ]       0  .debug_line                                     -256Ki  -17.3%
 [ = ]       0  .rela.debug_loc                                -73.6Ki   -0.6%
 [ = ]       0  .debug_info                                    -63.4Ki   -1.6%
 [ = ]       0  .debug_loc                                     -39.3Ki   -0.6%
 +1.1% +15.5Ki  TOTAL                                          +3.67Mi   +7.8%

.debug_line probably shrinks because we drop column numbers with LTO.

Richard.