Hi, Richi.

On 08/31, Richard Biener wrote:
> On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> > > <giuliano.belina...@usp.br> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is the final report of the "Automatic Parallel Compilation
> > > > Viability" project.  Please notice that this report is pretty
> > > > similar to the delivered from the 2nd evaluation, as this phase
> > > > consisted of mostly rebasing and bug fixing.
> > > >
> > > > Please, reply this message for any question or suggestion.
> > >
> > > Thank you for your great work Giuliano!
> >
> > Indeed, it is quite amazing work :)
> > >
> > > It's odd that LTO emulated parallelism is winning here,
> > > I'd have expected it to be slower.  One factor might
> > > be different partitioning choices and the other might
> > > be that the I/O required is faster than the GC induced
> > > COW overhead after forking.  Note you can optimize
> > > one COW operation by re-using the main process for
> > > compiling the last partition.  I suppose you tested
> > > this on a system with a fast SSD so I/O overhead is
> > > small?
> >
> > At the time I implemented fork-based parallelism for WPA (which I think
> > we could recover by generalizing Giuliano's patches a bit), I had the same
> > outcome: forked ltranses were simply running slower than those started
> > after streaming.  This was however tested on Firefox, I estimate sometime
> > around 2013.  I never tried it on units comparable to insn-emit (which
> > would be different at that time anyway).  I was mostly aiming to get it
> > fully transparent with streaming but never quite finished it since, at
> > that time, I thought time was better spent on optimizing the LTO data
> > layout.
> >
> > I suppose we want to keep both mechanisms in both WPA and normal
> > compilation and make the compiler choose the fitting one.
> 
> I repeated Giuliano's experiment on gimple-match.ii: producing LTO
> bytecode takes 5.3s, and the link step takes 9.5s with two jobs,
> 6.6s with three, 5.0s with four and 2.4s with 32.
> 
> With -fparallel-jobs=N and --param promote-statics=1 I
> see 14.8s, 13.9s and 13.5s here.  With 8 jobs the reduction
> is to 11s.
> 
> It looks like LTO much more aggressively partitions
> this - I see 36 partitions generated for gimple-match.c
> while -fparallel-jobs creates "only" 27.  -fparallel-jobs
> doesn't seem to honor the various lto-partition
> --params btw?  The relevant ones would be
> --param lto-partitions (the max. number of partitions
> to generate) and --param lto-min-partition
> (the minimum size for a partition).  I always thought
> that lto-min-partition should be higher for
> -fparallel-jobs (which I envisioned to be enabled by
> default).

There is a partition balancing mechanism that can be disabled
with --param=balance-partitions=0.

Assuming that you used -fparallel-jobs=32, it is possible that it
merged small partitions until it reached the average size of
max_size / 33, which resulted in 27 partitions.

The only LTO parameter that I use is --param=lto-min-partition,
which controls the minimum size at which it will run in parallel.
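
To make that heuristic concrete, here is a rough stand-alone sketch of
the merging idea (an illustration only, with the target size assumed to
be total size / (jobs + 1); it is not the actual balance-partitions
code):

```
// Merge consecutive small partitions until each merged partition
// reaches roughly total_size / (jobs + 1).
#include <numeric>
#include <vector>

std::vector<long>
balance (const std::vector<long> &sizes, int jobs)
{
  long total = std::accumulate (sizes.begin (), sizes.end (), 0L);
  long target = total / (jobs + 1);   // e.g. total / 33 for 32 jobs

  std::vector<long> merged;
  long current = 0;
  for (long s : sizes)
    {
      current += s;
      if (current >= target)   // close the current merged partition
	{
	  merged.push_back (current);
	  current = 0;
	}
    }
  if (current > 0)
    merged.push_back (current);   // leftover symbols form the last one
  return merged;
}
```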

> 
> I guess before investigating the current state in detail
> it might be worth exploring Honzas wish of sharing
> the actual partitioning code between LTO and -fparallel-jobs.
> 
> Note that larger objects take a bigger hit from the GC COW
> issue so at some point that becomes dominant because the
> first GC walk for each partition is the same as doing a GC
> walk for the whole object.  Eventually it makes sense to
> turn off GC completely for smaller partitions.

Just a side note: I added a ggc_collect () before starting to fork
and it did not improve things.

Thank you,
Giuliano.

> 
> Richard.
> 
> > Honza
> > >
> > > Thanks again,
> > > Richard.
> > >
> > > > Thank you,
> > > > Giuliano.
> > > >
> > > > --- 8< -----------
> > > >
> > > > # Automatic Parallel Compilation Viability: Final Report
> > > >
> > > > ## Complete Tasks
> > > >
> > > > For the third evaluation, we expected to deliver the product as a
> > > > series of patches for trunk.  The patch series was in fact delivered
> > > > [1], but several items must be fixed before it can be merged.
> > > >
> > > >
> > > > Overall, the project works and speedups range from 0.95x to 3.3x.
> > > > Bootstrap is working, and therefore this can be used in an
> > > > experimental state.
> > > >
> > > > ## How to use
> > > >
> > > > 1. Clone the autopar_devel branch:
> > > > ```
> > > > git clone --single-branch --branch devel/autopar_devel \
> > > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > > ```
> > > > 2. Follow the standard compilation options provided in the Compiling
> > > > GCC page, and install it in some directory. For instance:
> > > >
> > > > ```
> > > > cd gcc_autopar_devel
> > > > mkdir build && cd build
> > > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > > make -j 8
> > > > make DESTDIR=/tmp/gcc11_autopar install
> > > > ```
> > > >
> > > > 3. If you want to test whether your version is working, just launch
> > > > GCC with `-fparallel-jobs=2` when compiling a file with -c.
> > > >
> > > > 4. If you want to compile a project that uses GNU Makefiles with this
> > > > version, you must modify the compilation rule command and prepend a
> > > > `+` token to it. For example, in Git's Makefile, change:
> > > > ```
> > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > >         $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
> > > > ```
> > > > to:
> > > > ```
> > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > > >         +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
> > > > ```
> > > > as well as point the CC variable to the installed gcc, and
> > > > append `-fparallel-jobs=jobserver` to your CFLAGS variable.
> > > >
> > > > # How the parallelism works in this project
> > > >
> > > > In LTO, the Whole Program Analysis decides how to partition the
> > > > callgraph for running the LTRANS stage in parallel.  This project
> > > > works very similarly to that, however with some changes.
> > > >
> > > > The first was to modify the LTO structure so that it accepts
> > > > compilation without streaming the IR to files.  This avoids I/O
> > > > overhead when compiling in parallel.
> > > >
> > > > The second was to use a custom partitioner to find which nodes
> > > > should be in the same partition.  This was mainly done to bring COMDAT
> > > > groups together, as well as symbols that are part of other symbols, and
> > > > even private symbols, so that we do not output hidden global symbols.
> > > >
> > > > However, experiments showed that bringing private symbols together did
> > > > not yield an interesting speedup on some large files, and therefore
> > > > we implemented two modes of partitioning:
> > > >
> > > > 1. Partition without static promotion. This is the safer method to use,
> > > > as we do not modify symbols in the Compilation Unit. This may lead to
> > > > speedups in files that have multiple entry points with low
> > > > connectivity between them (such as insn-emit.c); however, this will not
> > > > provide speedups when this hypothesis is not true (gimple-match.c is an
> > > > example of this). This is the default mode.
> > > >
> > > > 2. Partition with static promotion to global. This is a more aggressive
> > > > method, as we can decide to promote some functions to global to increase
> > > > the parallelism opportunity. This will also change the final assembler
> > > > name of the promoted function to avoid collisions with functions of other
> > > > Compilation Units. To use this mode, the user has to manually specify
> > > > --param=promote-statics=1, as they must be aware of this change.
> > > >
> > > > Currently, partitioning mode 2 does not account for the number of nodes
> > > > to be promoted.  Implementing this will certainly reduce the impact on
> > > > the produced code.
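> > > >
> > > > To make the promotion idea concrete, here is a minimal, self-contained
> > > > sketch of what "promoting a static" amounts to; the struct, names and
> > > > renaming scheme are invented for the illustration and are not the
> > > > actual code in the branch:
> > > >
> > > > ```
> > > > // Hypothetical illustration of static promotion: a symbol that was
> > > > // local to the compilation unit gets a unique, partition-qualified
> > > > // assembler name and becomes global, so partitions compiled in
> > > > // parallel can reference it without name clashes.
> > > > #include <iostream>
> > > > #include <string>
> > > > #include <vector>
> > > >
> > > > struct symbol
> > > > {
> > > >   std::string asm_name;
> > > >   bool is_static;    // local to the original compilation unit
> > > >   int partition_id;  // partition the symbol body was assigned to
> > > > };
> > > >
> > > > static void
> > > > promote_statics (std::vector<symbol> &symbols, const std::string &cu_salt)
> > > > {
> > > >   for (symbol &s : symbols)
> > > >     if (s.is_static)
> > > >       {
> > > >         // Rename to avoid colliding with symbols of other CUs, then
> > > >         // make the symbol visible to the sibling partitions.
> > > >         s.asm_name += "." + cu_salt + "." + std::to_string (s.partition_id);
> > > >         s.is_static = false;
> > > >       }
> > > > }
> > > >
> > > > int
> > > > main ()
> > > > {
> > > >   std::vector<symbol> syms = { { "helper", true, 3 }, { "api_entry", false, 0 } };
> > > >   promote_statics (syms, "cu42");
> > > >   for (const symbol &s : syms)
> > > >     std::cout << s.asm_name << (s.is_static ? " (static)\n" : " (global)\n");
> > > > }
> > > > ```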
> > > >
> > > > ## Jobserver Integration
> > > >
> > > > We implemented an interface to communicate with GNU Make's Jobserver
> > > > that is able to detect when the Jobserver is active, thanks to
> > > > Nathan Sidwell. This works as follows:
> > > >
> > > > When -fparallel-jobs=jobserver is provided, GCC will try to detect
> > > > whether there is a running Jobserver we can communicate with. If there
> > > > is, we return the token that Make originally gave to us and then wait
> > > > for Make to provide a new token; when one is provided, we launch a
> > > > forked child process with the partition information. If no Jobserver
> > > > is detected, we fall back to the default compilation and emit a
> > > > warning. With -fparallel-jobs=auto, this warning is not emitted and
> > > > the default number of jobs is set to 2, unless a single-core CPU is
> > > > detected. For more information about the implementation, check the
> > > > gcc/jobserver.cc file in the autopar_devel branch.
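> > > >
> > > > As a rough illustration of the handshake described above (assuming the
> > > > classic file-descriptor form of Make's --jobserver-auth; this is a
> > > > stand-alone sketch, not the code in gcc/jobserver.cc):
> > > >
> > > > ```
> > > > // Minimal jobserver client sketch: find the read/write pipe fds in
> > > > // MAKEFLAGS, read one byte (a token) before doing extra work, and
> > > > // write the token back when done.  Error handling is omitted.
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <string.h>
> > > > #include <unistd.h>
> > > >
> > > > static bool
> > > > jobserver_fds (int *rfd, int *wfd)
> > > > {
> > > >   const char *flags = getenv ("MAKEFLAGS");
> > > >   if (!flags)
> > > >     return false;
> > > >   const char *auth = strstr (flags, "--jobserver-auth=");
> > > >   if (!auth)
> > > >     return false;
> > > >   return sscanf (auth, "--jobserver-auth=%d,%d", rfd, wfd) == 2;
> > > > }
> > > >
> > > > int
> > > > main ()
> > > > {
> > > >   int rfd, wfd;
> > > >   if (!jobserver_fds (&rfd, &wfd))
> > > >     {
> > > >       fprintf (stderr, "no jobserver detected, compiling sequentially\n");
> > > >       return 0;
> > > >     }
> > > >
> > > >   char token;
> > > >   if (read (rfd, &token, 1) == 1)   /* wait for a job slot from make */
> > > >     {
> > > >       /* ... fork and compile one partition here ... */
> > > >       write (wfd, &token, 1);       /* hand the slot back to make */
> > > >     }
> > > >   return 0;
> > > > }
> > > > ```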
> > > >
> > > > ## Speedups
> > > >
> > > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U,
> > > > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376.
> > > > As a comparison, the theoretical maximum speedup by parallelizing this
> > > > part of compilation is 4x, so there are things that can be improved
> > > > here.
> > > >
> > > > Results are shown in the following tables, with these columns:
> > > >
> > > > * Sequential, which means running the compilation with just "-c -O2".
> > > > * Without Static Promotion, which means appending
> > > > -fparallel-jobs=<NUM_THREADS>.
> > > > * With Static Promotion, which means appending
> > > > -fparallel-jobs=<NUM_THREADS> and --param=promote-statics=1.
> > > > * Speedup, which is Sequential / MIN (Without Static Promotion,
> > > > With Static Promotion).
> > > > * LTO Emulated Parallelism, which consists of running two steps:
> > > >    g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \
> > > >    g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS>
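> > > >
> > > > For instance, taking gimple-match.c on the Core-i7 below, the speedup
> > > > is 60s / MIN (63s, 34s) = 60 / 34, which is roughly 1.7x.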
> > > >
> > > > Each result is from a single execution preceded by a warm-up
> > > > execution. The compiled GCC had checking enabled, so a release build
> > > > might have better timings in both the sequential and parallel cases,
> > > > but the speedup should remain about the same.
> > > >
> > > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads)
> > > >
> > > > | File           | Sequential | Without Static Promotion | With Static Promotion | Speedup | LTO Emulated Parallelism |
> > > > |----------------|------------|--------------------------|-----------------------|---------|--------------------------|
> > > > | gimple-match.c | 60s        | 63s                      | 34s                   | 1.7x    | 32s                      |
> > > > | insn-emit.c    | 37s        | 19s                      | 20s                   | 1.9x    | 20s                      |
> > > >
> > > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads)
> > > >
> > > > | File           | Sequential | Without Static Promotion | With Static Promotion | Speedup | LTO Emulated Parallelism |
> > > > |----------------|------------|--------------------------|-----------------------|---------|--------------------------|
> > > > | gimple-match.c | 2m42s      | 2m43s                    | 49s                   | 3.3x    | 46s                      |
> > > > | insn-emit.c    | 1m35s      | 30s                      | 30s                   | 3.1x    | 32s                      |
> > > >
> > > > There was also a reduction from 35m32s to 29m30s when bootstrapping
> > > > GCC with -fparallel-jobs=8 on the Opteron machine.
> > > >
> > > > ## Conclusions
> > > >
> > > > This project can still be improved in a lot of places. For example,
> > > > just by looking at the tables above, we can conclude that:
> > > >
> > > > * Promoting statics to globals can increase the parallelism
> > > > opportunity during compilation.
> > > > * Even with the I/O streaming overhead, the LTO Emulated Parallelism
> > > > was faster than our method. This means that either our
> > > > partitioner is slow or we are losing time to something which can
> > > > be improved.
> > > > * This can be used as a way to reduce overall compilation time on
> > > > many-core machines.
> > > >
> > > > ## Fixed bugs from phase 2:
> > > >
> > > > * Made -fparallel-jobs=jobserver automatically detect whether a
> > > > jobserver is present.
> > > > * Fixed a bug which caused the resulting object file to not be the same
> > > > in every compilation, therefore making bootstrap comparison work as
> > > > expected.
> > > > * Fixed a bug which caused LTO to crash due to an uninitialized output
> > > > FILE pointer.
> > > >
> > > > ## TODOs:
> > > >
> > > > * Maybe increase the minimal partition size (--param=lto-min-partition).
> > > > * Improve checking for unreasonable bias (unbalanced partitions).
> > > > * Take the cost of partitioning symbols into account.
> > > > * Avoid forking for the first (or last) partitions.
> > > > * Modify the driver to use the SPEC language instead of injecting
> > > > commands in the execute () function.
> > > > * Use the main process to write the assembler files instead of the
> > > > worker process.  This removes the requirement for sorting later in
> > > > the driver.
> > > > * Remove hidden "-fPIC" and "-fPIE" from the driver.
> > > > * Check for race condition in additional_asm_file, as it is created
> > > > later in the compilation process.
> > > > * Replace all array allocations of size greater than or equal to the
> > > > number of nodes in the graph, moving them from the stack to the heap
> > > > (i.e. avoid alloca).
> > > > * Use 64-bit integers when computing sizes.
> > > > * Check for ways to reduce the time spent inside the kernel.
> > > > * Add support for every GCC frontend.
> > > > * Rely on ipa_summary instead of setting a custom cost for nodes
> > > > without a summary.
> > > > * Merge the current partitioning logic with the LTO logic.
> > > > * Merge compute_boundary function with what is already done in LTO.
> > > >
> > > > Cosmetic:
> > > >
> > > > * Change n to num_nodes when it refers to the number of nodes of
> > > > the graph.
> > > > * Better explain what the compression[] array does in lto-partition.c.
> > > >
> > > > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html
