Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Richard Biener via Gcc Mon, 31 Aug 2020 05:09:38 -0700

On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka <[email protected]> wrote:
>
> > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > This is the final report of the "Automatic Parallel Compilation
> > > Viability" project.  Please notice that this report is pretty
> > > similar to the delivered from the 2nd evaluation, as this phase
> > > consisted of mostly rebasing and bug fixing.
> > >
> > > Please, reply this message for any question or suggestion.
> >
> > Thank you for your great work Giuliano!
>
> Indeed, it is quite amazing work :)
> >
> > It's odd that LTO emulated parallelism is winning here,
> > I'd have expected it to be slower.  One factor might
> > be different partitioning choices and the other might
> > be that the I/O required is faster than the GC induced
> > COW overhead after forking.  Note you can optimize
> > one COW operation by re-using the main process for
> > compiling the last partition.  I suppose you tested
> > this on a system with a fast SSD so I/O overhead is
> > small?
>
> At the time I implemented fork based parallelism for WPA (which I think
> we could recover by bit generalizing guiliano's patches), I had same
> outcome: forked ltranses was simply running slower than those after
> streaming.  This was however tested on Firefox in my estimate sometime
> around 2013. I never tried it on units comparable to insn-emit (which
> would be differnt at that time anyway). I was mostly aiming to get it
> fully transparent with streaming but never quite finished it since, at
> that time, it I tought time is better spent on optimizing LTO data
> layout.
>
> I suppose we want to keep both mechanizms in both WPA and normal
> compilation and make compiler to choose fitting one.


I repeated Giulianos experiment on gimple-match.ii and
producing LTO bytecode takes 5.3s and the link step
9.5s with two jobs, 6.6s with three and 5.0s with four
and 2.4s with 32.

With -fparallel-jobs=N and --param promote-statics=1 I
see 14.8s, 13.9s and 13.5s here.  With 8 jobs the reduction
is to 11s.

It looks like LTO much more aggressively partitions
this - I see 36 partitions generated for gimple-match.c
while -fparallel-jobs creates "only" 27.  -fparallel-jobs
doesn't seem to honor the various lto-partition
--params btw?  The relevant ones would be
--param lto-partitions (the max. number of partitions
to generate) and --param lto-min-partition
(the minimum size for a partition).  I always thought
that lto-min-partition should be higher for
-fparallel-jobs (which I envisioned to be enabled by
default).

I guess before investigating the current state in detail
it might be worth exploring Honzas wish of sharing
the actual partitioning code between LTO and -fparallel-jobs.

Note that larger objects take a bigger hit from the GC COW
issue so at some point that becomes dominant because the
first GC walk for each partition is the same as doing a GC
walk for the whole object.  Eventually it makes sense to
turn off GC completely for smaller partitions.

Richard.

> Honza
> >
> > Thanks again,
> > Richard.
> >
> > > Thank you,
> > > Giuliano.
> > >
> > > --- 8< -----------
> > >
> > > # Automatic Parallel Compilation Viability: Final Report
> > >
> > > ## Complete Tasks
> > >
> > > For the third evaluation, we expected to deliver the product as a
> > > series of patches for trunk.  The patch series were in fact delivered
> > > [1], but several items must be fixed before merge.
> > >
> > >
> > > Overall, the project works and speedups ranges from 0.95x to 3.3x.
> > > Bootstrap is working, and therefore this can be used in an experimental
> > > state.
> > >
> > > ## How to use
> > >
> > > 1. Clone the autopar_devel branch:
> > > ```
> > > git clone --single-branch --branch devel/autopar_devel \
> > >   git://gcc.gnu.org/git/gcc.git gcc_autopar_devel
> > > ```
> > > 2. Follow the standard compilation options provided in the Compiling
> > > GCC page, and install it on some directory. For instance:
> > >
> > > ```
> > > cd gcc_autopar_devel
> > > mkdir build && cd build
> > > ../configure --disable-bootstrap --enable-languages=c,c++
> > > make -j 8
> > > make DESTDIR=/tmp/gcc11_autopar install
> > > ```
> > >
> > > 3. If you want to test whether your version is working, just launch
> > > Gcc with `-fparallel-jobs=2` when compiling a file with -c.
> > >
> > > 5. If you want to compile a project with this version it uses GNU
> > > Makefiles, you must modify the compilation rule command and prepend a
> > > `+` token to it. For example, in Git's Makefile, Change:
> > > ```
> > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > >         $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > $(EXTRA_CPPFLAGS) $<
> > > ```
> > > to:
> > > ```
> > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs)
> > >         +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) 
> > > $(EXTRA_CPPFLAGS) $<
> > > ```
> > > as well as point the CC variable to the installed gcc, and
> > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable.
> > >
> > > # How the parallelism works in this project
> > >
> > > In LTO, the Whole Program Analysis decides how to partition the
> > > callgraph for running the LTRANS stage in parallel.  This project
> > > works very similar to this, however with some changes.
> > >
> > > The first was to modify the LTO structure so that it accepts
> > > the compilation without IR streaming to files.  This avoid an IO
> > > overhead when compiling in parallel.
> > >
> > > The second was to use a custom partitioner to find which nodes
> > > should be in the same partition.  This was mainly done to bring COMDAT
> > > together, as well as symbols that are part of other symbols, and even
> > > private symbols so that we do not output hidden global symbols.
> > >
> > > However, experiment showed that bringing private symbols together did
> > > not yield a interesting speedup on some large files, and therefore
> > > we implemented two modes of partitioning:
> > >
> > > 1. Partition without static promotion. This is the safer method to use,
> > > as we do not modify symbols in the Compilation Unit. This may lead to
> > > speedups in files that have multiple entries points with low
> > > connectivity between then (such as insn-emit.c), however this will not
> > > provide speedups when this hypothesis is not true (gimple-match.c is an
> > > example of this). This is the default mode.
> > >
> > > 2. Partition with static promotion to global. This is a more aggressive
> > > method, as we can decide to promote some functions to global to increase
> > > parallelism opportunity. This also will change the final assembler name
> > > of the promoted function to avoid collision with functions of others
> > > Compilation Units. To use this mode, the user has to manually specify
> > > --param=promote-statics=1, as they must be aware of this.
> > >
> > > Currently, partitioner 2. do not account the number of nodes to be
> > > promoted.  Implementing this certainly will reduce impact on produced
> > > code.
> > >
> > > ## Jobserver Integration
> > >
> > > We implemented a interface to communicate with the GNU Make's Jobserver
> > > that is able to detect when the GNU Make Jobserver is active, thanks to
> > > Nathan Sidwell. This works as follows:
> > >
> > > When -fparallel-jobs=jobserver is provided, GCC will try to detect if
> > > there is a running Jobserver in which we can communicate to. If true,
> > > we return the token that Make originally gave to us, then we wait for
> > > make for a new token that, when provided, will launch a forked child
> > > process with the partition information; or fall back to default
> > > compilation if Jobserver is not detected with a warning. Now if
> > > -fparallel-jobs=auto, this warning will not be provided and the default
> > > number of jobs will be set to 2, unless a single core CPU is detected.
> > > For more information about the implementation, check gcc/jobserver.cc
> > > file in the autopar_devel branch.
> > >
> > > ## Speedups
> > >
> > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U,
> > > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376.
> > > As a comparison, the theoretical maximum speedup by parallelizing this
> > > part of compilation is 4x, so there are things that can be improved
> > > here.
> > >
> > > Results are shown in the following table. There are three columns:
> > >
> > > * Sequential, which means running the compilation with just "-c -O2".
> > > * Without Static Promotion, which means appending 
> > > -fparallel-jobs=<NUM_THREADS>
> > > * With Static Promotion, which means appending 
> > > -fparallel-jobs=<NUM_THREADS>
> > > and --param=promote-statics=1
> > > * Speedup, which is Sequential / MIN (W/O Static Promotion, Static 
> > > Promotion)
> > > * LTO Emulated Parallelism, which consist of running a two step:
> > >    g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \
> > >    g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS>
> > >
> > > The test was the result of a single execution with a previous warm up
> > > execution. The compiled GCC had checking enabled, and therefore release
> > > version might have better timings in both sequential and parallel, but the
> > > speedup may remain the same.
> > >
> > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads)
> > > |                |            | Without Static | With Static |         | 
> > > LTO Emulated |
> > > | File           | Sequential |    Promotion   |  Promotion  | Speedup |  
> > > Parallelism |
> > > |----------------|------------|----------------|-----------------------|--------------|
> > > | gimple-match.c |     60s    |       63s      |     34s     |   1.7x  |  
> > >    32s      |
> > > | insn-emit.c    |     37s    |       19s      |     20s     |   1.9x  |  
> > >    20s      |
> > >
> > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads)
> > > |                |            | Without Static | With Static |         | 
> > > LTO Emulated |
> > > | File           | Sequential |    Promotion   |  Promotion  | Speedup |  
> > > Parallelism |
> > > |----------------|------------|----------------|-----------------------|--------------|
> > > | gimple-match.c |    2m42s   |      2m43s     |     49s     |   3.3x  |  
> > >    46s      |
> > > | insn-emit.c    |    1m35s   |       30s      |     30s     |   3.1x  |  
> > >    32s      |
> > >
> > > There was also a reduction from 35m32s to 29m30s when bootstrapping
> > > GCC when -fparallel-jobs=8 on the Opteron machine.
> > >
> > > ## Conclusions
> > >
> > > This project can still be improved on a lot of places. For example,
> > > just by looking at the table above, we can conclude that:
> > >
> > > * Promoting statics to globals can increase parallelism opportunity on
> > > compilation.
> > > * Even with the IO Streaming overhead, the LTO Emulated Parallelism
> > > was faster than with our method. This means that either our
> > > partitioner is slow or we are losing time into something which can
> > > be improved.
> > > * This can be used as a way to reduce overall compilation time in
> > > many-core machines.
> > >
> > > ## Fixed bugs from phase 2:
> > >
> > > * Make -fparallel-jobs=jobserver automatically detect if jobserver is
> > > present.
> > > * Fix a bug which caused the resulting object file to not be the same
> > > in every compilation, therefore making bootstrap comparison work as
> > > expected.
> > > * Fix a bug which caused LTO to crash due to a uninitialized output
> > > FILE pointer.
> > >
> > > ## TODOs:
> > >
> > > * Maybe increase minimal partition size (--param=lto-min-partitions).
> > > * Improve checking for unreasonable bias (unbalanced partitons).
> > > * Use cost of partitioning symbols into account.
> > > * Avoid forking for the first (or last) partitions.
> > > * Modify the driver to use the SPEC language instead of injecting
> > > commands in the execute () function.
> > > * Use the main process to write the assembler files instead of the
> > > worker process.  This removes the requirement for sorting later in
> > > the driver.
> > > * Remove hidden "-fPIC" and "fPIE" from the driver.
> > > * Check for race condition in additional_asm_file, as it is created
> > > later in the compilation process.
> > > * Replace all array allocations of size greater or equal the number
> > > of nodes in the graph from the stack to the heap (i.e. avoid alloca).
> > > * Use 64-bit integers when computing sizes.
> > > * Check for ways to reduce the time spent inside the kernel.
> > > * Add support to every GCC frontend.
> > > * Rely on ipa_summary instead of setting a custom cost to nodes
> > > without summary.
> > > * Merge the current partitioning logic with the LTO logic.
> > > * Merge compute_boundary function with what is already done in LTO.
> > >
> > > Cosmetic:
> > >
> > > * Change n to num_nodes when that is related to the number of nodes of
> > > the graph.
> > > * Explain better what the compression[] array does in lto-partition.c
> > >
> > > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html

Re: [GSoC] Automatic Parallel Compilation Viability -- Final Report

Reply via email to