Hi,

I am attaching here the first report of the
"Automatic Parallel Viability" project. Please feel free to suggest
improvements to the project.

The content below is presented in markdown format, and you can easily
convert it to PDF with pandoc if you feel it uncomfortable to read in
current format.


```
# Automatic Parallel Compilation Viability: First Report

## Complete Tasks

There were two tasks scheduled to be complete in this stage:

1. Update the driver to pass a hidden `-fsplit-outputs` to the compiler,
writting down the path to each assembler file generated by the compiler.

2. Update the compiler to partition the program, and fork on each
partition, generating one asm file per partition. The compiler should
compile some programs correctly at this point.

Both of these tasks were completed at this point, as we have a version
of GCC that bootstraps and have a small speedup when compiling
insn-emit.c in parallel (with --enable-checking, 42s sequential w/o
partitioning vs. 36s parallel).

## First Implementation

The first working implementation consists of two parts:

- A partitioner designed to handle the Compilation Unit as if it only
  have part of the program information, as opposed as the WPA
  partitioner that have the entire program callgraph. This imposes some
  restriction on us about our partitioner, that is:

    1. Static functions that calls eachother are mapped to the same
    partition to avoid renaming and promotion to globals.

    2. Functions that have references to a same global variable are
    mapped into the same partition to avoid multiple definitions of this
    partition.

    3. Functions which address are taken partitioned together with
    whoever took the address and with calls that has the possibility
    to receive the address as argument.

    4. COMDATs should be in the same partition.

- A way to apply a `ltrans_partition` without dumping the partition
  to disk and reloading it. This is done by looking into the
  partitioner generated encoder and releasing information about
  functions that are not in the partition by releasing function
  bodies, dominator tree, entries at clone tree, and setting the
  variables accordingly.

This implementation lead to very unbalanced partitions, as it generate
one really big partition and several very small partitions. However, the
partitioner showed to be very fast in practice, with preliminary
complexity of O(|E| * log\*(|V|)), which is approximately linear on the
number of edges of the graph, as log\*(2^64) = 5. Sure the worst case
scenario where the callgraph is the complete graph implies |E| =
|V|\*(|V| - 1)/2, but that looks very synthetic to a input program.

## Ongoing Implementation

This implementation is still work on progress, but has showed
significant improvements when compared to the first implementation.
First we relax the condition 1. by not merging calls that were inlined,
and relax condition 2. by not merging functions that references
non-static global variables.

We managed to get this version to bootstrap by commenting some
assertions at ipa-cp optimization pass, probably because `node->local`
is incorrectly set in some cases, and calling `remove_unreachable_nodes`
after applying the partition to the callgraph, but marking nodes inside
the partition to be `force_output` before entering that function so
there is no risk of being removed incorrectly, and unseting
`force_output` afterwards for nodes that wasn't set beforehand.

This methodology yielded a speedup of 2x on a Quad Core Intel Core-i7
8565U CPU when compiling insn-emit.c
(24.509s parallel vs. 45.909s sequential), but no speedup at all when
compiling gimple-match.c, this file looks really problematic on current
methodology because it has 3 entry points and is stuffed with static
functions.

We are also exploring the possibility of promoting non-inlined static
functions to globals in some cases where it might improve compilation
performance, although it may slowdown the produced binary in some
architectures, therefore completely removing restriction 1. Our results
with that restriction relaxed on gimple-match.c showed a 2.12x
improvement over serial compilation (36.184s parallel vs. 1m17s
sequential).

## Conclusions

So far the work has shows to be promising. Compilation of insn-emit.c
is greatly improved with 2x speedup although no improvements to
gimple-match.c was observed without relaxing condition 1. Once this
new partitioner bootstrap, we will proceed to implementing support
for the GNU Jobserver.
```

Reply via email to