On Mon, Aug 31, 2020 at 1:15 PM Jan Hubicka <hubi...@ucw.cz> wrote: > > > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi > > <giuliano.belina...@usp.br> wrote: > > > > > > Hi, > > > > > > This is the final report of the "Automatic Parallel Compilation > > > Viability" project. Please notice that this report is pretty > > > similar to the delivered from the 2nd evaluation, as this phase > > > consisted of mostly rebasing and bug fixing. > > > > > > Please, reply this message for any question or suggestion. > > > > Thank you for your great work Giuliano! > > Indeed, it is quite amazing work :) > > > > It's odd that LTO emulated parallelism is winning here, > > I'd have expected it to be slower. One factor might > > be different partitioning choices and the other might > > be that the I/O required is faster than the GC induced > > COW overhead after forking. Note you can optimize > > one COW operation by re-using the main process for > > compiling the last partition. I suppose you tested > > this on a system with a fast SSD so I/O overhead is > > small? > > At the time I implemented fork based parallelism for WPA (which I think > we could recover by bit generalizing guiliano's patches), I had same > outcome: forked ltranses was simply running slower than those after > streaming. This was however tested on Firefox in my estimate sometime > around 2013. I never tried it on units comparable to insn-emit (which > would be differnt at that time anyway). I was mostly aiming to get it > fully transparent with streaming but never quite finished it since, at > that time, it I tought time is better spent on optimizing LTO data > layout. > > I suppose we want to keep both mechanizms in both WPA and normal > compilation and make compiler to choose fitting one.
I repeated Giulianos experiment on gimple-match.ii and producing LTO bytecode takes 5.3s and the link step 9.5s with two jobs, 6.6s with three and 5.0s with four and 2.4s with 32. With -fparallel-jobs=N and --param promote-statics=1 I see 14.8s, 13.9s and 13.5s here. With 8 jobs the reduction is to 11s. It looks like LTO much more aggressively partitions this - I see 36 partitions generated for gimple-match.c while -fparallel-jobs creates "only" 27. -fparallel-jobs doesn't seem to honor the various lto-partition --params btw? The relevant ones would be --param lto-partitions (the max. number of partitions to generate) and --param lto-min-partition (the minimum size for a partition). I always thought that lto-min-partition should be higher for -fparallel-jobs (which I envisioned to be enabled by default). I guess before investigating the current state in detail it might be worth exploring Honzas wish of sharing the actual partitioning code between LTO and -fparallel-jobs. Note that larger objects take a bigger hit from the GC COW issue so at some point that becomes dominant because the first GC walk for each partition is the same as doing a GC walk for the whole object. Eventually it makes sense to turn off GC completely for smaller partitions. Richard. > Honza > > > > Thanks again, > > Richard. > > > > > Thank you, > > > Giuliano. > > > > > > --- 8< ----------- > > > > > > # Automatic Parallel Compilation Viability: Final Report > > > > > > ## Complete Tasks > > > > > > For the third evaluation, we expected to deliver the product as a > > > series of patches for trunk. The patch series were in fact delivered > > > [1], but several items must be fixed before merge. > > > > > > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x. > > > Bootstrap is working, and therefore this can be used in an experimental > > > state. > > > > > > ## How to use > > > > > > 1. Clone the autopar_devel branch: > > > ``` > > > git clone --single-branch --branch devel/autopar_devel \ > > > git://gcc.gnu.org/git/gcc.git gcc_autopar_devel > > > ``` > > > 2. Follow the standard compilation options provided in the Compiling > > > GCC page, and install it on some directory. For instance: > > > > > > ``` > > > cd gcc_autopar_devel > > > mkdir build && cd build > > > ../configure --disable-bootstrap --enable-languages=c,c++ > > > make -j 8 > > > make DESTDIR=/tmp/gcc11_autopar install > > > ``` > > > > > > 3. If you want to test whether your version is working, just launch > > > Gcc with `-fparallel-jobs=2` when compiling a file with -c. > > > > > > 5. If you want to compile a project with this version it uses GNU > > > Makefiles, you must modify the compilation rule command and prepend a > > > `+` token to it. For example, in Git's Makefile, Change: > > > ``` > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > > $(EXTRA_CPPFLAGS) $< > > > ``` > > > to: > > > ``` > > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > > $(EXTRA_CPPFLAGS) $< > > > ``` > > > as well as point the CC variable to the installed gcc, and > > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable. > > > > > > # How the parallelism works in this project > > > > > > In LTO, the Whole Program Analysis decides how to partition the > > > callgraph for running the LTRANS stage in parallel. This project > > > works very similar to this, however with some changes. > > > > > > The first was to modify the LTO structure so that it accepts > > > the compilation without IR streaming to files. This avoid an IO > > > overhead when compiling in parallel. > > > > > > The second was to use a custom partitioner to find which nodes > > > should be in the same partition. This was mainly done to bring COMDAT > > > together, as well as symbols that are part of other symbols, and even > > > private symbols so that we do not output hidden global symbols. > > > > > > However, experiment showed that bringing private symbols together did > > > not yield a interesting speedup on some large files, and therefore > > > we implemented two modes of partitioning: > > > > > > 1. Partition without static promotion. This is the safer method to use, > > > as we do not modify symbols in the Compilation Unit. This may lead to > > > speedups in files that have multiple entries points with low > > > connectivity between then (such as insn-emit.c), however this will not > > > provide speedups when this hypothesis is not true (gimple-match.c is an > > > example of this). This is the default mode. > > > > > > 2. Partition with static promotion to global. This is a more aggressive > > > method, as we can decide to promote some functions to global to increase > > > parallelism opportunity. This also will change the final assembler name > > > of the promoted function to avoid collision with functions of others > > > Compilation Units. To use this mode, the user has to manually specify > > > --param=promote-statics=1, as they must be aware of this. > > > > > > Currently, partitioner 2. do not account the number of nodes to be > > > promoted. Implementing this certainly will reduce impact on produced > > > code. > > > > > > ## Jobserver Integration > > > > > > We implemented a interface to communicate with the GNU Make's Jobserver > > > that is able to detect when the GNU Make Jobserver is active, thanks to > > > Nathan Sidwell. This works as follows: > > > > > > When -fparallel-jobs=jobserver is provided, GCC will try to detect if > > > there is a running Jobserver in which we can communicate to. If true, > > > we return the token that Make originally gave to us, then we wait for > > > make for a new token that, when provided, will launch a forked child > > > process with the partition information; or fall back to default > > > compilation if Jobserver is not detected with a warning. Now if > > > -fparallel-jobs=auto, this warning will not be provided and the default > > > number of jobs will be set to 2, unless a single core CPU is detected. > > > For more information about the implementation, check gcc/jobserver.cc > > > file in the autopar_devel branch. > > > > > > ## Speedups > > > > > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U, > > > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376. > > > As a comparison, the theoretical maximum speedup by parallelizing this > > > part of compilation is 4x, so there are things that can be improved > > > here. > > > > > > Results are shown in the following table. There are three columns: > > > > > > * Sequential, which means running the compilation with just "-c -O2". > > > * Without Static Promotion, which means appending > > > -fparallel-jobs=<NUM_THREADS> > > > * With Static Promotion, which means appending > > > -fparallel-jobs=<NUM_THREADS> > > > and --param=promote-statics=1 > > > * Speedup, which is Sequential / MIN (W/O Static Promotion, Static > > > Promotion) > > > * LTO Emulated Parallelism, which consist of running a two step: > > > g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \ > > > g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS> > > > > > > The test was the result of a single execution with a previous warm up > > > execution. The compiled GCC had checking enabled, and therefore release > > > version might have better timings in both sequential and parallel, but the > > > speedup may remain the same. > > > > > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads) > > > | | | Without Static | With Static | | > > > LTO Emulated | > > > | File | Sequential | Promotion | Promotion | Speedup | > > > Parallelism | > > > |----------------|------------|----------------|-----------------------|--------------| > > > | gimple-match.c | 60s | 63s | 34s | 1.7x | > > > 32s | > > > | insn-emit.c | 37s | 19s | 20s | 1.9x | > > > 20s | > > > > > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads) > > > | | | Without Static | With Static | | > > > LTO Emulated | > > > | File | Sequential | Promotion | Promotion | Speedup | > > > Parallelism | > > > |----------------|------------|----------------|-----------------------|--------------| > > > | gimple-match.c | 2m42s | 2m43s | 49s | 3.3x | > > > 46s | > > > | insn-emit.c | 1m35s | 30s | 30s | 3.1x | > > > 32s | > > > > > > There was also a reduction from 35m32s to 29m30s when bootstrapping > > > GCC when -fparallel-jobs=8 on the Opteron machine. > > > > > > ## Conclusions > > > > > > This project can still be improved on a lot of places. For example, > > > just by looking at the table above, we can conclude that: > > > > > > * Promoting statics to globals can increase parallelism opportunity on > > > compilation. > > > * Even with the IO Streaming overhead, the LTO Emulated Parallelism > > > was faster than with our method. This means that either our > > > partitioner is slow or we are losing time into something which can > > > be improved. > > > * This can be used as a way to reduce overall compilation time in > > > many-core machines. > > > > > > ## Fixed bugs from phase 2: > > > > > > * Make -fparallel-jobs=jobserver automatically detect if jobserver is > > > present. > > > * Fix a bug which caused the resulting object file to not be the same > > > in every compilation, therefore making bootstrap comparison work as > > > expected. > > > * Fix a bug which caused LTO to crash due to a uninitialized output > > > FILE pointer. > > > > > > ## TODOs: > > > > > > * Maybe increase minimal partition size (--param=lto-min-partitions). > > > * Improve checking for unreasonable bias (unbalanced partitons). > > > * Use cost of partitioning symbols into account. > > > * Avoid forking for the first (or last) partitions. > > > * Modify the driver to use the SPEC language instead of injecting > > > commands in the execute () function. > > > * Use the main process to write the assembler files instead of the > > > worker process. This removes the requirement for sorting later in > > > the driver. > > > * Remove hidden "-fPIC" and "fPIE" from the driver. > > > * Check for race condition in additional_asm_file, as it is created > > > later in the compilation process. > > > * Replace all array allocations of size greater or equal the number > > > of nodes in the graph from the stack to the heap (i.e. avoid alloca). > > > * Use 64-bit integers when computing sizes. > > > * Check for ways to reduce the time spent inside the kernel. > > > * Add support to every GCC frontend. > > > * Rely on ipa_summary instead of setting a custom cost to nodes > > > without summary. > > > * Merge the current partitioning logic with the LTO logic. > > > * Merge compute_boundary function with what is already done in LTO. > > > > > > Cosmetic: > > > > > > * Change n to num_nodes when that is related to the number of nodes of > > > the graph. > > > * Explain better what the compression[] array does in lto-partition.c > > > > > > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html