Hi, Richi. On 08/31, Richard Biener wrote: > On Fri, Aug 28, 2020 at 10:32 PM Giuliano Belinassi > <giuliano.belina...@usp.br> wrote: > > > > Hi, > > > > This is the final report of the "Automatic Parallel Compilation > > Viability" project. Please notice that this report is pretty > > similar to the delivered from the 2nd evaluation, as this phase > > consisted of mostly rebasing and bug fixing. > > > > Please, reply this message for any question or suggestion. > > Thank you for your great work Giuliano!
Thank you :) > > It's odd that LTO emulated parallelism is winning here, > I'd have expected it to be slower. One factor might > be different partitioning choices and the other might > be that the I/O required is faster than the GC induced > COW overhead after forking. Note you can optimize > one COW operation by re-using the main process for > compiling the last partition. I suppose you tested > this on a system with a fast SSD so I/O overhead is > small? The Core-i7 machine runs on a fast NVMe SSD. The Opteron machine seems to run on a RAID setup with conventional HDDs, but I am not sure how it is configured, as I have no phisical access to that machine. Thank you, Giuliano > > Thanks again, > Richard. > > > Thank you, > > Giuliano. > > > > --- 8< ----------- > > > > # Automatic Parallel Compilation Viability: Final Report > > > > ## Complete Tasks > > > > For the third evaluation, we expected to deliver the product as a > > series of patches for trunk. The patch series were in fact delivered > > [1], but several items must be fixed before merge. > > > > > > Overall, the project works and speedups ranges from 0.95x to 3.3x. > > Bootstrap is working, and therefore this can be used in an experimental > > state. > > > > ## How to use > > > > 1. Clone the autopar_devel branch: > > ``` > > git clone --single-branch --branch devel/autopar_devel \ > > git://gcc.gnu.org/git/gcc.git gcc_autopar_devel > > ``` > > 2. Follow the standard compilation options provided in the Compiling > > GCC page, and install it on some directory. For instance: > > > > ``` > > cd gcc_autopar_devel > > mkdir build && cd build > > ../configure --disable-bootstrap --enable-languages=c,c++ > > make -j 8 > > make DESTDIR=/tmp/gcc11_autopar install > > ``` > > > > 3. If you want to test whether your version is working, just launch > > Gcc with `-fparallel-jobs=2` when compiling a file with -c. > > > > 5. If you want to compile a project with this version it uses GNU > > Makefiles, you must modify the compilation rule command and prepend a > > `+` token to it. For example, in Git's Makefile, Change: > > ``` > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > $(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > $(EXTRA_CPPFLAGS) $< > > ``` > > to: > > ``` > > $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) > > +$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(ALL_CFLAGS) > > $(EXTRA_CPPFLAGS) $< > > ``` > > as well as point the CC variable to the installed gcc, and > > append a `-fparallel-jobs=jobserver` on your CFLAGS variable. > > > > # How the parallelism works in this project > > > > In LTO, the Whole Program Analysis decides how to partition the > > callgraph for running the LTRANS stage in parallel. This project > > works very similar to this, however with some changes. > > > > The first was to modify the LTO structure so that it accepts > > the compilation without IR streaming to files. This avoid an IO > > overhead when compiling in parallel. > > > > The second was to use a custom partitioner to find which nodes > > should be in the same partition. This was mainly done to bring COMDAT > > together, as well as symbols that are part of other symbols, and even > > private symbols so that we do not output hidden global symbols. > > > > However, experiment showed that bringing private symbols together did > > not yield a interesting speedup on some large files, and therefore > > we implemented two modes of partitioning: > > > > 1. Partition without static promotion. This is the safer method to use, > > as we do not modify symbols in the Compilation Unit. This may lead to > > speedups in files that have multiple entries points with low > > connectivity between then (such as insn-emit.c), however this will not > > provide speedups when this hypothesis is not true (gimple-match.c is an > > example of this). This is the default mode. > > > > 2. Partition with static promotion to global. This is a more aggressive > > method, as we can decide to promote some functions to global to increase > > parallelism opportunity. This also will change the final assembler name > > of the promoted function to avoid collision with functions of others > > Compilation Units. To use this mode, the user has to manually specify > > --param=promote-statics=1, as they must be aware of this. > > > > Currently, partitioner 2. do not account the number of nodes to be > > promoted. Implementing this certainly will reduce impact on produced > > code. > > > > ## Jobserver Integration > > > > We implemented a interface to communicate with the GNU Make's Jobserver > > that is able to detect when the GNU Make Jobserver is active, thanks to > > Nathan Sidwell. This works as follows: > > > > When -fparallel-jobs=jobserver is provided, GCC will try to detect if > > there is a running Jobserver in which we can communicate to. If true, > > we return the token that Make originally gave to us, then we wait for > > make for a new token that, when provided, will launch a forked child > > process with the partition information; or fall back to default > > compilation if Jobserver is not detected with a warning. Now if > > -fparallel-jobs=auto, this warning will not be provided and the default > > number of jobs will be set to 2, unless a single core CPU is detected. > > For more information about the implementation, check gcc/jobserver.cc > > file in the autopar_devel branch. > > > > ## Speedups > > > > Speedups ranged from 0.95x to 1.9x on a Quad-Core Intel Core-i7 8565U, > > and from 0.99x to 3.3x when testing on a 32-Cores AMD Opteron 6376. > > As a comparison, the theoretical maximum speedup by parallelizing this > > part of compilation is 4x, so there are things that can be improved > > here. > > > > Results are shown in the following table. There are three columns: > > > > * Sequential, which means running the compilation with just "-c -O2". > > * Without Static Promotion, which means appending > > -fparallel-jobs=<NUM_THREADS> > > * With Static Promotion, which means appending -fparallel-jobs=<NUM_THREADS> > > and --param=promote-statics=1 > > * Speedup, which is Sequential / MIN (W/O Static Promotion, Static > > Promotion) > > * LTO Emulated Parallelism, which consist of running a two step: > > g++ -c <FILE> -O2 -o t.il.o -flto -fno-fat-lto-objects && \ > > g++ -o t.o t.il.o -O2 -r -flinker-output=nolto-rel -flto=<NUM_THREADS> > > > > The test was the result of a single execution with a previous warm up > > execution. The compiled GCC had checking enabled, and therefore release > > version might have better timings in both sequential and parallel, but the > > speedup may remain the same. > > > > CPU: Intel Core-i7 8565U (4 Cores, 8 Threads) > > | | | Without Static | With Static | | > > LTO Emulated | > > | File | Sequential | Promotion | Promotion | Speedup | > > Parallelism | > > |----------------|------------|----------------|-----------------------|--------------| > > | gimple-match.c | 60s | 63s | 34s | 1.7x | > > 32s | > > | insn-emit.c | 37s | 19s | 20s | 1.9x | > > 20s | > > > > CPU: 4x AMD Opteron 6376 (32 Cores, 64 Threads) > > | | | Without Static | With Static | | > > LTO Emulated | > > | File | Sequential | Promotion | Promotion | Speedup | > > Parallelism | > > |----------------|------------|----------------|-----------------------|--------------| > > | gimple-match.c | 2m42s | 2m43s | 49s | 3.3x | > > 46s | > > | insn-emit.c | 1m35s | 30s | 30s | 3.1x | > > 32s | > > > > There was also a reduction from 35m32s to 29m30s when bootstrapping > > GCC when -fparallel-jobs=8 on the Opteron machine. > > > > ## Conclusions > > > > This project can still be improved on a lot of places. For example, > > just by looking at the table above, we can conclude that: > > > > * Promoting statics to globals can increase parallelism opportunity on > > compilation. > > * Even with the IO Streaming overhead, the LTO Emulated Parallelism > > was faster than with our method. This means that either our > > partitioner is slow or we are losing time into something which can > > be improved. > > * This can be used as a way to reduce overall compilation time in > > many-core machines. > > > > ## Fixed bugs from phase 2: > > > > * Make -fparallel-jobs=jobserver automatically detect if jobserver is > > present. > > * Fix a bug which caused the resulting object file to not be the same > > in every compilation, therefore making bootstrap comparison work as > > expected. > > * Fix a bug which caused LTO to crash due to a uninitialized output > > FILE pointer. > > > > ## TODOs: > > > > * Maybe increase minimal partition size (--param=lto-min-partitions). > > * Improve checking for unreasonable bias (unbalanced partitons). > > * Use cost of partitioning symbols into account. > > * Avoid forking for the first (or last) partitions. > > * Modify the driver to use the SPEC language instead of injecting > > commands in the execute () function. > > * Use the main process to write the assembler files instead of the > > worker process. This removes the requirement for sorting later in > > the driver. > > * Remove hidden "-fPIC" and "fPIE" from the driver. > > * Check for race condition in additional_asm_file, as it is created > > later in the compilation process. > > * Replace all array allocations of size greater or equal the number > > of nodes in the graph from the stack to the heap (i.e. avoid alloca). > > * Use 64-bit integers when computing sizes. > > * Check for ways to reduce the time spent inside the kernel. > > * Add support to every GCC frontend. > > * Rely on ipa_summary instead of setting a custom cost to nodes > > without summary. > > * Merge the current partitioning logic with the LTO logic. > > * Merge compute_boundary function with what is already done in LTO. > > > > Cosmetic: > > > > * Change n to num_nodes when that is related to the number of nodes of > > the graph. > > * Explain better what the compression[] array does in lto-partition.c > > > > [1] - https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552346.html