On Fri, 13 Mar 2020, Giuliano Belinassi wrote: > Hi, all > > I want to propose and apply for the following GSoC project: Automatic > Detection of Parallel Compilation Viability. > > Here is the proposal, and I am attaching a pdf file for better > readability: > > **Automatic Detection of Parallel Compilation Viability** > > [Giuliano Belinassi]{style="color: darkgreen"}\ > Timezone: GMT$-$3:00\ > University of São Paulo -- Brazil\ > IRC: giulianob in \#gcc\ > Email: [`giuliano.belina...@usp.br`](mailto:giuliano.belina...@usp.br)\ > Github: <https://github.com/giulianobelinassi/>\ > > About Me: Computer Science Bachelor (University of São Paulo), currently > pursuing a Masters Degree in Computer Science at the same institution. > I've always been fascinated by topics such as High-Performance Computing > and Code Optimization, having worked with a parallel implementation of a > Boundary Elements Method software in GPU. I am currently conducting > research on compiler parallelization and developing the > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having > already presented it in [GNU Cauldron > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q). > > **Skills**: Strong knowledge in C, Concurrency, Shared Memory > Parallelism, Multithreaded Debugging and other typical programming > tools. > > Brief Introduction > > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that > parallelizing the Intra Procedural optimizations improves speed when > compiling huge files by a factor of 1.8x in a 4 cores machine, and also > showed that this takes 75% of compilation time. > > In this project we plan to use the LTO infrastructure to improve > compilation performance in the non-LTO case, with a tradeoff of > generating a binary as good as if LTO is disabled. Here, we will > automatically detect when a single file will benefit from parallelism, > and proceed with the compilation in parallel if so. > > Use of LTO > > The Link Time Optimization (LTO) is a compilation technique that allows > the compiler to analyse the program as a whole, instead of analysing and > compiling one file at time. Therefore, LTO is able to collect more > information about the program and generate a better optimization plan. > LTO is divided in three parts: > > - *LGEN (Local Generation)*: Each file is translated to GIMPLE. This > stage runs sequentially in each file and, therefore, in parallel in > the project compilation. > > - *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis > (IPA) in the entire program. This state runs serially in the > project. > > - *LTRANS (Local Transformation)*: Execute all Intra Procedural > Optimizations in each partition. This stage runs in parallel. > > Since WPA can bottleneck the compilation because it runs serially in the > entire project, LTO was designed to produce faster binaries, not to > produce binaries fast. > > Here, the proposed use of LTO to address this problem is to run the IPA > for each Translation Unit (TU), instead in the Whole Program, and > automatically detect when to partition the TU into multiple LTRANS to > improve performance. The advantage of this approach is:
"to improve compilation performance" > - It can generate binaries as good as when LTO is disabled. > > - It is faster, as we can partition big files into multiple partitions > and compile these partitions in parallel > > - It can interact with GNU Make Jobserver, improving CPU utilization. The previous already improves CPU utilization, I guess GNU make jobserver integration avoids CPU overcommit. > Planned Tasks > > I plan to use the GSoC time to develop the following topics: > > - Week \[1, 3\] -- April 27 to May 15:\ > Update `cc1`, `cc1plus`, `f771`, ..., to partition the data after > IPA analysis directly into multiple LTRANS partitions, instead of > generating a temporary GIMPLE file. To summarize in my own words: After IPA analysis partition the CU into possibly multiple LTRANS partitions even for non-LTO compilations. Invoke LTRANS compilation for partitions 2..n without writing intermediate IL through mechanisms like forking. I might say that you could run into "issues" here with asm_out_file already opened and partially written to. Possibly easier (but harder on the driver side) would be to stream LTO LTRANS IL for partitions 2..n and handle those like with regular LTO operation. But I guess I'd try w/o writing IL first and only if it turns out too difficult go the IL writing way. > - Week \[4, 7\] -- May 18 to June 12:\ > Update the `gcc` driver to take these multiple LTRANS partitions, > then call the compiler and assembler for each of them, and merge the > results into one object file. Here I will use the LTO LTRANS object > streaming, therefore it should interact with GNU Make Jobserver. Hmm, so if you indeed want to do that as second step the first step would still need driver modifications to invoke the assembler. I think in previous discussions I suggested to have the driver signal cc1 and friends via a special -fsplit-tu-to-asm-outputs=<tempfile> argument that splitting is desirable and that the used output assembler files should be written to <tempfile> so the driver can pick them up for assembling and linking. You also miss the fact that the driver also needs to invoke the linker to merge the N LTRANS objects back to one. I suggest you first ignore the jobserver and try doing without LTRANS IL streaming. I think meanwhile lto1 got jobserver support for the WPA -> LTRANS streaming so you can reuse that for jobserver aware "forking" (and later assembling in the driver). Using a named pipe or some other mechanism might also allow to pick up assembler output for the individual units as it becomes ready rather than waiting for the slowest LTRANS unit to finish compiling. > - Week 8 -- June 15 to 19: **First Evaluation**\ > Deliver a non-optimized version of the project. Some programs ought > to be compiled correctly, but probably there will be a huge overhead > because so far there will not be any criteria about when to > partition. Some tests are also planned for this evaluation. > > - Week \[9, 11\] -- June 22 to July 10:\ > Implement a criteria about when to partition, and interactively > improve it based on data. I think this should be already there (though we error on the side of generating "more" partitions). For non-LTO parallelizing operation we maybe want to tune the various --params that are available though (lto-min-partition and lto-partitions). So I'd suggest to concentrate on the jobserver integration for the second phase? Otherwise the proposal looks good and I'm confident we can deliver something that will be ready for real-world usage for GCC 11! Thanks, Richard. > - Week 12 -- July 13 to 17: **Second Evaluation**\ > Deliver a more optimized version of the project. Here we should > filter files that would compile fast from files that would require > partitioning, and therefore we should see some speedup. > > - Week \[13, 15\] -- July 20 to August 10:\ > Develop adequate tests coverage and address unexpected issues so > that this feature can be merged to trunk for the next GCC release. > > - Week 16: **Final evaluation**\ > Deliver the final product as a series of patches for trunk. > > > Thank you > Giuliano. > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)