Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

Richard Biener Wed, 18 Mar 2020 04:45:40 -0700

On Tue, 17 Mar 2020, Giuliano Belinassi wrote:

> Hi, Richi
> 
> Thank you for your review!
> 
> On 03/16, Richard Biener wrote:
> > On Fri, 13 Mar 2020, Giuliano Belinassi wrote:
> > 
> > > Hi, all
> > > 
> > > I want to propose and apply for the following GSoC project: Automatic
> > > Detection of Parallel Compilation Viability.
> > > 
> > > Here is the proposal, and I am attaching a pdf file for better
> > > readability:
> > > 
> > > **Automatic Detection of Parallel Compilation Viability**
> > > 
> > > [Giuliano Belinassi]{style="color: darkgreen"}\
> > > Timezone: GMT$-$3:00\
> > > University of São Paulo -- Brazil\
> > > IRC: giulianob in \#gcc\
> > > Email: [`[email protected]`](mailto:[email protected])\
> > > Github: <https://github.com/giulianobelinassi/>\
> > > 
> > > About Me: Computer Science Bachelor (University of São Paulo), currently
> > > pursuing a Masters Degree in Computer Science at the same institution.
> > > I've always been fascinated by topics such as High-Performance Computing
> > > and Code Optimization, having worked with a parallel implementation of a
> > > Boundary Elements Method software in GPU. I am currently conducting
> > > research on compiler parallelization and developing the
> > > [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc) project, having
> > > already presented it in [GNU Cauldron
> > > 2019](https://www.youtube.com/watch?v=jd6R3IK__1Q).
> > > 
> > > **Skills**: Strong knowledge in C, Concurrency, Shared Memory
> > > Parallelism, Multithreaded Debugging and other typical programming
> > > tools.
> > > 
> > > Brief Introduction
> > > 
> > > In [ParallelGcc](https://gcc.gnu.org/wiki/ParallelGcc), we showed that
> > > parallelizing the Intra Procedural optimizations improves speed when
> > > compiling huge files by a factor of 1.8x in a 4 cores machine, and also
> > > showed that this takes 75% of compilation time.
> > > 
> > > In this project we plan to use the LTO infrastructure to improve
> > > compilation performance in the non-LTO case, with a tradeoff of
> > > generating a binary as good as if LTO is disabled. Here, we will
> > > automatically detect when a single file will benefit from parallelism,
> > > and proceed with the compilation in parallel if so.
> > > 
> > > Use of LTO
> > > 
> > > The Link Time Optimization (LTO) is a compilation technique that allows
> > > the compiler to analyse the program as a whole, instead of analysing and
> > > compiling one file at time. Therefore, LTO is able to collect more
> > > information about the program and generate a better optimization plan.
> > > LTO is divided in three parts:
> > > 
> > > -   *LGEN (Local Generation)*: Each file is translated to GIMPLE. This
> > >     stage runs sequentially in each file and, therefore, in parallel in
> > >     the project compilation.
> > > 
> > > -   *WPA (Whole Program Analysis)*: Run the Inter Procedural Analysis
> > >     (IPA) in the entire program. This state runs serially in the
> > >     project.
> > > 
> > > -   *LTRANS (Local Transformation)*: Execute all Intra Procedural
> > >     Optimizations in each partition. This stage runs in parallel.
> > > 
> > > Since WPA can bottleneck the compilation because it runs serially in the
> > > entire project, LTO was designed to produce faster binaries, not to
> > > produce binaries fast.
> > > 
> > > Here, the proposed use of LTO to address this problem is to run the IPA
> > > for each Translation Unit (TU), instead in the Whole Program, and
> > > automatically detect when to partition the TU into multiple LTRANS to
> > > improve performance. The advantage of this approach is:
> > 
> > "to improve compilation performance"
> > 
> > > -   It can generate binaries as good as when LTO is disabled.
> > > 
> > > -   It is faster, as we can partition big files into multiple partitions
> > >     and compile these partitions in parallel
> > > 
> > > -   It can interact with GNU Make Jobserver, improving CPU utilization.
> > 
> > The previous already improves CPU utilization, I guess GNU make jobserver
> > integration avoids CPU overcommit.
> > 
> > > Planned Tasks
> > > 
> > > I plan to use the GSoC time to develop the following topics:
> > > 
> > > -   Week \[1, 3\] -- April 27 to May 15:\
> > >     Update `cc1`, `cc1plus`, `f771`, ..., to partition the data after
> > >     IPA analysis directly into multiple LTRANS partitions, instead of
> > >     generating a temporary GIMPLE file.
> > 
> > To summarize in my own words:
> > 
> >   After IPA analysis partition the CU into possibly multiple LTRANS 
> >   partitions even for non-LTO compilations. Invoke LTRANS compilation
> >   for partitions 2..n without writing intermediate IL through mechanisms
> >   like forking.
> > 
> > I might say that you could run into "issues" here with asm_out_file
> > already opened and partially written to.  Possibly easier (but harder
> > on the driver side) would be to stream LTO LTRANS IL for partitions
> > 2..n and handle those like with regular LTO operation.  But I guess
> > I'd try w/o writing IL first and only if it turns out too difficult
> > go the IL writing way.
> 
> Ok. I changed the application text based on that.
> 
> > 
> > > -   Week \[4, 7\] -- May 18 to June 12:\
> > >     Update the `gcc` driver to take these multiple LTRANS partitions,
> > >     then call the compiler and assembler for each of them, and merge the
> > >     results into one object file. Here I will use the LTO LTRANS object
> > >     streaming, therefore it should interact with GNU Make Jobserver.
> > 
> > Hmm, so if you indeed want to do that as second step the first step
> > would still need driver modifications to invoke the assembler.  I think
> > in previous discussions I suggested to have the driver signal cc1 and 
> > friends via a special -fsplit-tu-to-asm-outputs=<tempfile> argument that 
> > splitting is desirable and that the used output assembler files should
> > be written to <tempfile> so the driver can pick them up for assembling
> > and linking.
> > 
> > You also miss the fact that the driver also needs to invoke the linker
> > to merge the N LTRANS objects back to one.
> 
> Actually I tried to avoid entering in such technical detail here, but
> it indeed makes the proposal more concrete.
> 
> > 
> > I suggest you first ignore the jobserver and try doing without
> > LTRANS IL streaming.  I think meanwhile lto1 got jobserver support
> > for the WPA -> LTRANS streaming so you can reuse that for jobserver
> > aware "forking" (and later assembling in the driver).  Using
> > a named pipe or some other mechanism might also allow to pick up
> > assembler output for the individual units as it becomes ready rather
> > than waiting for the slowest LTRANS unit to finish compiling.
> 
> Ok. I updated the proposal with this information.
> 
> > 
> > > -   Week 8 -- June 15 to 19: **First Evaluation**\
> > >     Deliver a non-optimized version of the project. Some programs ought
> > >     to be compiled correctly, but probably there will be a huge overhead
> > >     because so far there will not be any criteria about when to
> > >     partition. Some tests are also planned for this evaluation.
> > > 
> > > -   Week \[9, 11\] -- June 22 to July 10:\
> > >     Implement a criteria about when to partition, and interactively
> > >     improve it based on data.
> > 
> > I think this should be already there (though we error on the side of
> > generating "more" partitions).  For non-LTO parallelizing operation
> > we maybe want to tune the various --params that are available
> > though (lto-min-partition and lto-partitions).
> 
> This is interesting. Here at the Lab we have a student which developed
> experiment design model to predict how some parameters can impact in the
> final result. Could you please give me more details about these
> parameters?


--param lto-partitions specifies the maximum number of partitions
to create and --param lto-min-partition specifies the minimum size
a partition needs to have to be considered for splitting.

> > 
> > So I'd suggest to concentrate on the jobserver integration for the
> > second phase?
> 
> I just changed the proposal to focus in jobserver integration at this
> stage.
> 
> > 
> > Otherwise the proposal looks good and I'm confident we can deliver
> > something that will be ready for real-world usage for GCC 11!
> 
> Thank you :)
> Giuliano.
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > -   Week 12 -- July 13 to 17: **Second Evaluation**\
> > >     Deliver a more optimized version of the project. Here we should
> > >     filter files that would compile fast from files that would require
> > >     partitioning, and therefore we should see some speedup.
> > > 
> > > -   Week \[13, 15\] -- July 20 to August 10:\
> > >     Develop adequate tests coverage and address unexpected issues so
> > >     that this feature can be merged to trunk for the next GCC release.
> > >
> > > -   Week 16: **Final evaluation**\
> > >     Deliver the final product as a series of patches for trunk.
> > > 
> > > 
> > > Thank you
> > > Giuliano.
> > > 
> > 
> > -- 
> > Richard Biener <[email protected]>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

Re: [GSoC 2020] Automatic Detection of Parallel Compilation Viability

Reply via email to