--Rationale-- (to be expanded in the actual application)

* Parsing of C++ is slow
* On large projects, compile times often dominate the compile/edit/debug cycle
* Changes in source are often small and isolated
* Preprocessed compilation units are often huge, especially compared to the
size of the changes.
=> Recompiling only the changed part(s) should yield better compile times.

--Proposal--

I propose to implement incremental parsing in C++ (through hand-waving, C++ has
been decided to be the language most in need of incremental parsing), similar to
the approach in article [1]. Basically, a diff algorithm is run on the character
or token stream, producing a list of insertions, modifications, and deletions,
and this list of changed characters/tokens is compared to the previous parse
tree to decide what parts of the tree need reparsing and recompilation.
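To make the diff step concrete, here is a minimal sketch of a
longest-common-subsequence diff over two token streams (tokens reduced to
their spellings; a real lexer would carry kind and location information as
well). All names are invented for illustration; [1] describes more refined
incremental algorithms.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

enum class EditKind { Insert, Delete };

struct Edit {
    EditKind kind;
    std::size_t pos;    // position in the *old* token stream
    std::string token;  // the token inserted or deleted
};

// Classic O(n*m) LCS diff; fine as an illustration, too slow for real use.
std::vector<Edit> diff_tokens(const std::vector<std::string>& a,
                              const std::vector<std::string>& b) {
    const std::size_t n = a.size(), m = b.size();
    // lcs[i][j] = length of LCS of a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> lcs(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0; )
        for (std::size_t j = m; j-- > 0; )
            lcs[i][j] = (a[i] == b[j])
                ? lcs[i + 1][j + 1] + 1
                : std::max(lcs[i + 1][j], lcs[i][j + 1]);
    std::vector<Edit> edits;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) { ++i; ++j; }
        else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            edits.push_back({EditKind::Delete, i, a[i]}); ++i;
        } else {
            edits.push_back({EditKind::Insert, i, b[j]}); ++j;
        }
    }
    for (; i < n; ++i) edits.push_back({EditKind::Delete, i, a[i]});
    for (; j < m; ++j) edits.push_back({EditKind::Insert, i, b[j]});
    return edits;
}
```

Changing `int x;` to `int y;` would yield one deletion ("x") and one
insertion ("y"), leaving the surrounding tokens untouched.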

In this project, I wouldn't even try to attack incremental code generation, but
just run the entire middle/back end on an updated tree. A future extension
would be to use the information about which trees have changed to perform
minimal code generation. This approach obviously limits the improvement in
compile time, and an important initial part of the project would be to measure
the time spent in lexing, parsing and everything after parsing to see what the
potential gain would be.

--Implementation Ideas--

My implementation would store the tree representation and the token stream
from a source file in the object file as a custom section, or in an auxiliary
file. On each incremental compilation, the new source would be tokenized and
a diff algorithm run on the old and the new token stream. The saved tree and
token stream would be marked with the compiler version (and probably the
build too, depending on the way in which they are serialized); in case of a
version mismatch, the compiler would revert to a complete recompilation.
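The version check could look something like the following sketch. The header
layout, magic string, and version encoding are all assumptions for
illustration, not an existing GCC format; the only real requirement is that
any mismatch makes the saved data unusable.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical header prefixing the serialized token stream and tree in the
// auxiliary file or custom object-file section.
struct AuxHeader {
    char magic[4];           // e.g. "GITR" -- made-up identifying tag
    std::uint32_t version;   // compiler version that wrote the data
    std::uint32_t build_id;  // build stamp, if serialization is build-specific
};

constexpr std::uint32_t kCurrentVersion = 40300;  // assumed encoding "4.3.0"
constexpr std::uint32_t kCurrentBuild   = 1;

// Returns true when the saved data can be reused; false forces the compiler
// to fall back to a complete recompilation.
bool aux_data_usable(const AuxHeader& h) {
    return std::memcmp(h.magic, "GITR", 4) == 0
        && h.version == kCurrentVersion
        && h.build_id == kCurrentBuild;
}
```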

For starters, I would only consider top-level constructs (i.e. if any token in
a function, type or global-scope variable has changed, the entire declaration/
definition would be reparsed), in order to make the implementation simple.
Presumably, the time wasted reparsing the unchanged portions inside a changed
function or type is very small, since only a few declarations change at a time.

The parser would be given the two token streams and the old tree; every time
a declaration or other top-level construct is about to be parsed, it would
check whether any tokens in that declaration have changed and re-parsing is
therefore needed.
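The top-level dirtiness check might be sketched as follows: each saved
declaration records the token range it covered in the old stream, and a
declaration needs reparsing iff some changed token index falls inside that
range. The types and names are invented for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Decl {
    std::size_t first_token;  // inclusive index into the old token stream
    std::size_t last_token;   // exclusive
    bool needs_reparse;
};

// changed: indices of old-stream tokens touched by the diff (deleted or
// adjacent to an insertion).
void mark_dirty(std::vector<Decl>& decls,
                const std::set<std::size_t>& changed) {
    for (Decl& d : decls) {
        // First changed index at or after the declaration's start...
        auto it = changed.lower_bound(d.first_token);
        // ...is inside the declaration iff it lies before its end.
        d.needs_reparse = (it != changed.end() && *it < d.last_token);
    }
}
```

Only declarations whose ranges overlap a change get reparsed; everything else
is taken verbatim from the saved tree.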

--Observations--

A number of changes in the source affect more than the tree of the construct
itself (inline functions, function default arguments, templates etc.; see
Problems below). In these cases, the initial implementation would just detect
that the change is unsafe, abort, and initiate a complete recompilation of the
file. A part of this project would be to identify *all* these cases, and
handle them correctly.
Would it be feasible to construct a dependency map of the tree, to handle these
cases with minimal recompilation? How large would such a dependency map have
to be?
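One possible shape for such a dependency map: record, for each declaration,
the declarations whose trees embed information from it (inline bodies,
default arguments, template specializations), and propagate dirtiness
transitively. This is purely an assumption about how the map might look, not
an existing GCC structure; its size is exactly one of the open questions.

```cpp
#include <map>
#include <set>
#include <vector>

using DeclId = unsigned;

// deps[d] = declarations that must be reparsed whenever d changes.
using DepMap = std::map<DeclId, std::set<DeclId>>;

// Transitively collect everything invalidated by an initial dirty set
// (simple worklist traversal of the dependency edges).
std::set<DeclId> invalidated(const DepMap& deps, std::set<DeclId> dirty) {
    std::vector<DeclId> work(dirty.begin(), dirty.end());
    while (!work.empty()) {
        DeclId d = work.back();
        work.pop_back();
        auto it = deps.find(d);
        if (it == deps.end()) continue;
        for (DeclId dep : it->second)
            if (dirty.insert(dep).second)  // newly discovered dependent
                work.push_back(dep);
    }
    return dirty;
}
```

With such a map, changing an inline function would reparse its recorded
dependents and nothing else, instead of aborting to a full recompilation.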

For checking, the initial implementation should also provide for a way to
compare the result of the incremental reparse against a complete recompilation.

Some of the information that is saved and updated in the aux file or object file
is probably the same as what is saved in a GCH file. Would incremental update of
GCH files be possible/interesting? Should this all be integrated into the
precompiled header framework? I can't say I know enough about GCC's header
precompilation to know what to do with it, though.

--Problems and other thoughts--

* Changing a declaration (function arguments, default values) also affects all
uses of that declaration.
* Adding or removing a template specialization changes all uses of the template
after that point.
* If code inside an inlined function body is changed, all (inlined) uses of the
function also change.
* What other cases like these have not yet been considered?
* How much space does a serialized token stream take?
* How much space does a serialized tree representation take?
* How much time is actually spent in the parser?
* Future: Incremental code generation

--References--

[1] Incremental scanning and parsing with Galaxy. IEEE Trans. Softw.
Eng. 17, 7 (July 1991), 641-651.
(http://doi.ieeecomputersociety.org/10.1109/32.83901)


Anyone interested in mentoring this project? Any other thoughts?
