--Rationale-- (to be expanded in the actual application)
* Parsing of C++ is slow.
* On large projects, compile times often dominate the compile/edit/debug cycle.
* Changes to the source are often small and isolated.
* Preprocessed compilation units are often huge, especially compared to the size of the changes.

=> Recompiling only the changed part(s) should yield better compile times.

--Proposal--

I propose to implement incremental parsing for C++ (through hand-waving, C++ has been decided to be the language most in need of incremental parsing), similar to the approach in article [1]. Basically, a diff algorithm is run on the character or token stream, producing a list of insertions, modifications, and deletions, and this list of changed characters/tokens is compared against the previous parse tree to decide which parts of the tree need reparsing and recompilation.

In this project I would not even try to attack incremental code generation, but would simply run the entire middle/back end on the updated tree. A future extension would be to use the information about which trees have changed to perform minimal code generation. This approach obviously limits the achievable improvement in compile time, so an important initial part of the project would be to measure the time spent in lexing, parsing, and everything after parsing, to see what the potential gain would be.

--Implementation Ideas--

My implementation would store the tree representation and the token stream from a source file in the object file (as a custom section) or in an auxiliary file. On each incremental compilation, the new source would be tokenized and a diff algorithm run on the old and the new token streams. The tree and token stream would be tagged with the compiler version (and probably the build too, depending on how they are serialized); on a version mismatch the compiler would fall back to a complete recompilation.

For starters, I would only consider top-level constructs (i.e.
if any token in a function, type, or global-scope variable has changed, the entire declaration/definition would be reparsed), in order to keep the implementation simple. Presumably, the time spent needlessly parsing unchanged portions of functions or types would be very small, given the small number of changed declarations. The parser would be given the two token streams and the old tree, and every time a declaration or other top-level construct is about to be parsed, it would check whether any tokens in that declaration have changed and reparsing is needed.

--Observations--

A number of source changes affect more than the tree of the construct itself (inline functions, function default arguments, templates, etc.; see Problems below). In these cases, the initial implementation would just detect that the change is unsafe, abort, and initiate a complete recompilation of the file. Part of this project would be to identify *all* such cases and handle them correctly. Would it be feasible to construct a dependency map over the tree, to handle these cases with minimal recompilation? How large would such a dependency map have to be?

For verification, the initial implementation should also provide a way to compare the result of an incremental reparse against a complete recompilation.

Some of the information that is saved and updated in the auxiliary file or object file is probably the same as what is saved in a GCH file. Would incremental updating of GCH files be possible/interesting? Should this all be integrated into the precompiled header framework? I can't say I know enough about GCC's header precompilation to know what to do with it, though.

--Problems and other thoughts--

* Changing a declaration (function arguments, default values) also affects all uses of that declaration.
* Adding or removing a template specialization changes all uses of the template after the declaration.
* If code inside an inlined function body is changed, all (inlined) uses of the function also change.
* What other cases like these have not yet been considered?
* How much space does a serialized token stream take?
* How much space does a serialized tree representation take?
* How much time is actually spent in the parser?
* Future: incremental code generation.

--References--

[1] Incremental scanning and parsing with Galaxy. IEEE Trans. Softw. Eng. 17, 7 (July 1991), 641-651. (http://doi.ieeecomputersociety.org/10.1109/32.83901)

Anyone interested in mentoring this project? Any other thoughts?