Speedup area - Strings/Identifiers

Ben White Tue, 07 Aug 2007 10:08:33 -0700

From the GCC Wiki, Speedup Areas page, Strings/identifiers section:

/3. Replace identifier hash table with a better data structure (have already tried a ternary tree, it's not faster; could try to code it even cleverer than it already was; B* trees might be worth looking into)/


A different approach:

Separate the "handle identifier characteristics" process from the "which identifier is this string?" process.

Identifier characteristics are usually handled as a table, with characteristics as fields; e.g. undefined/defined, float/double/int/long, structure/not-structure. The table key is usually the row number itself.

As I understand it, in GCC now, when the parser encounters an identifier string, it first looks up the string using a hash table, returning the internal key for the identifier. Then it uses the key and the identifier table to handle characteristics.
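To make the description above concrete, here is a minimal sketch of that scheme in Python. It is only an illustration of the idea (a hash table that maps each string to a row number, plus a characteristics table indexed by that row number); the class and field names are hypothetical and are not taken from GCC.

```python
# Hypothetical sketch of "hash lookup, then table access"; not GCC's code.
class IdentifierTable:
    def __init__(self):
        self.key_of = {}   # hash table: identifier string -> row number
        self.rows = []     # characteristics table, indexed by row number

    def intern(self, name):
        """Return the row number for name, adding a new row on first sight."""
        key = self.key_of.get(name)
        if key is None:
            key = len(self.rows)
            self.key_of[name] = key
            self.rows.append({"name": name, "defined": False, "type": None})
        return key


table = IdentifierTable()
# Every identifier instance pays for one hash lookup while parsing.
keys = [table.intern(tok) for tok in ["count", "total", "count"]]
# Characteristics are then handled through the returned key.
table.rows[keys[0]]["defined"] = True
```

The point to notice is that the lookup in intern() happens once per identifier instance, interleaved with parsing.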

This is costly. Maintaining a hash table using insertions and deletions is costly. Stopping all parsing, and using memory storage for hashing and hash tables, is also expensive. Whenever the hash tables are used, they are brought into the cache, and information that is needed once the identification is over gets paged out. This is true no matter which method is used: hashing, ternary trees, or B* trees.


A two-pass approach might be significantly faster.

In the first pass, each identifier is parsed out and appended to an unsorted list of all identifier instances, matched with its location. (A location ID might simply be the count of identifier instances, e.g. the 1,024th identifier instance, or the physical location in the source text.) The list of identifier instances, with their matching locations, is then sorted en masse.

A unique numeric ID is assigned to each unique identifier string, and a cross-reference table is populated. For each identifier instance in the source, there is an entry containing the unique identifier ID.

In the second pass, the identifier instance ID is the key to return the unique identifier ID. By its nature, this table will be processed in sequence, once.
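A rough sketch of the two passes, under some loud assumptions: the regular expression stands in for a real lexer (it also picks up keywords), the instance count is used as the location ID, and Python's built-in sort stands in for the radix/bucket sort. The function names are made up for the illustration.

```python
import re


def first_pass(source):
    """Collect (identifier string, instance number) pairs in source order."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return [(name, i) for i, name in enumerate(names)]


def build_cross_reference(instances):
    """Sort all instances at once, then give each distinct string one ID."""
    xref = [None] * len(instances)   # instance number -> unique identifier ID
    unique_names = []                # unique identifier ID -> string
    previous = None
    for name, location in sorted(instances):
        if name != previous:         # first instance of a new string
            unique_names.append(name)
            previous = name
        xref[location] = len(unique_names) - 1
    return xref, unique_names


source = "total = count + total"
xref, unique_names = build_cross_reference(first_pass(source))
# xref is now [1, 0, 1]; unique_names is ['count', 'total']
```

The second pass would then read xref front to back, once, taking the next entry each time the parser reaches an identifier, with no string lookup at all.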

Collisions (the same identifier string defined in more than one scope) are guaranteed. They are not show stoppers. When an identifier is re-defined in a sub-section, the characteristics for the previous definition will have to be temporarily stored somewhere else, and restored at the conclusion of the sub-section.
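One possible way to do the store-and-restore is a stack of saved rows, one list per open sub-section. This is only a sketch; the class and method names are hypothetical, and a characteristics row is reduced to a single string to keep it short.

```python
# Hypothetical sketch: save the old row on redefinition, restore on scope exit.
class ScopedCharacteristics:
    def __init__(self, n_ids):
        self.rows = [None] * n_ids   # one row per unique identifier ID
        self.saved = []              # stack: one list of (id, old row) per scope

    def enter_scope(self):
        self.saved.append([])

    def define(self, uid, info):
        if self.saved:               # inside a sub-section: remember the old row
            self.saved[-1].append((uid, self.rows[uid]))
        self.rows[uid] = info

    def leave_scope(self):
        # Undo in reverse order so repeated redefinitions unwind correctly.
        for uid, old in reversed(self.saved.pop()):
            self.rows[uid] = old


table = ScopedCharacteristics(n_ids=1)
table.define(0, "int")       # outer definition
table.enter_scope()
table.define(0, "float")     # re-definition in a sub-section
inner = table.rows[0]
table.leave_scope()
outer = table.rows[0]        # the previous definition is back
```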


Benefits:

1. All the sorting can be done at one time, using the most efficient sort - Radix/Bucket.

2. The second parsing pass will be considerably more efficient because it will not have to be continually interrupted to do a hash-search-insertion.

3. The larger the program, e.g. whole Kernel compiles, or GCC compiles, the more efficient the sort will be. Compiler performance is often judged by how fast it compiles the largest programs.
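For reference, a radix/bucket sort over strings can be sketched as below. This is a generic most-significant-character-first version written for illustration, not a tuned implementation; a real one would probably index buckets by character code instead of using a dictionary and sorting its keys. Whether it actually beats the hash table is something only measurement can tell.

```python
def msd_radix_sort(strings, pos=0):
    """Sort strings by bucketing on the character at index pos, recursively."""
    if len(strings) <= 1:
        return list(strings)
    finished = []    # strings that end exactly at pos; they sort first
    buckets = {}     # character at pos -> strings that share it
    for s in strings:
        if len(s) == pos:
            finished.append(s)
        else:
            buckets.setdefault(s[pos], []).append(s)
    result = finished
    for ch in sorted(buckets):       # visit buckets in character order
        result.extend(msd_radix_sort(buckets[ch], pos + 1))
    return result


names = ["total", "count", "total", "i", "idx"]
ordered = msd_radix_sort(names)
```

Duplicate strings end up next to each other in the result, which is what the unique-ID assignment needs.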


Drawbacks:

1. The unique ID table will be large.

2. Two parsing passes will be assumed to be less efficient, before this is ever tried.


Ben White
