From the GCC Wiki, Speedup Areas page, Strings/identifiers section:
/3. Replace identifier hash table with a better data structure (have
already tried a ternary tree, it's not faster; could try to code it even
cleverer than it already was; B* trees might be worth looking into)/
A different approach:
Separate the "handle identifier characteristics" process from the "which
identifier is this string?" process.
Identifier characteristics are usually handled as a table, with
characteristics as fields; e.g. undefined/defined,
float/double/int/long, structure/not-structure. The table key is
usually the row number itself.
As I understand it, in GCC now, when the parser encounters an identifier
string, it first looks up the string using a hash table, returning the
internal key for the identifier. Then it uses the key and the
identifier table to handle characteristics.
This is costly. Maintaing a hash table using insertions and deletions
is costly.
Stopping all parsing, and using memory storage for hashing and hash
tables, is also expensive. If they are used, they will be brought into
the cache. Information needed when the identification is over will be
paged out. This is true, no matter which method is used - hashing,
ternary trees, or B* trees.
A two-pass approach might be significantly faster.
In the first pass, each identifier is parsed out, and appended to the
unsorted list of all identifier instances, matched with its location -
(a location id might be simply the count of identifier instances, e.g.
the 1,024th identifier instance , or physical location in the source
text.)
The list of identifier instances, with their matching locations, are
then sorted en-mass.
A unique, numeric, ID is assigned to each unique identifier string, and
a cross-reference table is populated. For each identifier instance in
the source, there is an entry containing the unique identifier ID.
In the second pass, the identifier instance ID is the key to return the
unique identifier ID. By its nature, this table will be processed in
sequence, once.
Collisions are guaranteed. They are not show stoppers. When an
identifier is re-defined in a sub-section, the characteristics for the
previous definition will have to be temporarily stored somewhere else,
and restored at the conclusion of the subsection.
Benefits:
1. All the sorting can be done at one time, using the most efficient
sort - Radix/Bucket.
2. The second parsing pass will be considerably more efficient because
it will not have to be continually interrupted to do a
hash-search-insertion.
3. The larger the program, e.g. whole Kernel compiles, or GCC compiles,
the more efficient the sort will be. Compiler perfomance is often
judged by how fast it compiles the largest programs.
Drawbacks:
1. The unique ID table will be large.
2. Two parsing passes will be assumed to be less efficient, before this
is ever tried.
Ben White