Hi Sam. Impressive. Is this all due to no object creation overhead?
Ter
On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> Hi Dr. Parr,
>
> I revisited my old “slim parsing” work to again measure the performance
> difference against Lexer/CommonToken. Currently, SlimLexer/SlimToken has a
> limitation that it only stores type, channel, startIndex, and stopIndex. Each
> of these is limited to 16 bits. Originally I planned to use this for syntax
> highlighting, where I can work within those bounds. Now the basic metrics.
> These were tested on the following 4-function calculator lexer.
>
> tokens {
> MUL='*';
> DIV='/';
> MOD='%';
> ADD='+';
> SUB='-';
> }
>
> IDENTIFIER
> : ('a'..'z' | 'A'..'Z' | '_')
> ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> ;
>
> NUMBER
> : '0'..'9'+
> ;
>
> WS
> : (' ' | '\t' | '\n' | '\r' | '\f')*
> {$channel = Hidden;}
> ;
>
> Memory – CommonToken (32-bit system):
> · 8 bytes overhead for being a class
> · 36 bytes overhead for member variables
>
> Memory – CommonToken (64-bit system):
> · 16 bytes overhead for being a class (I believe that’s the object
> header size)
> · 44 bytes overhead for members
>
> Memory – SlimToken (32- or 64-bit systems):
> · 8 bytes total storage, and no allocations since it’s a value type.
>
> Lexer speed – CommonToken:
> · Total time: 10.34s
> · Rate: 2.71 mil tokens/sec
>
> Lexer speed – SlimToken:
> · Total time 2.87s
> · Rate: 9.76 mil tokens/sec
>
> My goal is to add enough CommonToken features back to SlimToken to make it
> usable without breaking its performance characteristics. To do so, I’m
> working on a new revision of SlimLexer that holds a ShortToken (backed by
> 32-bit int) or LongToken (backed by 64-bit int) (the lexer is generic in C#).
> The token itself stores its type (low 8-bits of ShortToken, 16-bits of
> LongToken), a flag of whether it’s on the default channel or not (+/-), and
> 23- or 47-bits for the token index). As the lexer runs, it builds B-tree
> indexes for line lengths, token offset and (with token lengths derived). It
> also holds a map from Token->string so that it only has to track text when
> necessary. This gives O(1) access to the values that drive decision making
> (with (value & 0xF) giving the token type for ShortToken), and O(log_b(n))
> access to other values. I expect to see a great improvement in performance
> with a very practical token for real parsing tasks.
>
> Sam
_______________________________________________
antlr-dev mailing list
[email protected]
http://www.antlr.org/mailman/listinfo/antlr-dev