Re: [antlr-dev] Alternative token storage mechanisms

Terence Parr Wed, 01 Dec 2010 12:50:38 -0800

Hi Sam. Impressive.  Is this all due to no object creation overhead?
Ter
On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:


> Hi Dr. Parr,
>  
> I revisited my old “slim parsing” work to again measure the performance 
> difference against Lexer/CommonToken. Currently, SlimLexer/SlimToken has a 
> limitation that it only stores type, channel, startIndex, and stopIndex. Each 
> of these is limited to 16 bits. Originally I planned to use this for syntax 
> highlighting, where I can work within those bounds. Now the basic metrics. 
> These were tested on the following 4-function calculator lexer.
>  
> tokens {
>         MUL='*';
>         DIV='/';
>         MOD='%';
>         ADD='+';
>         SUB='-';
> }
>  
> IDENTIFIER
>         :       ('a'..'z' | 'A'..'Z' | '_')
>                 ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
>         ;
>  
> NUMBER
>         :       '0'..'9'+
>         ;
>  
> WS
>         :       (' ' | '\t' | '\n' | '\r' | '\f')*
>                 {$channel = Hidden;}
>         ;
>  
> Memory – CommonToken (32-bit system):
> ·         8 bytes overhead for being a class
> ·         36 bytes overhead for member variables
>  
> Memory – CommonToken (64-bit system):
> ·         16 bytes overhead for being a class (I believe that’s the object 
> header size)
> ·         44 bytes overhead for members
>  
> Memory – SlimToken (32- or 64-bit systems):
> ·         8 bytes total storage, and no allocations since it’s a value type.
>  
> Lexer speed – CommonToken:
> ·         Total time: 10.34s
> ·         Rate: 2.71 mil tokens/sec
>  
> Lexer speed – SlimToken:
> ·         Total time 2.87s
> ·         Rate: 9.76 mil tokens/sec
>  
> My goal is to add enough CommonToken features back to SlimToken to make it 
> usable without breaking its performance characteristics. To do so, I’m 
> working on a new revision of SlimLexer that holds a ShortToken (backed by 
> 32-bit int) or LongToken (backed by 64-bit int) (the lexer is generic in C#). 
> The token itself stores its type (low 8-bits of ShortToken, 16-bits of 
> LongToken), a flag of whether it’s on the default channel or not (+/-), and 
> 23- or 47-bits for the token index). As the lexer runs, it builds B-tree 
> indexes for line lengths, token offset and (with token lengths derived). It 
> also holds a map from Token->string so that it only has to track text when 
> necessary. This gives O(1) access to the values that drive decision making 
> (with (value & 0xF) giving the token type for ShortToken), and O(log_b(n)) 
> access to other values. I expect to see a great improvement in performance 
> with a very practical token for real parsing tasks.
>  
> Sam

_______________________________________________
antlr-dev mailing list
[email protected]
http://www.antlr.org/mailman/listinfo/antlr-dev

Re: [antlr-dev] Alternative token storage mechanisms

Reply via email to