Re: [PEG] [ANN] ParseKit

Juancarlo Añez Sat, 18 May 2013 11:57:51 -0700

Ter,


 Just a note on why I reluctantly had to opt out of ANTLR and seek PEG for
>> my ongoing projects:
>>
>>    - Software AG Natural allows relevant language words to be used as
>>    identifiers. In fact, words are keywords or not depending on context.
>>
>> Hi! What specific weakness in v3 causes this to be invalid? Is it just a
> predicate?
>

In ANTLR, one you define a word in the tokenizer, it always behaves as a
keyword.

Alternatives like using:

: WORD {WORD.text == "MY_KEYWORD"}?


Make the language too ambiguous in ANTLR terms.

In PEG, you can do exactly the above, with a semantic action, because
there's no ambiguity by PEG's definition.




>>    -
>>    - Natural also makes many of the relevant language words optional.
>>    The predictive ANTLR *v3* grammar was inundated with lookaheads to
>>    resolve the ambiguity caused by that.
>>
>> I see. Wasn't the PEG during the same thing? Note that if you set k=1
> option then ANTLR v3 more or less have to backtrack constantly, leading it
> to behave exactly like a PEG.
>

It's not just the backtracking which distinguishes PEG prom predicting
parser. It is the ability to have the parser behave differently depending
on context.

PEG easily allows for embedded languages within one grammar (the case for
Natural FORMAT statement, and COBOL's PICTURE statement). In a predictive
parser, one language would interfere with the parsing of the other, unless
you do heavy magic with the lexer to separate the sub-languages, and build
a different grammar for each sub-language.


>
>>    - Natural has exactly one statement that is not led by a keyword:
>>    assignment. And most statements lack terminators, so ANTLR would consume
>>    the left hand side of an assignment unless a negative lookahead was added
>>    to most rules.
>>    - COBOL programs embed subprograms in CICS and SQL, and those must be
>>    parsed to have a complete translation.
>>
>>
>  this sounds like you need a scannerless parser. v4 would handle this no
> problem, if you're willing to treat each character as a token.
>

I actually need a scanner that takes care of white-space, comments,
case-insensitivity, and word boundaries. The grammars would be very
convoluted without one, and we're talking about grammars which are several
thousand lines, even when they don't contain semantic actions.


>
>
>> It was time-consuming and difficult to deal with the above with a
>> mandated tokenizer, and with v3's allergy to apparent ambiguity (I couldn't
>> try v4 because the Python runtime wasn't ready).
>>
>> I know. They are both stupid programming languages... but they are real,
>> and there are millions of lines of source code written in them and still
>> alive. PEG deals with both easily. You may be right in that cases like
>> these are rare, though.
>>
>
>  I'm not clear on how a PEG has an advantage here; could you be more
> specific?
>

This is a simple rule from the Natural grammar in ANTLR:

lvalue
    :
    indexed
    (
        (line_ref) => line_ref
        |
        (label_ref) => label_ref
        |
        () -> indexed
    )
    ;


I think you can imagine what the need to explicitely remove the ambiguity
reported by ANTLR leads to in more complex statements. This is how the
original rule looked like:

lvalue
    :
    indexed (line_ref|label_ref)?
    ;


I don't know the ANTLR's internals, but I speculate that what happens is
that all the optional stuff in Natural throws off the calculation of FOLLOW
sets, and that forces you to provide the predictive sequences explicit.

This is what the rules in Grako's grammar syntax:

lvalue
    =
    indexed [line_ref|label_ref]
    ;


Another problem was that:

   1. Statements Natural have optional termination tokens,
   2. Many statements act on lists of expressions.
   3. There an assignment statement of the form: lvalue ':=' expression.

In PEG, you solve the problem caused by the above by introducing a negative
lookeahead at the end of lists. In ANTLR that doesn't work because the
parser is already committed to continue the list by the time it sees the
negative lookahead and/or because negatives do consume input.

atom_list
    :
    (~(assignment) atom)+
    ;


The above is not possible, so one winds up with many rules like this:

end_transaction_params
    :
    (lvalue ':=')=> ()
    |
    (atom)=> atom end_transaction_params
    |
    ()
    ;

In Grako, the above would be:

end_transaction_params
    =
    {atom !(lvalue ':=')}
    ;


Cheers,

-- 
Juancarlo *Añez*

_______________________________________________
PEG mailing list
[email protected]
https://lists.csail.mit.edu/mailman/listinfo/peg

Re: [PEG] [ANN] ParseKit

Reply via email to