[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-11-03 Thread Pablo Galindo Salgado
Just be aware that the C tokenizer interface is NOT a public interface and is there only so we can test the C tokenizer itself. This can and will break at any point without previous warning in any way. Pablo Galindo Salgado On 3 Nov 2022, at 18:10, David J W wrote: Following up, Pablo spotted my pro

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-11-03 Thread David J W
Following up, Pablo spotted my problem with the mixup of NL & NEWLINE tokens. I was using tokenize.py in CPython's stdlib with a simple Python script to build ridiculously strict unit tests. My solution to that problem was originally to figure out how to access CPython's internal C tokenizer but

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Matthieu Dartiailh
If you look at pegen, which uses the stdlib tokenizer as input, you will see that the object used to implement memoization on top of a token stream simply swallows NL (https://github.com/we-like-parsers/pegen/blob/main/src/pegen/tokenizer.py#L49). This is safe since NL has no syntactic meaning only
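
As a rough illustration of the idea described in this message, the sketch below drops NL tokens from the stdlib tokenizer's output before a parser would see them. It is not pegen's code; the helper name `significant_tokens` and the sample source are assumptions made up for this example.

```python
import io
import tokenize


def significant_tokens(source):
    """Yield stdlib tokens for `source`, skipping NL tokens.

    Hypothetical helper for illustration only; it is not part of pegen.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NL:
            # NL marks a line break with no syntactic meaning
            # (for example a blank line), so it is dropped here.
            continue
        yield tok


# Sample input with one blank line between two statements.
source = "x = 1\n\ny = 2\n"
names = [tokenize.tok_name[tok.type] for tok in significant_tokens(source)]
print(names)
```

After filtering, only NEWLINE tokens remain as line terminators, which is what a grammar that refers to NEWLINE would expect.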

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Matthias Görgens
Hi David, Could you share what you have so far, perhaps on GitHub or so? That way it's easier to diagnose your problems. I'm reasonably familiar with Rust. Perhaps also add a minimal crashing example? Cheers, Matthias. On Thu, 27 Oct 2022, 04:52 David J W, wrote: > Pablo, > Nl and Newline

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Pablo Galindo Salgado
Hummm… he is also mentioning NL and Newline tokens and if I recall correctly those are tokens that only appear in the Python tokenizer and are emitted differently from the C one (and therefore they are not used in the grammar). Pablo Galindo Salgado On 26 Oct 2022, at 21:57, Guido van Rossum wrote:

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Guido van Rossum
I wonder if David may be struggling with the rule that a newline is significant in the grammar unless it appears inside matching brackets/parentheses/braces? I think that's in the lexer. Similarly, multiple newlines are collapsed. On Wed, Oct 26, 2022 at 1:19 PM Pablo Galindo Salgado wrote: > Hi
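
The bracket rule mentioned in this message can be observed with the stdlib `tokenize` module. The sketch below is only an illustration using the Python-level tokenizer, not the C tokenizer CPython's parser uses; the sample source and the variable names are assumptions for this example. A line break inside parentheses and each blank line come out as NL, while only the end of a logical line produces NEWLINE.

```python
import io
import tokenize

# A statement split across lines inside parentheses, followed by
# two blank lines and a second statement.
source = "x = (1,\n     2)\n\n\ny = 3\n"

readline = io.StringIO(source).readline
kinds = [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]

newline_count = kinds.count("NEWLINE")  # ends of logical lines
nl_count = kinds.count("NL")            # non-significant line breaks
print(newline_count, nl_count)
```

A unit test that compares token streams needs to treat these two kinds separately, since only NEWLINE is referenced by the grammar.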

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread David J W
Pablo, Nl and Newline are tokens, but I am interested in NEWLINE's behavior in the Python grammar; note the casing. For example in simple_stmts @ https://github.com/python/cpython/blob/main/Grammar/python.gram#L107 Is that NEWLINE some sort of built-in rule to the grammar? In my project I am

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Pablo Galindo Salgado
Hi, As I mentioned, NEWLINE is a token. All uppercase words in the grammar are tokens and therefore are produced by the lexer, not the parser. It is not a built-in rule. In particular, that token is produced here: https://github.com/python/cpython/blob/6777e09166fc384ea0a4b50202c7b0bd7a23330c/Parser
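
One way to see the convention described in this message from Python is to check a grammar name against the stdlib `token` module. This is a minimal sketch under assumptions: the helper `is_token_name` is hypothetical, the list of names is illustrative, and the `token` module is used only as an approximation of what the C lexer emits.

```python
import token


def is_token_name(name):
    """Return True if `name` looks like a lexer token rather than a parser rule.

    Hypothetical helper: it checks the uppercase convention and whether the
    stdlib `token` module knows a token by that name.
    """
    return name.isupper() and name in token.tok_name.values()


# Illustrative names of the kind that appear in a grammar rule.
for name in ["simple_stmt", "NEWLINE", "NAME", "simple_stmts"]:
    kind = "token (from the lexer)" if is_token_name(name) else "parser rule"
    print(f"{name}: {kind}")
```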

[Python-Dev] Re: NEWLINE sentinel behavior in CPython's PEG grammar

2022-10-26 Thread Pablo Galindo Salgado
Hi, I am not sure I understand exactly what you are asking, but NEWLINE is a token, not a parser rule. What decides when NEWLINE is emitted is the lexer, which has nothing to do with PEG. Normally PEG parsers also act as tokenizers, but the one in CPython does not. Also notice that CPython’s pars