Just be aware that the C tokenizer interface is NOT a public interface and is there only so we can test the C tokenizer itself. It can and will break at any point without any prior warning.
Pablo Galindo Salgado
On 3 Nov 2022, at 18:10, David J W wrote:
Following up, Pablo spotted my problem with the mixup of NL & NEWLINE
tokens. I was using tokenize.py in CPython's stdlib with a simple Python
script to build ridiculously strict unit tests.
My solution to that problem was originally to figure out how to access
CPython's internal C tokenizer but
If you look at pegen, which uses the stdlib tokenizer as input, you will see
that the object used to implement memoization on top of a token stream
simply swallows NL (
https://github.com/we-like-parsers/pegen/blob/main/src/pegen/tokenizer.py#L49).
This is safe since NL has no syntactic meaning only
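
A minimal sketch of the "swallow NL" idea, assuming only the stdlib tokenize module. The helper name significant_tokens is hypothetical, and this is not pegen's actual code, just an illustration of dropping NL before tokens reach a parser:

```python
import io
import tokenize


def significant_tokens(source):
    """Yield stdlib tokenizer tokens for `source`, skipping NL tokens.

    Hypothetical helper for illustration only; not part of pegen or CPython.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NL:
            # NL marks a non-logical line break (blank line, or a break
            # inside brackets), so it is dropped here.
            continue
        yield tok


# One bracketed line break and one blank line, followed by a second statement.
SOURCE = "x = (1,\n     2)\n\ny = 3\n"
names = [tokenize.tok_name[tok.type] for tok in significant_tokens(SOURCE)]
```

With this filter in place, the only newline-like tokens left in the stream are the NEWLINE tokens that end each logical line.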
Hi David,
Could you share what you have so far, perhaps on GitHub or so? That way
it's easier to diagnose your problems. I'm reasonably familiar with Rust.
Perhaps also add a minimal crashing example?
Cheers,
Matthias.
On Thu, 27 Oct 2022, 04:52 David J W, wrote:
> Pablo,
> Nl and Newline
Hummm… he is also mentioning NL and Newline tokens, and if I recall correctly those are tokens that only appear in the Python tokenizer and are emitted differently from the C one (and therefore they are not used in the grammar).
Pablo Galindo Salgado
On 26 Oct 2022, at 21:57, Guido van Rossum wrote:
I wonder if David may be struggling with the rule that a newline is
significant in the grammar unless it appears inside matching
brackets/parentheses/braces? I think that's in the lexer. Similarly,
multiple newlines are collapsed.
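
A small sketch of how this looks from the stdlib tokenize module (the Python tokenizer, not the internal C one, which per this thread emits these tokens differently). The helper name newline_kinds is hypothetical; it just lists the newline-like tokens in order:

```python
import io
import tokenize


def newline_kinds(source):
    """Return the names of the NEWLINE and NL tokens in `source`, in order.

    Hypothetical helper for illustration only.
    """
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.NEWLINE, tokenize.NL)
    ]


# A line break inside parentheses, then the end of the logical line.
inside_brackets = newline_kinds("f(a,\n  b)\n")

# Two statements separated by two blank lines.
blank_lines = newline_kinds("a = 1\n\n\nb = 2\n")

print(inside_brackets)  # ['NL', 'NEWLINE']
print(blank_lines)      # ['NEWLINE', 'NL', 'NL', 'NEWLINE']
```

Here the break inside the parentheses and the blank lines show up as NL, while only the end of each logical line produces NEWLINE.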
On Wed, Oct 26, 2022 at 1:19 PM Pablo Galindo Salgado
wrote:
> Hi
Pablo,
Nl and Newline are tokens but I am interested in NEWLINE's behavior in
the Python grammar, note the casing.
For example in simple_stmts @
https://github.com/python/cpython/blob/main/Grammar/python.gram#L107
Is that NEWLINE some sort of built-in rule of the grammar? In my project
I am
Hi,
As I mentioned, NEWLINE is a token. All uppercase words in the grammar are
tokens and therefore are produced by the lexer, not the parser. It is not a
built-in rule. In particular, that token is produced here:
https://github.com/python/cpython/blob/6777e09166fc384ea0a4b50202c7b0bd7a23330c/Parser
Hi,
I am not sure I understand exactly what you are asking, but NEWLINE is a token,
not a parser rule. What decides when NEWLINE is emitted is the lexer, which has
nothing to do with PEG. Normally PEG parsers also act as tokenizers, but the
one in CPython does not.
Also notice that CPython’s pars