Ok, we have implemented a rough prototype and we have decided not to go
with this for the following reasons:

* It has almost the same storage cost as the solution we already have:
storing the node id costs 4 bytes (an integer) per instruction, while
our solution needs 2+ bytes per instruction for the offsets (normally
just 2, as offsets are generally < 256) plus the compressed end line,
which is very small since it is almost always the same as the start line.
* It doesn't actually offer more advantages. The current solution in the PEP
can do exactly the same as this solution if you allow reparsing when
displaying tracebacks. This is because with the start line, end line, start
offset, end offset and the original file, you can extract the source
associated with the instruction, parse it (and this
is much faster because you only need to parse the tiny fragment) and then
you get an AST node that you can use for whatever you want.
* The proposed solution forces entire files to be reparsed several times just
to extract a single AST node. Even worse: for frames in the same file it
forces those files to be reparsed again and again, unless you complicate the
implementation by adding AST caching. But even that won't be free, as it
will incur quite a lot of extra memory overhead. This is problematic,
especially when displaying exceptions, as memory can be low, which the
current design takes into account.
* Nodes are created and modified after the optimization pass, so the AST
produced by the parser is not enough to reconstruct the actual
information; we also need to run the optimization passes. Unfortunately,
this is (by design) not done in the Python API (to preserve
all possible information about the original code), so this complicates
the API quite a lot, as `ast.parse` is not enough to get the
original tree.
* The implementation is considerably more complex than the one the PEP has,
since doing it right requires hashing the source files, implementing node id
propagation in the parser, and writing a node visitor to find the correct
node, plus reparsing in the traceback.
* It makes the AST (both the C one and the Python one) bigger, which
increases the memory cost of parsing and slightly affects performance
(we measured a 4% performance decrease from adding the new field).
* It makes the API for third-party tools very non-straightforward, forcing
reparsing and searching for the AST nodes. Even if we provide
some new functionality in the ast module or similar to make this easier, it
is still quite a lot of overhead just to get the position information.
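
To make the second point above concrete, here is a minimal sketch of the
reparse-on-demand approach that the PEP's position data enables. The helper
name is hypothetical, and it assumes a single-line span with 0-indexed
column offsets:

```python
import ast

def node_from_positions(line, start_col, end_col):
    """Recover an AST node for an instruction from PEP 657-style
    column offsets by reparsing only the tiny source fragment.

    Hypothetical helper for illustration: single-line spans only,
    0-indexed columns assumed.
    """
    fragment = line[start_col:end_col]
    # Parsing the fragment is far cheaper than reparsing the whole file.
    return ast.parse(fragment, mode="eval").body

# The span the PEP would record for the failing addition in
# "x = i1 + i2 + s1" is columns 4..16 on line 1:
node = node_from_positions("x = i1 + i2 + s1", 4, 16)
print(type(node).__name__)  # → BinOp
```

The recovered node carries full structure (operands, operator, nested
spans), which is everything the node-id scheme would have provided.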

Even if it is possible to solve many of these problems, we think the
complexity is not worth it, especially since it doesn't actually provide
more functionality.
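
For comparison, the minimum a third-party tool would have to do under the
node-id scheme looks roughly like the sketch below. The function name and
the numbering via `ast.walk` (breadth-first) are our assumptions for
illustration; the prototype numbered nodes in the parser itself, so the
exact deterministic scheme may differ:

```python
import ast

def find_node_by_id(source, target_id):
    # Reparse the entire file just to recover one node: this is the
    # cost the node-id scheme imposes on every lookup.
    tree = ast.parse(source)
    # Number nodes deterministically in walk order (assumed scheme:
    # ast.walk yields nodes breadth-first from the root).
    for node_id, node in enumerate(ast.walk(tree)):
        if node_id == target_id:
            return node
    raise LookupError(f"no AST node with id {target_id}")

root = find_node_by_id("a = b + c", 0)
print(type(root).__name__)  # → Module
```

Note that every lookup pays for a full reparse and a linear search,
which is exactly the overhead the bullet points above describe.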

On Tue, 18 May 2021 at 13:47, Pablo Galindo Salgado <pablog...@gmail.com>
wrote:

> > One integer is actually not enough to assign IDs.
>
> Actually, disregard this particular problem. I think that we could
> perfectly stop assigning IDs if we reach the overflow limit and call it a
> day
> since you need to have a truly horrendous file to reach 4,294,967,295 AST 
> nodes
> (I did some tests to check this).
>
> On Tue, 18 May 2021 at 13:25, Pablo Galindo Salgado <pablog...@gmail.com>
> wrote:
>
>> Yet another problem that I found:
>>
>> One integer is actually not enough to assign IDs. One unsigned integer
>> can cover 4,294,967,295 AST nodes, but it is technically possible
>> to have more than that in a single file. While in PEP 657 we are tracking
>> offsets that are normally very low (typically < 100) or end lines
>> that are easily compressible (as end lines are normally equal to the
>> start line), node IDs can be arbitrarily big. What is worse:
>> the parser creates some AST nodes that it throws away if parsing fails
>> (that's how PEG parsers work), so there will be a lot of IDs that
>> don't get assigned (and the parser doesn't have an easy way to know how
>> many IDs it has thrown away). This forces us to either use something
>> bigger than an integer (probably a size_t) or to deal with overflow.
>>
>> On Tue, 18 May 2021 at 10:23, Pablo Galindo Salgado <pablog...@gmail.com>
>> wrote:
>>
>>> Hummmm, actually another problem of this approach:
>>>
>>> Nodes are created and modified after the optimization pass, so the AST
>>> produced by the parser is not enough to reconstruct the actual
>>> information; we also need to run the optimization passes. Unfortunately,
>>> this is (by design) not done in the Python API (to preserve
>>> all possible information about the original code), so this complicates
>>> the API quite a lot, as `ast.parse` is not enough to get the
>>> original tree.
>>>
>>> On Tue, 18 May 2021 at 08:53, Pablo Galindo Salgado <pablog...@gmail.com>
>>> wrote:
>>>
>>>> Hi Nathaniel,
>>>>
>>>> Thanks a lot for your suggestion! I like the idea, although I still
>>>> think it is more complex than our current proposal; on the other hand it
>>>> allows for much richer results, so I'm quite excited to try it out. We
>>>> are going to explore it with a prototype, and if we are
>>>> convinced we will update the PEP.
>>>>
>>>> One thing I'm not a fan of is that tools that want to leverage this
>>>> information (note that our PEP also offers this info for tools as an
>>>> important aspect of it) will need to reparse the file AND search for the
>>>> AST node, which also involves transforming the full tree to Python
>>>> objects, which is even slower. This is unfortunately a worse API than
>>>> just getting the full location given an instruction offset. Also, while
>>>> displaying tracebacks may be a good scenario for the speed tradeoff, it
>>>> may not be for other tools. Finally, if no file is available all this
>>>> information is lost, although one could argue that then it is not
>>>> extremely useful...
>>>>
>>>> Regards from sunny London,
>>>> Pablo Galindo Salgado
>>>>
>>>> On Tue, 18 May 2021, 01:43 Nathaniel Smith, <n...@pobox.com> wrote:
>>>>
>>>>> On Mon, May 17, 2021 at 6:18 AM Mark Shannon <m...@hotpy.org> wrote:
>>>>> > 2. Repeated binary operations on the same line.
>>>>> >
>>>>> > A single location can also be clearer when all the code is on one
>>>>> line.
>>>>> >
>>>>> > i1 + i2 + s1
>>>>> >
>>>>> > PEP 657:
>>>>> >
>>>>> > i1 + i2 + s1
>>>>> > ^^^^^^^^^^^^
>>>>> >
>>>>> > Using a single location:
>>>>> >
>>>>> > i1 + i2 + s1
>>>>> >          ^
>>>>>
>>>>> It's true this case is a bit confusing with the whole operation span
>>>>> highlighted, but I'm not sure the single location version is much
>>>>> better. I feel like a Really Good UI would, like, highlight the two
>>>>> operands in different colors or something, or at least separately
>>>>> underline the two items whose type is incompatible:
>>>>>
>>>>> TypeError: unsupported operand type(s) for +: 'int' + 'str':
>>>>> i1 + i2 + s1
>>>>> ^^^^^^^   ~~
>>>>>
>>>>> More generally, these error messages are the kind of thing where the
>>>>> UI can always be tweaked to improve further, and those tweaks can make 
>>>>> good
>>>>> use of any rich source information that's available.
>>>>>
>>>>> So, here's another option to consider:
>>>>>
>>>>> - When parsing, assign each AST node a unique, deterministic id (e.g.
>>>>> sequentially across the AST tree from top-to-bottom, left-to-right).
>>>>> - For each bytecode offset, store the corresponding AST node id in an
>>>>> lnotab-like table
>>>>> - When displaying a traceback, we already need to go find and read the
>>>>> original .py file to print source code at all. Re-parse it, and use the 
>>>>> ids
>>>>> to find the original AST node, in context with full structure. Let the
>>>>> traceback formatter do whatever clever stuff it wants with this info.
>>>>>
>>>>> Of course if the .py and .pyc files don't match, this might produce
>>>>> gibberish. We already have that problem with showing source lines, but it
>>>>> might be even more confusing if we get some random unrelated AST node. 
>>>>> This
>>>>> could be avoided by storing some kind of hash in the code object, so that
>>>>> we can validate the .py file we find hasn't changed (sha512 if we're
>>>>> feeling fancy, crc32 if we want to save space, either way is probably 
>>>>> fine).
>>>>>
>>>>> This would make traceback printing more expensive, but only if you
>>>>> want the fancy features, and traceback printing is already expensive (it
>>>>> does file I/O!). Usually by the time you're rendering a traceback it's 
>>>>> more
>>>>> important to optimize for human time than CPU time. It would take less
>>>>> memory than PEP 657, and the same as Mark's proposal (both only track one
>>>>> extra integer per bytecode offset). And it would allow for arbitrarily 
>>>>> rich
>>>>> traceback display.
>>>>>
>>>>> (I guess in theory you could make this even cheaper by using it to
>>>>> replace lnotab, instead of extending it. But I think keeping lnotab around
>>>>> is a good idea, as a fallback for cases where you can't find the original
>>>>> source but still want some hint at location information.)
>>>>>
>>>>> -n
>>>>>
>>>>> --
>>>>> Nathaniel J. Smith -- https://vorpus.org
>>>>> _______________________________________________
>>>>> Python-Dev mailing list -- python-dev@python.org
>>>>> To unsubscribe send an email to python-dev-le...@python.org
>>>>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>>>>> Message archived at
>>>>> https://mail.python.org/archives/list/python-dev@python.org/message/BUXFOSAEBXLIHH432PKBCXOGXUAHQIVP/
>>>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>>>
>>>>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/AG7YYKZ2TOKRRKJU66Q4X3A5O4TBKP47/
Code of Conduct: http://python.org/psf/codeofconduct/
