Re: ipython2 does not work anymore

2017-01-20 Thread dieter
Cecil Westerhof  writes:
> ...
>> If you do mean 'pathlib', it was introduced in Python 3.4.
>
> It is about python2.

I remember seeing announcements for enhanced "path" modules
on this list. Your previously posted traceback shows that the problem
comes from the package "pickleshare" which expects a "path" module
(likely one of those enhanced modules) and cannot find one.

I would try to find out what "path" module "pickleshare" requires
and then install it.
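
A hedged sketch of one way to check that from Python (assumes setuptools'
"pkg_resources" is available; the exact dependency name will depend on the
pickleshare version you have installed):

    import pkg_resources

    dist = pkg_resources.get_distribution("pickleshare")
    print(dist.requires())   # lists the packages pickleshare declares it needs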

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python corrupted double-linked list error

2017-01-20 Thread dieter
Xristos Xristoou  writes:

> I am a Python 2.7 and Ubuntu 16.04 user.

While analysing a problem with an upgrade to Ubuntu 16.04 (unrelated to Python),
I found messages reporting a problem with Python on Ubuntu 16.04[.0]
(leading to memory corruption - which is what you are seeing).
The proposed solution was to upgrade to Ubuntu 16.04.1.

-- 
https://mail.python.org/mailman/listinfo/python-list


RE: Must target be only one bit one such as 0001, 0010, 0100, 1000 In supervised neural learning f(w*p+b) with perceptron rule w = w + e for linear case?

2017-01-20 Thread Deborah Swanson
Ho Yeung Lee wrote, on January 19, 2017 12:05 AM
> 
> Must target be only one bit one such as 0001,0010,0100,1000 
> In supervised neural learning f(w*p+b) with perceptron rule w 
> = w + e for linear case?
> 
> with neural network design
> 
> does it mean that you cannot use two or more ones in the target, such 
> as 0011, 0110, 1100, 1010, 0111, 1110, 1101, etc., when training the weights?

With all respect, this sounds like a question you should be asking your
professor, or one of the teaching assistants (TAs) for the course, or
maybe your local computer science tutoring volunteers, if none of the
above are available.

This list is for questions and discussion about Python. Maybe you're
trying to code this problem in Python, or maybe you're planning to at
some stage, but you need to understand the fundamentals of the problem
first.

It's possible someone on this list would recognize the problem you're
having, but it's unlikely.

Come back with a clearly stated Python problem, and I'm sure lots of
people will help you with it.

Deborah

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Using python to start programs after logging in

2017-01-20 Thread John Gordon
In <[email protected]> Cecil Westerhof  writes:

> > I think using your window manager's built-in facilities for starting
> > programs would be better.  Why are you using Python instead?

> Because when you use the window manager's built-in facilities, all
> programs will be started on the same virtual desktop, and I want to
> start them on different ones.

The window manager doesn't allow you to specify a target desktop?  That
seems like a pretty heinous feature omission.
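
For what it's worth, a hedged sketch of driving the placement from Python
(assumes the external wmctrl utility is installed; the desktop number and
the program are only examples, and whether new windows land on the current
desktop depends on the window manager):

    import subprocess

    subprocess.call(["wmctrl", "-s", "1"])   # switch to virtual desktop 1
    subprocess.Popen(["xterm"])              # programs started now should open there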

-- 
John Gordon   A is for Amy, who fell down the stairs
[email protected]  B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"

-- 
https://mail.python.org/mailman/listinfo/python-list


PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
Can anyone point me at a rationale for PEP 393 being incorporated in
Python 3.3 over using UTF-8 as an internal string representation? I've
found good articles by Nick Coghlan, Armin Ronacher and others on the
matter. What I have not found is discussion of pros and cons of
alternatives to the old narrow or wide implementation of Unicode
strings.

ISTM that most operations on strings are via iterators and thus agnostic
to variable or fixed width encodings. How important is it to be able to
get to part of a string with a simple index? Just because old skool
strings could be treated as a sequence of characters, is that a reason
to shoehorn the subtleties of Unicode into that model?

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman  wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation? I've
> found good articles by Nick Coghlan, Armin Ronacher and others on the
> matter. What I have not found is discussion of pros and cons of
> alternatives to the old narrow or wide implementation of Unicode
> strings.

The PEP itself has the rationale for the problems with the narrow/wide
idea; to quote from https://www.python.org/dev/peps/pep-0393/:
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

Basically, narrow builds had very odd behavior with non-BMP
characters, namely that indexing into the string could easily produce
mojibake. Wide builds used quite a bit more memory, which generally
translates to reduced performance.

> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. How important is it to be able to
> get to part of a string with a simple index? Just because old skool
> strings could be treated as a sequence of characters, is that a reason
> to shoehorn the subtleties of Unicode into that model?

I think you are underestimating how much string indexing is used. Every
operation on a UTF-8 string that contains larger characters must start
from index 0 - you can never safely start anywhere else.
rfind/rsplit/rindex/rstrip and the other reverse functions would require
walking the string from start to end, rather than short-circuiting by
reading from right to left. With indexing becoming linear time, many
simple algorithms need to be written with that in mind to avoid O(n^2)
time. Such performance regressions can often go unnoticed by developers,
who are likely to be testing with small data, and may turn into
(accidental) denial-of-service problems when the code meets real data.
The exact same problems occur with the old narrow builds (UTF-16; note
that correct indexing was NOT implemented in those builds, which is what
caused the mojibake problems) - only a UTF-32 or PEP 393 implementation
avoids them.
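
To make the cost concrete, here is a rough illustrative sketch (not the
actual CPython code) of what indexing into a raw UTF-8 buffer involves -
every lookup has to skip continuation bytes starting from the beginning:

    def utf8_index(buf, i):
        """Return the byte offset of code point i in the UTF-8 bytes `buf`."""
        count = -1
        for offset, byte in enumerate(bytearray(buf)):
            if byte & 0xC0 != 0x80:          # not a continuation byte
                count += 1
                if count == i:
                    return offset
        raise IndexError("code point index out of range")

    utf8_index(u"a\u20acbc".encode("utf-8"), 2)  # walks over the 3-byte Euro sign first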

Note that from a user's perspective (including most, if not almost all,
developers), PEP 393 strings can be treated as if they were UTF-32, but
with many of the benefits of UTF-8. As far as I'm aware, only developers
writing extension modules need to care - and even then only if they need
maximum performance and thus cannot convert every string they access to
UTF-32 or UTF-8.

--
Chris Kaynor
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Thomas Nyberg

On 01/20/2017 03:06 PM, Chris Kaynor wrote:
> [...snip...]
>
> --
> Chris Kaynor


I was able to delete my response which was a wholly contained subset of 
this one. :)



But I have one extra question. Is string indexing guaranteed to be 
constant-time for Python? I thought so, but I couldn't find it 
documented anywhere. (Not that I think it practically matters, since it 
couldn't really change, for all the reasons you mentioned.) I found this 
page, which at least details (if not explicitly "guarantees") the 
complexity properties of other datatypes:


https://wiki.python.org/moin/TimeComplexity
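
(As an informal check on a given build - not a language guarantee, and the
timings are machine-dependent - a sketch like this shows that indexing cost
does not depend on the position:)

    import timeit

    s = "x" * 10**7
    print(timeit.timeit(lambda: s[10], number=100000))            # near the start
    print(timeit.timeit(lambda: s[10**7 - 10], number=100000))    # near the end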

Cheers,
Thomas
--
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg  wrote:
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>

No, it isn't; this question came up in the context of MicroPython,
which chose to go UTF-8 internally instead of PEP 393. But the
considerations for uPy are different - it's not designed to handle
gobs of data, so constant-time vs linear isn't going to have as much
impact. But in normal work, it's important enough to have predictable
string performance. You can't afford to deploy a web application, test
it, and then have someone send a large amount of data at it, causing
massive O(n^2) blowouts.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Kaynor
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg  wrote:
> On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>>
>>
>> [...snip...]
>>
>> --
>> Chris Kaynor
>>
>
> I was able to delete my response which was a wholly contained subset of this
> one. :)
>
>
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>
> https://wiki.python.org/moin/TimeComplexity

As far as I'm aware, the language does not guarantee it. In fact, I
believe it was decided that MicroPython could use UTF8 strings with
linear indexing while still calling itself Python. This was very
useful for MicroPython due to the platforms it supports (embedded),
and needing to keep the memory footprint very small.

I believe Guido (on Python-ideas) has stated that constant-time string
indexing is a guarantee of CPython, however.

The only reference I found in my (very quick) search is the Python-Dev
thread at 
https://groups.google.com/forum/#!msg/dev-python/3lfXwljNLj8/XxO2s0TGYrYJ
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB

On 2017-01-20 23:06, Chris Kaynor wrote:

> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman  wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation? I've
>> found good articles by Nick Coghlan, Armin Ronacher and others on the
>> matter. What I have not found is discussion of pros and cons of
>> alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rationale for the problems with the narrow/wide
> idea; to quote from https://www.python.org/dev/peps/pep-0393/:
> There are two classes of complaints about the current implementation
> of the unicode type: on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character;
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.
>
>> ISTM that most operations on strings are via iterators and thus agnostic
>> to variable or fixed width encodings. How important is it to be able to
>> get to part of a string with a simple index? Just because old skool
>> strings could be treated as a sequence of characters, is that a reason
>> to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.

You could implement rsplit and rstrip easily enough, but rfind and 
rindex return the index, so you'd need to scan the string to return that.
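
A small sketch of that point (illustrative only): finding a match from the
right is easy at the byte level, but reporting a code point index still
means counting from the start.

    def rfind_codepoint_index(buf, needle):
        """rfind over UTF-8 bytes, returning a code point index."""
        byte_pos = buf.rfind(needle)                 # right-to-left, byte level
        if byte_pos == -1:
            return -1
        # count the code points before the match (one per non-continuation byte)
        return sum(1 for b in bytearray(buf[:byte_pos]) if b & 0xC0 != 0x80)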



> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.

As someone who has written an extension, I can tell you that I much 
prefer dealing with a fixed number of bytes per codepoint to a 
variable number of bytes per codepoint, especially as I'm also 
supporting earlier versions of Python where that was the case.


--
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
Chris Kaynor  writes:

> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman  wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation?
>> I've found good articles by Nick Coghlan, Armin Ronacher and others
>> on the matter. What I have not found is discussion of pros and cons
>> of alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rationale for the problems with the narrow/wide
> idea; to quote from https://www.python.org/dev/peps/pep-0393/: There
> are two classes of complaints about the current implementation of the
> unicode type: on systems only supporting UTF-16, users complain that
> non-BMP characters are not properly supported. On systems using UCS-4
> internally (and also sometimes on systems using UCS-2), there is a
> complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character;
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.

I'm taking it as a given that the old way was often sub-optimal. My
questions were about the alternatives, and why PEP 393 was chosen over
other approaches.

>> ISTM that most operations on strings are via iterators and thus
>> agnostic to variable or fixed width encodings. How important is it to
>> be able to get to part of a string with a simple index? Just because
>> old skool strings could be treated as a sequence of characters, is
>> that a reason to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.

I was asserting that most useful operations on strings start from index
0. The r* operations would not be slowed down that much, as UTF-8 has
the useful property that a byte which is not at the start of a sequence
(in the sense of a code point rather than Python) is recognizable as a
continuation byte and so quick to skip over while working backwards
from the end.

The only significant use of an index dereference that I could come up
with was the result of a find() or index(). I put out this public
question so that I could be clued in about other uses. My personal
experience is that in most cases where I might consider find(), I end
up using re and the match groups, which hold copies of
the (sub)strings that I want.
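
(For instance, something along these lines - a trivial sketch - rather than
find() plus manual slicing:)

    import re

    m = re.search(r"\d+", "order 42 shipped")
    if m:
        print(m.group(0))    # the matched substring itself, no index arithmetic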

> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.

PEP 393 already says that "the specification chooses UTF-8 as the
recommended way of exposing strings to C code".

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Pete Forman
MRAB  writes:

> As someone who has written an extension, I can tell you that I much
> prefer dealing with a fixed number of bytes per codepoint than a
> variable number of bytes per codepoint, especially as I'm also
> supporting earlier versions of Python where that was the case.

At the risk of sounding harsh, if supporting variable bytes per
codepoint is a pain you should roll with it for the greater good of
supporting users.

PEP 393 / Python 3.3 required extension writers to revisit their access
to strings. My explicit question was about why PEP 393 was adopted to
replace the deficient old implementations rather than another approach.
The implicit question is whether a UTF-8 internal representation should
replace that of PEP 393.

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman  wrote:
> I was asserting that most useful operations on strings start from index
> 0. The r* operations would not be slowed down that much as UTF-8 has the
> useful property that attempting to interpret from a byte that is not at
> the start of a sequence (in the sense of a code point rather than
> Python) is invalid and so quick to move over while working backwards
> from the end.

Let's take one very common example: decoding JSON. A ton of web
servers out there will call json.loads() on user-supplied data. The
bulk of the work is in the scanner, which steps through the string and
does the actual parsing. That function is implemented in Python, so
it's a good example. (There is a C accelerator, but we can ignore that
and look at the pure Python one.)

So, how could you implement this function? The current implementation
maintains an index - an integer position through the string. It
repeatedly requests the next character as string[idx], and can also
slice the string (to check for keywords like "true") or use a regex
(to check for numbers). Everything's clean, but it's lots of indexing.
Alternatively, it could remove and discard characters as they're
consumed. It would maintain a string that consists of all the unparsed
characters. All indexing would be at or near zero, but after every
tiny piece of parsing, the string would get sliced.

With immutable UTF-8 strings, both of these would be O(n^2). Either
indexing is linear, so parsing the tail of the string means scanning
repeatedly; or slicing is linear, so parsing the head of the string
means slicing all the rest away.

The only way for it to be fast enough would be to have some sort of
retainable string iterator, which means exposing an opaque "position
marker" that serves no purpose other than parsing. Every string parse
operation would have to be reimplemented this way, lest it perform
abysmally on large strings. It'd mean some sort of magic "thing" that
probably has a reference to the original string, so you don't get the
progressive RAM refunds that slicing gives, and you'd still have to
deal with lots of the other consequences. It's probably doable, but it
would be a lot of pain.
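
A toy sketch of the two styles described above (not the real json scanner,
just an illustration of where the indexing and slicing happen):

    def skip_spaces_by_index(s, idx):
        while idx < len(s) and s[idx] in " \t\n\r":
            idx += 1                 # cheap only if s[idx] is O(1)
        return idx

    def skip_spaces_by_slicing(s):
        while s and s[0] in " \t\n\r":
            s = s[1:]                # copies the rest of the string every time
        return s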

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman  wrote:
> MRAB  writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.

That hasn't been demonstrated, though. There's plenty of evidence
regarding cache usage that shows that direct indexing is incredibly
beneficial on large strings. What are the benefits of variable-sized
encodings? AFAIK, the only real benefit is that you can use less
memory for strings that contain predominantly ASCII but a small number
of astral characters (plus *maybe* a faster encode-to-UTF-8; you
wouldn't get a faster decode-from-UTF-8, because you still need to
check that the byte sequence is valid). Can you show a use-case that
would be materially improved by UTF-8?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread MRAB

On 2017-01-21 00:51, Pete Forman wrote:
> MRAB  writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.

Or I could decide not to bother and leave it to someone else to continue 
the project. After all, it's not as though I'm getting paid for the work; 
it's purely voluntary.



> PEP 393 / Python 3.3 required extension writers to revisit their access
> to strings. My explicit question was about why PEP 393 was adopted to
> replace the deficient old implementations rather than another approach.
> The implicit question is whether a UTF-8 internal representation should
> replace that of PEP 393.

I already had to handle 1-byte bytestrings and 2/4-byte (narrow/wide) 
Unicode strings, so switching to 1/2/4 strings wasn't too bad. Switching 
to a completely different, variable-width system would've been a lot 
more work.


--
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Paul Rubin
Chris Angelico  writes:
> decoding JSON... the scanner, which steps through the string and
> does the actual parsing. ...
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing.

Python already has that type of iterator:
   x = "foo"
   for c in x:
       pass    # c is each character in turn; no integer index involved

> It'd mean some sort of magic "thing" that probably has a reference to
> the original string

It's a regular old string iterator unless I'm missing something.  Of
course a json parser should use it, though who uses the non-C json
parser anyway these days?

[Chris Kaynor writes:]
> rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. 

UTF-8 can be read from right to left because you can recognize when a
codepoint begins by looking at the top 2 bits of each byte as you scan
backwards.  Any combination except for 10 begins a code point, and 10
always marks a continuation byte.  This "prefix property" of UTF-8 is a
design feature and not a trick someone noticed after the fact.
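
A sketch of that backwards scan (illustrative, not taken from any library):

    def codepoint_start(buf, pos):
        """Byte offset of the code point containing byte `pos` of UTF-8 `buf`."""
        data = bytearray(buf)
        while pos > 0 and data[pos] & 0xC0 == 0x80:   # 10xxxxxx: continuation byte
            pos -= 1
        return pos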

Also if you really want O(1) random access, you could put an auxiliary
table into long strings, giving the byte offset of every 256th codepoint
or something like that.  Then you'd go to the nearest table entry and
scan from there.  This would usually be in-cache scanning so quite fast.
Or use the related representation of "ropes" which are also very easy to
concatenate if they can be nested.  Erlang does something like that
with what it calls "binaries".
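
And a sketch of the auxiliary-table idea (the stride and layout here are
arbitrary choices for illustration, not anything CPython actually does):

    def build_offset_table(buf, stride=256):
        """Byte offsets of every `stride`-th code point in UTF-8 bytes `buf`."""
        offsets = []
        count = 0
        for offset, byte in enumerate(bytearray(buf)):
            if byte & 0xC0 != 0x80:          # start of a code point
                if count % stride == 0:
                    offsets.append(offset)
                count += 1
        return offsets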
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Chris Angelico
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin  wrote:
> Chris Angelico  writes:
>> decoding JSON... the scanner, which steps through the string and
>> does the actual parsing. ...
>> The only way for it to be fast enough would be to have some sort of
>> retainable string iterator, which means exposing an opaque "position
>> marker" that serves no purpose other than parsing.
>
> Python already has that type of iterator:
>x = "foo"
>for c in x: 
>
>> It'd mean some sort of magic "thing" that probably has a reference to
>> the original string
>
> It's a regular old string iterator unless I'm missing something.  Of
> course a json parser should use it, though who uses the non-C json
> parser anyway these days?

You can't do a look-ahead with a vanilla string iterator. That's
necessary for a lot of parsers.

> Also if you really want O(1) random access, you could put an auxiliary
> table into long strings, giving the byte offset of every 256th codepoint
> or something like that.  Then you'd go to the nearest table entry and
> scan from there.  This would usually be in-cache scanning so quite fast.
> Or use the related representation of "ropes" which are also very easy to
> concatenate if they can be nested.  Erlang does something like that
> with what it calls "binaries".

Yes, which gives two-level indexing (first find the strand, then the
character), and that's going to play pretty badly with CPU caches. I'd
be curious to know how an alternate Python with that implementation
would actually perform.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-20 Thread Jussi Piitulainen
Chris Angelico writes:

> On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman wrote:

>> I was asserting that most useful operations on strings start from
>> index 0. The r* operations would not be slowed down that much as
>> UTF-8 has the useful property that attempting to interpret from a
>> byte that is not at the start of a sequence (in the sense of a code
>> point rather than Python) is invalid and so quick to move over while
>> working backwards from the end.
>
> Let's take one very common example: decoding JSON. A ton of web
> servers out there will call json.loads() on user-supplied data. The
> bulk of the work is in the scanner, which steps through the string and
> does the actual parsing. That function is implemented in Python, so
> it's a good example. (There is a C accelerator, but we can ignore that
> and look at the pure Python one.)
>
> So, how could you implement this function? The current implementation
> maintains an index - an integer position through the string. It
> repeatedly requests the next character as string[idx], and can also
> slice the string (to check for keywords like "true") or use a regex
> (to check for numbers). Everything's clean, but it's lots of indexing.
> Alternatively, it could remove and discard characters as they're
> consumed. It would maintain a string that consists of all the unparsed
> characters. All indexing would be at or near zero, but after every
> tiny piece of parsing, the string would get sliced.
>
> With immutable UTF-8 strings, both of these would be O(n^2). Either
> indexing is linear, so parsing the tail of the string means scanning
> repeatedly; or slicing is linear, so parsing the head of the string
> means slicing all the rest away.
>
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string parse
> operation would have to be reimplemented this way, lest it perform
> abysmally on large strings. It'd mean some sort of magic "thing" that
> probably has a reference to the original string, so you don't get the
> progressive RAM refunds that slicing gives, and you'd still have to
> deal with lots of the other consequences. It's probably doable, but it
> would be a lot of pain.

Julia does this. It has immutable UTF-8 strings, and there is a JSON
parser. The "opaque position marker" is just the byte index. An attempt
to use an invalid index throws an error. A substring type points to an
underlying string. An iterator, called graphemes, even returns
substrings that correspond to what people might consider a character.

I offer Julia as evidence.

My impression is that Julia's UTF-8-based system works and is not a
pain. I wrote a toy function once to access the last line of a large
memory-mapped text file, so I have just this little bit of personal
experience of it, so far. Incidentally, can Python memory-map a UTF-8
file as a string?

http://docs.julialang.org/en/stable/manual/strings/
https://github.com/JuliaIO/JSON.jl
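
On the memory-mapping question: a hedged sketch of what CPython offers today
(mmap exposes bytes, not str, so you decode the part you need; the file name
is just an example):

    import mmap

    f = open("big.txt", "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    nl = m.rfind(b"\n", 0, len(m) - 1)       # scan right-to-left for a newline
    last_line = m[nl + 1:].decode("utf-8")   # decoding copies; m itself is bytes-like
    m.close()
    f.close()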
-- 
https://mail.python.org/mailman/listinfo/python-list