date:20140604

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread martin



Zitat von Steven D'Aprano :


* Having a build-time option to restrict all strings to ASCII-only.

(I think what they mean by that is that strings will be like Python 2
strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)


An ASCII-plus-arbitrary-bytes type called "str" would prevent claiming
"Python 3.4 compatibility" for sure.

Restricting strings to ASCII (as Chris apparently actually suggested)
would allow to claim compatibility with a stretch: existing Python
code might not run on such an implementation. However, since a lot
of existing Python code wouldn't run on MicroPython, anyway, one
might claim to implement a Python 3.4 subset.


* Implementing Unicode internally as UTF-8, and giving up O(1)
indexing operations.

Would either of these trade-offs be acceptable while still claiming
"Python 3.4 compatibility"?

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.


I agree. It's an open question whether such an implementation would be
practical, both in terms of existing Python code, and in terms of existing
C extension modules that people might want to port to MicroPython.

There are more things to consider for the internal implementation,
in particular how the string length is implemented. Several alternatives
exist:
1. store the UTF-8 length (i.e. memory size)
2. store the number of code points (i.e. Python len())
3. store both
4. store neither, but use null termination instead

Variant 3 is most run-time efficient, but could easily use 8 bytes
just for the length, which could outweigh the storage of the actual
data. Variants 1 and 2 lose on some operations (1 loses on computing
len(), 2 loses on string concatenation). 3 would add the restriction
of not allowing U+ in a string (which would be reasonable IMO),
and make all length computations inefficient. However, it wouldn't
be worse than standard C.

Regards,
Martin


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 3:23 PM, Guido van Rossum  wrote:
> On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico  wrote:
>>
>> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano 
>> wrote:
>> > * Having a build-time option to restrict all strings to ASCII-only.
>> >
>> >   (I think what they mean by that is that strings will be like Python 2
>> >   strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>>
>> What I was actually suggesting along those lines was that the str type
>> still be notionally a Unicode string, but that any codepoints >127
>> would either raise an exception or blow an assertion, and all the code
>> to handle multibyte representations would be compiled out.
>
>
> That would be a pretty lousy option.
>
>> So there'd
>> still be a difference between strings of text and streams of bytes,
>> but all encoding and decoding to/from ASCII-compatible encodings would
>> just point to the same bytes in RAM.
>
> I suppose this is why you propose to reject 128-255?

Correct. It would allow small devices to guarantee that strings are
compact (MicroPython is aimed primarily at an embedded controller),
guarantee identity transformations in several common encodings (and
maybe this sort of build wouldn't ship with any non-ASCII-compat
encodings at all), and never demonstrate behaviour different from
CPython's except by explicitly failing.

>> Risk: Someone would implement that with assertions, then compile with
>> assertions disabled, test only with ASCII, and have lurking bugs.
>
>
> Never mind disabling assertions -- even with enabled assertions you'd have
> to expect most Python programs to fail with non-ASCII input.

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to be
copied and doubled in size (lots of heap space used).

> Then again the UTF-8 option would be pretty devastating too for anything
> manipulating strings (especially since many Python APIs are defined using
> indexes, e.g. the re module).

That's what I thought, too, but a quick poll on python-list suggests
that indexing isn't nearly as common as I had thought it to be. On a
smallish device, you won't have megabytes of string to index, so even
O(N) indexing can't get pathological. (This would be an acknowledged
limitation of micropython as a Unix Python - "it's designed for small
programs, and it's performance-optimized for small programs, so it
might get pathologically slow on certain large data manipulations".)

> Why not support variable-width strings like CPython 3.4?

That was my first recommendation, and in fact I started writing code
to implement parts of PEP 393, with a view to basically doing it the
same way in both Pythons. But discussion on the tracker issue showed a
certain amount of hostility toward the potential expansion of strings,
particularly in the worst-case example of appending a single SMP
character onto a long ASCII string.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 5:02 PM,   wrote:
> There are more things to consider for the internal implementation,
> in particular how the string length is implemented. Several alternatives
> exist:
> 1. store the UTF-8 length (i.e. memory size)
> 2. store the number of code points (i.e. Python len())
> 3. store both
> 4. store neither, but use null termination instead
>
> Variant 3 is most run-time efficient, but could easily use 8 bytes
> just for the length, which could outweigh the storage of the actual
> data. Variants 1 and 2 lose on some operations (1 loses on computing
> len(), 2 loses on string concatenation). 3 would add the restriction
> of not allowing U+ in a string (which would be reasonable IMO),
> and make all length computations inefficient. However, it wouldn't
> be worse than standard C.

The current implementation stores a 16-bit length, which is both the
memory size and the len(). As far as I can see, the memory size is
never needed, so I'd just go for option 2; string concatenation is
already known to be one of those operations that can be slow if you do
it badly, and an optimized str.join() would cover the recommended
use-case.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread dw+python-dev

On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:

> There's a general expectation that indexing will be O(1) because all
> the builtin containers that support that syntax use it for O(1) lookup
> operations.

Depending on your definition of built in, there is at least one standard
library container that does not - collections.deque.

Given the specialized kinds of application this Python implementation is
targetted at, it seems UTF-8 is ideal considering the huge memory
savings resulting from the compressed representation, and the reduced
likelihood of there being any real need for serious text processing on
the device.

It is also unlikely to find software or libraries like Django or
Werkzeug running on a microcontroller, more likely all the Python code
would be custom, in which case, replacing string indexing with
iteration, or temporary conversion to a list is easily done.

In this context, while a fixed-width encoding may be the correct choice
it would also likely be the wrong choice.

David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Jeff Allen

Jython uses UTF-16 internally -- probably the only sensible choice in a 
Python that can call Java. Indexing is O(N), fundamentally. By 
"fundamentally", I mean for those strings that have not yet noticed that 
they contain no supplementary (>0x) characters.


I've toyed with making this O(1) universally. Like Steven, I understand 
this to be a freedom afforded to implementers, rather than an issue of 
conformity.


Jeff Allen

On 04/06/2014 02:17, Steven D'Aprano wrote:

There is a discussion over at MicroPython about the internal
representation of Unicode strings.

...

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Stephen J. Turnbull

dw+python-...@hmmz.org writes:

 > Given the specialized kinds of application this Python
 > implementation is targetted at, it seems UTF-8 is ideal considering
 > the huge memory savings resulting from the compressed
 > representation,

I think you really need to check what the applications are in detail.
UTF-8 costs about 35% more storage for Japanese, and even more for
Chinese, than does UTF-16.  So if you might be using a lot of Asian
localized strings, it might even be worth implementing PEP-393 to get
the best of both worlds for most strings.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Juraj Sukop

On Wed, Jun 4, 2014 at 11:36 AM, Stephen J. Turnbull 
wrote:

>
> I think you really need to check what the applications are in detail.
> UTF-8 costs about 35% more storage for Japanese, and even more for
> Chinese, than does UTF-16.


"UTF-8 can be smaller even for Asian languages, e.g.: front page of
Wikipedia Japan: 83 kB in UTF-8, 144 kB in UTF-16"
>From http://www.lua.org/wshop12/Ierusalimschy.pdf (p. 12)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

[Python-Dev] Some notes about MicroPython from an observer

2014-06-04 Thread Daniel Holth

- micropython is designed to run on a machine with 192 kilobytes of
RAM and perhaps a megabyte of FLASH. The controller can execute
read-only code directly from FLASH. There is no dynamic linker in this
environment. (It also has a UNIX port).
- However it does include a full Python parser and REPL, so the board
can be programmed without a separate computer as opposed to, say,
having to upload bytecode compiled on a regular computer.
- It's definitely going to be a subset of Python. For example,
func.__name__ is not supported - to make it more micro?
- They have a C API. It is much different than the CPython C API.
- It mas more than one code emitter. A certain decorator causes a
function to be compiled to ARM Thumb code instead of bytecode.
- It even has an inline assembler than translates Python-syntax ARM
assembly (to re-use the same parser) into machine code.

Most information from
https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts
and http://micropython.org/
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky  wrote:
> That's another reason why people don't like Unicode enforced upon them
> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script then give up on their arbitrarily chosen,
> one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

Wrong. I use and recommend Unicode, with UTF-8 for transmission, and I
do not ever want to limit people to Latin-1 or any other such subset.
Even though English is the only language I speak, I am *frequently*
using non-ASCII characters (eg when I discuss mathematics on a MUD),
and if I could be absolutely sure that everyone in the conversation
correctly comprehended Unicode, I could do this with a lot more
confidence. Unfortunately, the server I use just passes bytes in and
out, and some clients assume CP-1252, others assume Latin-1, and
others (including my Gypsum) try UTF-8 first and fall back on an
eight-bit encoding (currently CP-1252 because of the first group). But
in an ideal world, server and clients would all speak Unicode
everywhere, and transmit and receive UTF-8. This is not hypocrisy,
this is the way to work reliably.

> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters are not know to runtime, only in bytes, but quite a lot of
> applications can be written by having just that.

The current implementation is flat-out lying, actually. It claims that
it's storing Unicode codepoints (as per the Python spec) while
actually storing bytes, and then it transmits those bytes to the
console etc as-is. This is a bug. It needs to be fixed. The only
question is, what form will the fix take? Will it be PEP 393's
flexible fixed-width representation? UTF-8? UTF-16 (I hope not!)? A
hybrid of Latin-1 where possible and UTF-8 otherwise? But something
has to be done.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky  wrote:
> And I'm saying that not to discourage Unicode addition to MicroPython,
> but to hint that "force-force" approach implemented by CPython3 and
> causing rage and split in the community is not appreciated.

FWIW, it's Python 3 (the language) and not CPython 3.x (the
implementation) that specifies Unicode strings in this way. I don't
know why it has to cause a split in the community; this is the one way
to make sure *everyone's* strings work perfectly, rather than having
ASCII strings work fine and others start tripping over problems in
various APIs.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 17:03:22 +1000
Chris Angelico  wrote:

[]

> > Why not support variable-width strings like CPython 3.4?
> 
> That was my first recommendation, and in fact I started writing code
> to implement parts of PEP 393, with a view to basically doing it the
> same way in both Pythons. But discussion on the tracker issue showed a
> certain amount of hostility toward the potential expansion of strings,
> particularly in the worst-case example of appending a single SMP
> character onto a long ASCII string.

An alternative view is that the discussion on the tracker showed Python
developers' mind-fixation on implementing something the way CPython does
it. And I didn't yet go to that argument, but in the end, MicroPython
does not try to rewrite CPython or compete with it. So, having few
choices with pros and cons leading approximately to the tie among them,
it's the least productive to make the same choice as CPython did.

Even having "rule of thumb" of choosing not-a-CPython way would be more
productive than having the same rule of thumb for blindly choosing
CPython way. (Of course, actually it should be technical discussion
based on the target requirements, like we hopefully did, with strong
arguments against using something else but the de-facto standard
transfer encoding for Unicode).

> 
> ChrisA

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 12:32:12 +1000
Chris Angelico  wrote:

> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano
>  wrote:
> > * Having a build-time option to restrict all strings to ASCII-only.
> >
> >   (I think what they mean by that is that strings will be like
> > Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
> 
> What I was actually suggesting along those lines was that the str type
> still be notionally a Unicode string, but that any codepoints >127
> would either raise an exception or blow an assertion,

That's another reason why people don't like Unicode enforced upon them
- all the talk about supporting all languages and scripts is demagogy
and hypocrisy, given a choice, Unicode zealots would rather limit
people to Latin script then give up on their arbitrarily chosen,
one-among-thousands,
soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

Once again, my claim is what MicroPython implements now is more correct
- in a sense wider than technical - handling. We don't provide Unicode
encoding support, because it's highly bloated, but let people use any
encoding they like. That comes at some price, like length of strings in
characters are not know to runtime, only in bytes, but quite a lot of
applications can be written by having just that.

And I'm saying that not to discourage Unicode addition to MicroPython,
but to hint that "force-force" approach implemented by CPython3 and
causing rage and split in the community is not appreciated.

> and all the code
> to handle multibyte representations would be compiled out. So there'd
> still be a difference between strings of text and streams of bytes,
> but all encoding and decoding to/from ASCII-compatible encodings would
> just point to the same bytes in RAM.
> 
> Risk: Someone would implement that with assertions, then compile with
> assertions disabled, test only with ASCII, and have lurking bugs.
> 
> ChrisA


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Tue, 3 Jun 2014 22:23:07 -0700
Guido van Rossum  wrote:

[]
> Never mind disabling assertions -- even with enabled assertions you'd
> have to expect most Python programs to fail with non-ASCII input.
> 
> Then again the UTF-8 option would be pretty devastating too for
> anything manipulating strings (especially since many Python APIs are
> defined using indexes, e.g. the re module).

If the Unicode is slow (*), then obvious choice is not using Unicode
when not needed. Too bad that's a bit hard in Python3, as it enforces
Unicode everywhere, and dealing with efficient strings requires
prefixing them with funny characters like "b", etc.

* If Unicode if slow because it causes heap to bloat and go swap, the
choice is still the same.

> 
> Why not support variable-width strings like CPython 3.4?

Because, like good deal of community, we hope that Python4 will get
back to reality, and strings will be efficient (both for processing and
storage) by default, and niche and marginal "Unicode string" type will
be used explicitly (using funny prefixes, etc.), only when really
needed.

Ah, all these not so funny geek jokes about internals of language
implementation, hope they didn't make somebody's day dull!

> 
> -- 
> --Guido van Rossum (python.org/~guido)

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky  wrote:
> An alternative view is that the discussion on the tracker showed Python
> developers' mind-fixation on implementing something the way CPython does
> it. And I didn't yet go to that argument, but in the end, MicroPython
> does not try to rewrite CPython or compete with it. So, having few
> choices with pros and cons leading approximately to the tie among them,
> it's the least productive to make the same choice as CPython did.

I'm not a CPython dev, nor a Python dev, and I don't think any of the
big names of CPython or Python has showed up on that tracker as yet.
But why is "be different from CPython" such a valuable choice? CPython
works. It's had many hours of dev time put into it. Problems have been
identified and avoided. Throwing that out means throwing away a
freely-given shoulder to stand on, in an Isaac Newton way.

http://www.joelonsoftware.com/articles/fog69.html

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth

Can of worms, opened.
On Jun 4, 2014 7:20 AM, "Chris Angelico"  wrote:

> On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky  wrote:
> > An alternative view is that the discussion on the tracker showed Python
> > developers' mind-fixation on implementing something the way CPython does
> > it. And I didn't yet go to that argument, but in the end, MicroPython
> > does not try to rewrite CPython or compete with it. So, having few
> > choices with pros and cons leading approximately to the tie among them,
> > it's the least productive to make the same choice as CPython did.
>
> I'm not a CPython dev, nor a Python dev, and I don't think any of the
> big names of CPython or Python has showed up on that tracker as yet.
> But why is "be different from CPython" such a valuable choice? CPython
> works. It's had many hours of dev time put into it. Problems have been
> identified and avoided. Throwing that out means throwing away a
> freely-given shoulder to stand on, in an Isaac Newton way.
>
> http://www.joelonsoftware.com/articles/fog69.html
>
> ChrisA
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
>
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 20:53:46 +1000
Chris Angelico  wrote:

> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky 
> wrote:
> > And I'm saying that not to discourage Unicode addition to
> > MicroPython, but to hint that "force-force" approach implemented by
> > CPython3 and causing rage and split in the community is not
> > appreciated.
> 
> FWIW, it's Python 3 (the language) and not CPython 3.x (the
> implementation) that specifies Unicode strings in this way. 

Yeah, but it's CPython what dictates how language evolves (some people
even think that it dictates how language should be implemented!), so all
good parts belong to Python3, and all bad parts - to CPython3,
right? ;-)

> I don't
> know why it has to cause a split in the community; this is the one way
> to make sure *everyone's* strings work perfectly, rather than having
> ASCII strings work fine and others start tripping over problems in
> various APIs.

It did cause split in the community, that's the fact, that's why
Python2 and Python3 are at the respective positions. Anyway, I'm not
interested in participating in that split, I did not yet uttered my
opinion on that publicly enough, so I seized a chance to drop some
witty remarks, but I don't want to start yet another Unicode flame.

So, let's please be back to Unicode storage representation in
MicroPython. So, https://github.com/micropython/micropython/issues/657
discussed technical aspects, in a recent mail on this list I expressed
my opinion why following CPython way is not productive (for development
satisfaction and evolution of Python community, to be explicit).

Final argument I would have is that you certainly can implement Unicode
support the PEP393 way - it would be enormous help and would be gladly
accepted. The question, how useful it will be for MicroPython. It
certainly will be useful to report passing of testsuites. But will it
be *really* used?

For microcontroller board, it might be too heavy (put simple, with it,
people will be able to do less (== heap running out sooner)), than
without it, so one may expect it to be disabled by default. Then POSIX
port is there surely not to let people replace "python" command
with "micropython" and run Django, but to let people develop and debug
their apps with more comfort than on embedded board. So, it should
behave close to MCU version, and would follow with MCU choice
re: Unicode.

That's actually the reason why I keep up this discussion - not for the
sake of argument or to bash Python3's Unicode choices. With recent
MicroPython announcement, we surely looked for more people to
contribute to its development. But then we (or at least I can speak for
myself), would like to make sure that these contribution are actually
the most useful ones (for both MicroPython, and Python community in
general, which gets more choices, rather than just getting N% smaller
CPython rewrite).

So, you're not sure how O(N) string indexing will work? But MicroPython
offers a great opportunity to try! And it's something new and exciting,
which surely will be useful (== will save people memory), not just
something old and boring ;-).

> 
> ChrisA

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Kristján Valur Jónsson

For those that haven't seen this:

http://www.utf8everywhere.org/

> -Original Message-
> From: Python-Dev [mailto:python-dev-
> bounces+kristjan=ccpgames@python.org] On Behalf Of Donald Stufft
> Sent: 4. júní 2014 01:46
> To: Steven D'Aprano
> Cc: python-dev@python.org
> Subject: Re: [Python-Dev] Internal representation of strings and
> Micropython
> 
> I think UTF8 is the best option.
> 

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth

If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.

On Wed, Jun 4, 2014 at 7:49 AM, Paul Sokolovsky  wrote:
> Hello,
>
> On Wed, 4 Jun 2014 20:53:46 +1000
> Chris Angelico  wrote:
>
>> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky 
>> wrote:
>> > And I'm saying that not to discourage Unicode addition to
>> > MicroPython, but to hint that "force-force" approach implemented by
>> > CPython3 and causing rage and split in the community is not
>> > appreciated.
>>
>> FWIW, it's Python 3 (the language) and not CPython 3.x (the
>> implementation) that specifies Unicode strings in this way.
>
> Yeah, but it's CPython what dictates how language evolves (some people
> even think that it dictates how language should be implemented!), so all
> good parts belong to Python3, and all bad parts - to CPython3,
> right? ;-)
>
>> I don't
>> know why it has to cause a split in the community; this is the one way
>> to make sure *everyone's* strings work perfectly, rather than having
>> ASCII strings work fine and others start tripping over problems in
>> various APIs.
>
> It did cause split in the community, that's the fact, that's why
> Python2 and Python3 are at the respective positions. Anyway, I'm not
> interested in participating in that split, I did not yet uttered my
> opinion on that publicly enough, so I seized a chance to drop some
> witty remarks, but I don't want to start yet another Unicode flame.
>
>
>
> So, let's please be back to Unicode storage representation in
> MicroPython. So, https://github.com/micropython/micropython/issues/657
> discussed technical aspects, in a recent mail on this list I expressed
> my opinion why following CPython way is not productive (for development
> satisfaction and evolution of Python community, to be explicit).
>
> Final argument I would have is that you certainly can implement Unicode
> support the PEP393 way - it would be enormous help and would be gladly
> accepted. The question, how useful it will be for MicroPython. It
> certainly will be useful to report passing of testsuites. But will it
> be *really* used?
>
> For microcontroller board, it might be too heavy (put simple, with it,
> people will be able to do less (== heap running out sooner)), than
> without it, so one may expect it to be disabled by default. Then POSIX
> port is there surely not to let people replace "python" command
> with "micropython" and run Django, but to let people develop and debug
> their apps with more comfort than on embedded board. So, it should
> behave close to MCU version, and would follow with MCU choice
> re: Unicode.
>
> That's actually the reason why I keep up this discussion - not for the
> sake of argument or to bash Python3's Unicode choices. With recent
> MicroPython announcement, we surely looked for more people to
> contribute to its development. But then we (or at least I can speak for
> myself), would like to make sure that these contribution are actually
> the most useful ones (for both MicroPython, and Python community in
> general, which gets more choices, rather than just getting N% smaller
> CPython rewrite).
>
> So, you're not sure how O(N) string indexing will work? But MicroPython
> offers a great opportunity to try! And it's something new and exciting,
> which surely will be useful (== will save people memory), not just
> something old and boring ;-).
>
>
>>
>> ChrisA
>
>
> --
> Best regards,
>  Paul  mailto:pmis...@gmail.com
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 21:17:12 +1000
Chris Angelico  wrote:

> On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky 
> wrote:
> > An alternative view is that the discussion on the tracker showed
> > Python developers' mind-fixation on implementing something the way
> > CPython does it. And I didn't yet go to that argument, but in the
> > end, MicroPython does not try to rewrite CPython or compete with
> > it. So, having few choices with pros and cons leading approximately
> > to the tie among them, it's the least productive to make the same
> > choice as CPython did.
> 
> I'm not a CPython dev, nor a Python dev, and I don't think any of the
> big names of CPython or Python has showed up on that tracker as yet.
> But why is "be different from CPython" such a valuable choice? CPython
> works. It's had many hours of dev time put into it. 

Exactly, CPython (already) exists, and it works, so people can just use
it. MicroPython's aim is to go where CPython didn't, and couldn't, go.
For that, it's got to be different, or it literally won't fit there,
like CPython doesn't.

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steve Dower

I'm agree with Daniel. Directly indexing into text suggests an attempted 
optimization that is likely to be incorrect for a set of strings. Splitting, 
regex, concatenation and formatting are really the main operations that matter, 
and MicroPython can optimize their implementation of these easily enough for 
O(N) indexing.

Cheers,
Steve

Top-posted from my Windows Phone

From: Daniel Holth
Sent: ‎6/‎4/‎2014 5:17
To: Paul Sokolovsky
Cc: python-dev
Subject: Re: [Python-Dev] Internal representation of strings and Micropython

If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.

On Wed, Jun 4, 2014 at 7:49 AM, Paul Sokolovsky  wrote:
> Hello,
>
> On Wed, 4 Jun 2014 20:53:46 +1000
> Chris Angelico  wrote:
>
>> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky 
>> wrote:
>> > And I'm saying that not to discourage Unicode addition to
>> > MicroPython, but to hint that "force-force" approach implemented by
>> > CPython3 and causing rage and split in the community is not
>> > appreciated.
>>
>> FWIW, it's Python 3 (the language) and not CPython 3.x (the
>> implementation) that specifies Unicode strings in this way.
>
> Yeah, but it's CPython what dictates how language evolves (some people
> even think that it dictates how language should be implemented!), so all
> good parts belong to Python3, and all bad parts - to CPython3,
> right? ;-)
>
>> I don't
>> know why it has to cause a split in the community; this is the one way
>> to make sure *everyone's* strings work perfectly, rather than having
>> ASCII strings work fine and others start tripping over problems in
>> various APIs.
>
> It did cause split in the community, that's the fact, that's why
> Python2 and Python3 are at the respective positions. Anyway, I'm not
> interested in participating in that split, I did not yet uttered my
> opinion on that publicly enough, so I seized a chance to drop some
> witty remarks, but I don't want to start yet another Unicode flame.
>
>
>
> So, let's please be back to Unicode storage representation in
> MicroPython. So, https://github.com/micropython/micropython/issues/657
> discussed technical aspects, in a recent mail on this list I expressed
> my opinion why following CPython way is not productive (for development
> satisfaction and evolution of Python community, to be explicit).
>
> Final argument I would have is that you certainly can implement Unicode
> support the PEP393 way - it would be enormous help and would be gladly
> accepted. The question, how useful it will be for MicroPython. It
> certainly will be useful to report passing of testsuites. But will it
> be *really* used?
>
> For microcontroller board, it might be too heavy (put simple, with it,
> people will be able to do less (== heap running out sooner)), than
> without it, so one may expect it to be disabled by default. Then POSIX
> port is there surely not to let people replace "python" command
> with "micropython" and run Django, but to let people develop and debug
> their apps with more comfort than on embedded board. So, it should
> behave close to MCU version, and would follow with MCU choice
> re: Unicode.
>
> That's actually the reason why I keep up this discussion - not for the
> sake of argument or to bash Python3's Unicode choices. With recent
> MicroPython announcement, we surely looked for more people to
> contribute to its development. But then we (or at least I can speak for
> myself), would like to make sure that these contribution are actually
> the most useful ones (for both MicroPython, and Python community in
> general, which gets more choices, rather than just getting N% smaller
> CPython rewrite).
>
> So, you're not sure how O(N) string indexing will work? But MicroPython
> offers a great opportunity to try! And it's something new and exciting,
> which surely will be useful (== will save people memory), not just
> something old and boring ;-).
>
>
>>
>> ChrisA
>
>
> --
> Best regards,
>  Paul  mailto:pmis...@gmail.com
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/steve.dower%40microsoft.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Mark Lawrence


On 04/06/2014 11:53, Paul Sokolovsky wrote:

Hello,

On Tue, 3 Jun 2014 22:23:07 -0700
Guido van Rossum  wrote:

[]

Never mind disabling assertions -- even with enabled assertions you'd
have to expect most Python programs to fail with non-ASCII input.

Then again the UTF-8 option would be pretty devastating too for
anything manipulating strings (especially since many Python APIs are
defined using indexes, e.g. the re module).


If the Unicode is slow (*), then obvious choice is not using Unicode
when not needed. Too bad that's a bit hard in Python3, as it enforces
Unicode everywhere, and dealing with efficient strings requires
prefixing them with funny characters like "b", etc.

* If Unicode if slow because it causes heap to bloat and go swap, the
choice is still the same.


Where is your evidence that (presumably) CPython unicode is slow?  What 
is your response to this message 
http://bugs.python.org/issue16061#msg171413 from the bug tracker?






Why not support variable-width strings like CPython 3.4?


Because, like good deal of community, we hope that Python4 will get
back to reality, and strings will be efficient (both for processing and
storage) by default, and niche and marginal "Unicode string" type will
be used explicitly (using funny prefixes, etc.), only when really
needed.


Where is your evidence that supports the above claim?




Ah, all these not so funny geek jokes about internals of language
implementation, hope they didn't make somebody's day dull!



--
--Guido van Rossum (python.org/~guido)







--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
http://www.avast.com


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Nick Coghlan

On 4 June 2014 15:39,   wrote:
> On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
>
>> There's a general expectation that indexing will be O(1) because all
>> the builtin containers that support that syntax use it for O(1) lookup
>> operations.
>
> Depending on your definition of built in, there is at least one standard
> library container that does not - collections.deque.
>
> Given the specialized kinds of application this Python implementation is
> targetted at, it seems UTF-8 is ideal considering the huge memory
> savings resulting from the compressed representation, and the reduced
> likelihood of there being any real need for serious text processing on
> the device.

Right - I wasn't clear that I think storing text internally as UTF-8
sounds fine for MicroPython. Anything where the O(N) nature of
indexing by code point matters probably won't be run in that
environment anyway.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 04:17, Steven D'Aprano написав(ла):

Would either of these trade-offs be acceptable while still claiming
"Python 3.4 compatibility"?

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.


I think than breaking O(1) expectation for indexing makes the 
implementation significant incompatible with Python. Virtually all 
string operations in Python operates with indices.


O(1) indexing operations can be kept with minimal memory requirements if 
implement Unicode internally as modified UTF-8 plus optional array of 
offsets for every, say, 32th character (which even can be compressed to 
an array of 16-bit or 32-bit integers).


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth

MicroPython is going to be significantly incompatible with Python
anyway. But you should be able to run your mp code on regular Python.

On Wed, Jun 4, 2014 at 9:39 AM, Serhiy Storchaka  wrote:
> 04.06.14 04:17, Steven D'Aprano написав(ла):
>
>> Would either of these trade-offs be acceptable while still claiming
>> "Python 3.4 compatibility"?
>>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>
>
> I think than breaking O(1) expectation for indexing makes the implementation
> significant incompatible with Python. Virtually all string operations in
> Python operates with indices.
>
> O(1) indexing operations can be kept with minimal memory requirements if
> implement Unicode internally as modified UTF-8 plus optional array of
> offsets for every, say, 32th character (which even can be compressed to an
> array of 16-bit or 32-bit integers).
>
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Moore

On 4 June 2014 14:39, Serhiy Storchaka  wrote:
> I think than breaking O(1) expectation for indexing makes the implementation
> significant incompatible with Python. Virtually all string operations in
> Python operates with indices.

I don't use indexing on strings except in rare situations. Sure I use
lots of operations that may well use indexing *internally* but that's
the point. MicroPython can optimise those operations without needing
to guarantee O(1) indexing, and I'd be fine with that.

Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano

On Wed, Jun 04, 2014 at 01:14:04PM +, Steve Dower wrote:
> I'm agree with Daniel. Directly indexing into text suggests an 
> attempted optimization that is likely to be incorrect for a set of 
> strings. 

I'm afraid I don't understand this argument. The language semantics says 
that a string is an array of code points. Every index relates to a 
single code point, no code point extends over two or more indexes. 
There's a 1:1 relationship between code points and indexes. How is 
direct indexing "likely to be incorrect"?

e.g.

s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'

That cannot fail with Python's semantics.

[Aside: it does fail in Python 2, showing that the idea that "strings 
are bytes" is fatally broken. Fortunately Python has moved beyond that.]

> Splitting, regex, concatenation and formatting are really the 
> main operations that matter, and MicroPython can optimize their 
> implementation of these easily enough for O(N) indexing.

Really? Well, it will be a nice experiment. Fortunately MicroPython runs 
under Linux as well as on embedded systems (a clever decision, by the 
way) so I look forward to seeing how their internal-utf8 implementation 
stacks up against CPython's FSR implementation.

Out of curiosity, when the FSR was proposed, did anyone consider an 
internal UTF-8 representation? If so, why was it rejected?

-- 
Steven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 10:03, Chris Angelico написав(ла):

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to be
copied and doubled in size (lots of heap space used).


But you need non-ASCII characters to display a title of MP3 track.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka  wrote:
> 04.06.14 10:03, Chris Angelico написав(ла):
>
>> Right, which is why I don't like the idea. But you don't need
>> non-ASCII characters to blink an LED or turn a servo, and there is
>> significant resistance to the notion that appending a non-ASCII
>> character to a long ASCII-only string requires the whole string to be
>> copied and doubled in size (lots of heap space used).
>
>
> But you need non-ASCII characters to display a title of MP3 track.

Agreed. IMO, any Python, no matter how micro, needs full Unicode
support; but there is resistance from uPy's devs.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano

On Wed, Jun 04, 2014 at 01:38:57PM +0300, Paul Sokolovsky wrote:

> That's another reason why people don't like Unicode enforced upon them

Enforcing design and language decisions is the job of the programming 
language. You might as well complain that Python forces C doubles as the 
floating point type, or that it forces Bignums as the integer type, or 
that it forces significant indentation, or "class" as a keyword. Or that 
C forces you to use braces and manage your own memory. That's the 
purpose of the language, to make those decisions as to what features to 
provide and what not to provide.

> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script 

I have no words to describe how ridiculous this accusation is.

> then give up on their arbitrarily chosen, one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters are not know to runtime, only in bytes

What's does uPy return for the length of '∞'? If the answer is anything 
but 1, that's a bug.

-- 
Steven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Thu, 5 Jun 2014 00:26:10 +1000
Chris Angelico  wrote:

> On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
>  wrote:
> > 04.06.14 10:03, Chris Angelico написав(ла):
> >
> >> Right, which is why I don't like the idea. But you don't need
> >> non-ASCII characters to blink an LED or turn a servo, and there is
> >> significant resistance to the notion that appending a non-ASCII
> >> character to a long ASCII-only string requires the whole string to
> >> be copied and doubled in size (lots of heap space used).
> >
> >
> > But you need non-ASCII characters to display a title of MP3 track.

Yes, but to display a title, you don't need to do codepoint access at
random - you need to either take a block of memory (length in bytes) and
do something with it (pass to a C function, transfer over some bus,
etc.), or *iterate in order* over codepoints in a string. All these
operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Some operations are not going to be as fast, so - oops - avoid doing
them without good reason. And kindly drop expectations that doing
arbitrary operations on *Unicode* are as efficient as you imagined.
(Note the *Unicode* in general, not particular flavor of which you got
used to, up to thinking it's the one and only "right" flavor.)

> Agreed. IMO, any Python, no matter how micro, needs full Unicode
> support; but there is resistance from uPy's devs.

FUD ;-).

> 
> ChrisA

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky  wrote:
>> > But you need non-ASCII characters to display a title of MP3 track.
>
> Yes, but to display a title, you don't need to do codepoint access at
> random - you need to either take a block of memory (length in bytes) and
> do something with it (pass to a C function, transfer over some bus,
> etc.), or *iterate in order* over codepoints in a string. All these
> operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Suppose you have a long title, and you need to abbreviate it by
dropping out words (delimited by whitespace), such that you keep the
first word (always) and the last (if possible) and as many as possible
in between. How are you going to write that? With PEP 393 or UTF-32
strings, you can simply record the index of every whitespace you find,
count off lengths, and decide what to keep and what to ellipsize.

> Some operations are not going to be as fast, so - oops - avoid doing
> them without good reason. And kindly drop expectations that doing
> arbitrary operations on *Unicode* are as efficient as you imagined.
> (Note the *Unicode* in general, not particular flavor of which you got
> used to, up to thinking it's the one and only "right" flavor.)

Not sure what you mean by flavors of Unicode. Unicode is a mapping of
codepoints to characters, not an in-memory representation. And I've
been working with Python 3.3 since before it came out, and with Pike
(which has a very similar model) for longer, and in both of them, I
casually perform operations on Unicode strings in the same way that I
used to perform operations on REXX strings (which were eight-bit in
the current system codepage - 437 for us). I do expect those
operations to be efficient, and I get what I expect.

Maybe they won't be in uPy, but that would be a limitation of uPy, not
a fundamental problem with Unicode.

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 17:02, Paul Moore написав(ла):

On 4 June 2014 14:39, Serhiy Storchaka  wrote:

I think than breaking O(1) expectation for indexing makes the implementation
significant incompatible with Python. Virtually all string operations in
Python operates with indices.


I don't use indexing on strings except in rare situations. Sure I use
lots of operations that may well use indexing *internally* but that's
the point. MicroPython can optimise those operations without needing
to guarantee O(1) indexing, and I'd be fine with that.


Any non-trivial text parsing uses indices or regular expressions (and 
regular expressions themself use indices internally).


It would be interesting to collect a statistic about how many indexing 
operations happened during the life of a string in typical (Micro)Python 
program.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth

On Wed, Jun 4, 2014 at 10:12 AM, Steven D'Aprano  wrote:
> On Wed, Jun 04, 2014 at 01:14:04PM +, Steve Dower wrote:
>> I'm agree with Daniel. Directly indexing into text suggests an
>> attempted optimization that is likely to be incorrect for a set of
>> strings.
>
> I'm afraid I don't understand this argument. The language semantics says
> that a string is an array of code points. Every index relates to a
> single code point, no code point extends over two or more indexes.
> There's a 1:1 relationship between code points and indexes. How is
> direct indexing "likely to be incorrect"?

"Useful" is probably a better word. When you get into the complicated
languages and you want to know how wide something is, and you might
have y with two dots on it as one code point or two and left-to-right
and right-to-left indicators and who knows what else... then looking
at individual code points only works sometimes. I get the slicing
idea.

I like the idea that encoding to utf-8 would be the fastest thing you
can do with a string. You could consider doing regexps in that domain,
and other implementation specific optimizations in exactly the same
way that any Python implementation has them.

None of this would make it harder to move a servo.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steve Dower

Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing the behaviour under a different (hypothetical) design decision 
than a 1:1 relationship between code points and indexes, so arguing from that 
stance doesn't make much sense.

> e.g.
> 
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
> 
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being 
incorrect for the goal, not the language semantics). What you'd probably find 
is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is 
also correct.

But what are you trying to achieve (why are you writing this code)? All this 
example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use 
indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always) 
> and
> the last (if possible) and as many as possible in between. How are you going 
> to
> write that? With PEP 393 or UTF-32 strings, you can simply record the index of
> every whitespace you find, count off lengths, and decide what to keep and what
> to ellipsize.

"Recording the index" is where the optimization comes in. With a 
variable-length encoding - heck, even with a fixed-length one - I'd just use 
str.split(' ') (or re.split('\\s', string), depending on how much I care about 
the type of delimiter) and manipulate the list.

If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', 
string) also provides the same behaviour and gives me the sliced string, so 
there's no need to index for anything.

The downside is that it isn't as easy to teach as the 1:1 relationship, and 
currently it doesn't perform as well *in CPython*. But if MicroPython is 
focusing on size over speed, I don't see any reason why they shouldn't permit 
different performance characteristics and require a slightly different approach 
to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the 
Python interpreter ecosystem. Jython and IronPython already have different 
string implementations from CPython - having official (and hopefully flexible) 
guidance on deviations from the reference implementation would I think help 
other implementations provide even more value, which is only a good thing for 
Python.

Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 04 Jun 2014 17:40:14 +0300
Serhiy Storchaka  wrote:

> 04.06.14 17:02, Paul Moore написав(ла):
> > On 4 June 2014 14:39, Serhiy Storchaka  wrote:
> >> I think than breaking O(1) expectation for indexing makes the
> >> implementation significant incompatible with Python. Virtually all
> >> string operations in Python operates with indices.
> >
> > I don't use indexing on strings except in rare situations. Sure I
> > use lots of operations that may well use indexing *internally* but
> > that's the point. MicroPython can optimise those operations without
> > needing to guarantee O(1) indexing, and I'd be fine with that.
> 
> Any non-trivial text parsing uses indices or regular expressions (and 
> regular expressions themself use indices internally).

I keep hearing this stuff, and unfortunately so far don't have enough
time to collect all that stuff and provide detailed response. So,
here's spur of the moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to string and taking substrings of a string.

Character-by-character random access imply that you would need to scan
thru (possibly almost) all chars in a string. That's O(N) (N-length of
string). With varlength encoding (taking O(N) to index arbitrary char),
there's thus concern that this would be O(N^2) op.

But show me real-world case for that. Common usecase is scanning string
left-to-right, that should be done using iterator and thus O(N).
Right-to-left scanning would be order(s) of magnitude less frequent, as
and also handled by iterator.

What's next? You're doing some funky anagrams and need to swap each 2
adjacent chars? Sorry, naive implementation will be slow. If you're in
serious anagram business, you'll need to code C extension. No, wait!
Instead you should learn Python better. You should run a string
windowing iterator which will return adjacent pair and swap those
constant-len strings.

More cases anyone? Implementing DES and doing arbitrary permutations?
Kindly drop doing that on strings, do it on bytes or lists.

Hopefully, the idea is clear - if you *scan* thru string using indexes
in *random* order, you're doing weird thing and *want* weird
performance. Doing stuff is s[0] ot s[-1] - there's finite (and small)
number of such operation per strings.

Now about taking substrings of strings (which in Python often expressed
by slice indexing). Well, this is quite different from scanning each
character of a strings. Just like s[0]/s[-1] this usually happens
finite number of times for a particular string, independent of its
length, i.e. O(1) times (ex, you take a string and split it in 3
parts), or maybe number of substrings is not bound-fixed, but has
different growth order, O(M) (for example, you split string in tokens,
tokens can be long, but there're usually external limits on how many
it's sensible to have on one line).

So, again, you're not going to get quadric time unless you're unlucky
or sloppy. And just again, you should brush up your Python skills and
use regex functions shich return iterators to get your parsed tokens,
etc.

(To clarify the obvious - "you" here is abstract pronoun, not referring
to respectable Python developers who actually made it possible to write
efficient Python programs).

So, hopefully the point is conveyed - you can write inefficient Python
programs. CPython goes out of the way to hide many inefficiencies (using
unbelievably bloated heap usage - from uPy's point of view, which
starts up in 2K heap). You just shouldn't write inefficient programs,
voila. But if you want, you can keep writing inefficient programs, they
just will be inefficient. Peace.

> It would be interesting to collect a statistic about how many
> indexing operations happened during the life of a string in typical
> (Micro)Python program.

Yup.

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steve Dower

Paul Sokolovsky wrote:
> You just shouldn't write inefficient programs, voila. But if you want, you 
> can keep writing inefficient programs, they just will be inefficient. Peace.

Can I nominate this for QOTD? :)

Cheers,
Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Mark Lawrence


On 04/06/2014 16:32, Steve Dower wrote:


If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', 
string) also provides the same behaviour and gives me the sliced string, so 
there's no need to index for anything.



Out of idle curiosity is there anything that stops MicroPython, or any 
other implementation for that matter, from providing views of a string 
rather than copying every time?  IIRC memoryviews in CPython rely on the 
buffer protocol at the C API level, so since strings don't support this 
protocol you can't take a memoryview of them.  Could this actually be 
implemented in the future, is the underlying C code just too 
complicated, or what?


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
http://www.avast.com


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Thu, 5 Jun 2014 01:00:52 +1000
Chris Angelico  wrote:

> On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky 
> wrote:
> >> > But you need non-ASCII characters to display a title of MP3
> >> > track.
> >
> > Yes, but to display a title, you don't need to do codepoint access
> > at random - you need to either take a block of memory (length in
> > bytes) and do something with it (pass to a C function, transfer
> > over some bus, etc.), or *iterate in order* over codepoints in a
> > string. All these operations are as efficient (O-notation) for
> > UTF-8 as for UTF-32.
> 
> Suppose you have a long title, and you need to abbreviate it by
> dropping out words (delimited by whitespace), such that you keep the
> first word (always) and the last (if possible) and as many as possible
> in between. How are you going to write that? With PEP 393 or UTF-32
> strings, you can simply record the index of every whitespace you find,
> count off lengths, and decide what to keep and what to ellipsize.

I'll submit angry bugreport along the lines of "WWWHAT, it's 3.5 and
there's still no str.isplit()??!!11", then do it with re.finditer()
(while submitting another report on inconsistent naming scheme).

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread INADA Naoki

For Jython and IronPython, UTF-16 may be best internal encoding.

Recent languages (Swiffy, Golang, Rust) chose UTF-8 as internal encoding.
Using utf-8 is simple and efficient. For example, no need for utf-8
copy of the string when writing to file
and serializing to JSON.

When implementing Python using these languages, UTF-8 will be best
internal encoding.

To allow Python implementations other than CPython can use UTF-8 or
UTF-16 as internal encoding efficiently,
I think adding internal position based API is the best solution.

>>> s = "\U0010x"
>>> len(s)
2
>>> s[1:]
'x'
>>> s.find('x')
1
>>> # s.isize() # Internal length. 5 for utf-8, 3 for utf-16
>>> # s.ifind('x') # Internal position, 4 for utf-8, 2 for utf-16
>>> # s.islice(s.ifind('x')) => 'x'


(I like design of golang and Rust. I hope CPython uses utf-8 as
internal encoding in the future.
But this is off-topic.)


On Wed, Jun 4, 2014 at 4:41 PM, Jeff Allen  wrote:
> Jython uses UTF-16 internally -- probably the only sensible choice in a
> Python that can call Java. Indexing is O(N), fundamentally. By
> "fundamentally", I mean for those strings that have not yet noticed that
> they contain no supplementary (>0x) characters.
>
> I've toyed with making this O(1) universally. Like Steven, I understand this
> to be a freedom afforded to implementers, rather than an issue of
> conformity.
>
> Jeff Allen
>
>
> On 04/06/2014 02:17, Steven D'Aprano wrote:
>>
>> There is a discussion over at MicroPython about the internal
>> representation of Unicode strings.
>
> ...
>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>>
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com



-- 
INADA Naoki  
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 18:38, Paul Sokolovsky написав(ла):

Any non-trivial text parsing uses indices or regular expressions (and
regular expressions themself use indices internally).


I keep hearing this stuff, and unfortunately so far don't have enough
time to collect all that stuff and provide detailed response. So,
here's spur of the moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to string and taking substrings of a string.

Character-by-character random access imply that you would need to scan
thru (possibly almost) all chars in a string. That's O(N) (N-length of
string). With varlength encoding (taking O(N) to index arbitrary char),
there's thus concern that this would be O(N^2) op.

But show me real-world case for that. Common usecase is scanning string
left-to-right, that should be done using iterator and thus O(N).
Right-to-left scanning would be order(s) of magnitude less frequent, as
and also handled by iterator.


html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't 
use iterators. They use indices, str.find and/or regular expressions. 
Common use case is quickly find substring starting from current position 
using str.find or re.search, process found token, advance position and 
repeat.



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread MRAB


On 2014-06-04 14:33, Nick Coghlan wrote:

On 4 June 2014 15:39,   wrote:

On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:


There's a general expectation that indexing will be O(1) because
all the builtin containers that support that syntax use it for
O(1) lookup operations.


Depending on your definition of built in, there is at least one
standard library container that does not - collections.deque.

Given the specialized kinds of application this Python
implementation is targetted at, it seems UTF-8 is ideal considering
the huge memory savings resulting from the compressed
representation, and the reduced likelihood of there being any real
need for serious text processing on the device.


Right - I wasn't clear that I think storing text internally as UTF-8
sounds fine for MicroPython. Anything where the O(N) nature of
indexing by code point matters probably won't be run in that
environment anyway.


In order to avoid indexing, you could use some kind of 'cursor' class to
step forwards and backwards along strings. The cursor could include
both the codepoint index and the byte index.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 04 Jun 2014 19:49:18 +0300
Serhiy Storchaka  wrote:

[]
> > But show me real-world case for that. Common usecase is scanning
> > string left-to-right, that should be done using iterator and thus
> > O(N). Right-to-left scanning would be order(s) of magnitude less
> > frequent, as and also handled by iterator.
> 
> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> don't use iterators. They use indices, str.find and/or regular
> expressions. Common use case is quickly find substring starting from
> current position using str.find or re.search, process found token,
> advance position and repeat.

That's sad, I agree.


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 19:52, MRAB написав(ла):

In order to avoid indexing, you could use some kind of 'cursor' class to
step forwards and backwards along strings. The cursor could include
both the codepoint index and the byte index.


So you need different string library and different regular expression 
library.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 17:49, Paul Sokolovsky написав(ла):

On Thu, 5 Jun 2014 00:26:10 +1000
Chris Angelico  wrote:

On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
 wrote:

04.06.14 10:03, Chris Angelico написав(ла):

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to
be copied and doubled in size (lots of heap space used).

But you need non-ASCII characters to display a title of MP3 track.


Yes, but to display a title, you don't need to do codepoint access at
random - you need to either take a block of memory (length in bytes) and
do something with it (pass to a C function, transfer over some bus,
etc.), or *iterate in order* over codepoints in a string. All these
operations are as efficient (O-notation) for UTF-8 as for UTF-32.


Several previous comments discuss first option, ASCII-only strings.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


04.06.14 20:05, Paul Sokolovsky написав(ла):

On Wed, 04 Jun 2014 19:49:18 +0300
Serhiy Storchaka  wrote:

html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
don't use iterators. They use indices, str.find and/or regular
expressions. Common use case is quickly find substring starting from
current position using str.find or re.search, process found token,
advance position and repeat.


That's sad, I agree.


Other languages (Go, Rust) can be happy without O(1) indexing of 
strings. All string and regex operations work with iterators or cursors, 
and I believe this approach is not significant worse than implementing 
strings as O(1)-indexable arrays of characters (for some applications it 
can be worse, for other it can be better). But Python is different 
language, it has different operations for strings and different idioms. 
A language which doesn't support O(1) indexing is not Python, it is only 
Python-like language.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Stephen J. Turnbull

Serhiy Storchaka writes:

 > It would be interesting to collect a statistic about how many indexing 
 > operations happened during the life of a string in typical (Micro)Python 
 > program.

Probably irrelevant (I doubt anybody is going to be writing
programmers' editors in MicroPython), but by far the most frequently
called functions in XEmacs are byte_to_char_index and its inverse.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Guido van Rossum

This thread has devolved into a flame war. I think we should trust the
Micropython implementers (whoever they are -- are they participating here?)
to know their users and let them do what feels right to them. We should
just ask them not to claim full compatibility with any particular Python
version -- that seems the most contentious point. Realistically, most
Python code that works on Python 3.4 won't work on Micropython (for various
reasons, not just the string behavior) and neither does it need to.

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 04 Jun 2014 20:52:14 +0300
Serhiy Storchaka  wrote:

[]
> > That's sad, I agree.
> 
> Other languages (Go, Rust) can be happy without O(1) indexing of 
> strings. All string and regex operations work with iterators or
> cursors, and I believe this approach is not significant worse than
> implementing strings as O(1)-indexable arrays of characters (for some
> applications it can be worse, for other it can be better). But Python
> is different language, it has different operations for strings and
> different idioms. A language which doesn't support O(1) indexing is
> not Python, it is only Python-like language.

Sorry, but that's just your personal opinion, not shared by other
developers, as this thread showed. And let's not pretend we live in
happy-ever world of Python 1.5.2 which doesn't need anything more
because it's perfect as it is. Somebody added all those iterators and
iterator-returning functions to Pythons. And then the problem Python
has is a typical "last mile" problem, that iterators were not applied
completely everywhere. There's little choice but to move in that
direction, though.

What you call "idioms", other people call "sloppy programming
practices". There's common suggestion how to be at peace with Python's
indentation for those who find it a problem - "get over it". Well,
somehow it itches to say same for people who think that Python3 should
be used the same way as Python1: Get over the fact that Python is no
longer little funny language being laughed at by Perl crowd for being
order of magnitude slower at processing text files. While you still can
do little funny tricks we all love Python for, it now also offers
framework to do it right, and it makes little sense saying that doing it
little funny way is the definitive trait of Python.

(And for me it's easy to be such categorical - the only way I could
subscribe to idea of running Python on an MCU and not be laughable is by
trusting Python to provide framework for being efficient. I quit
working on another language because I have trusted that iterator,
generator, buffer protocols are not little funny things but thoroughly
engineered efficient concepts, and I don't feel betrayed.)

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Should standard library modules optimize for CPython?

2014-06-04 Thread Stefan Behnel

Sturla Molden, 03.06.2014 22:51:
> Stefan Behnel wrote:
>> So the
>> argument in favour is mostly a pragmatic one. If you can have 2-5x faster
>> code essentially for free, why not just go for it?
> 
> I would be easier if the GIL or Cython's use of it was redesigned. Cython
> just grabs the GIL and holds on to it until it is manually released. The
> standard lib cannot have packages that holds the GIL forever, as a Cython
> compiled module would do. Cython has to start sharing access the GIL like
> the interpreter does.

Granted. This shouldn't be all that difficult to add as a special case when
compiling .py (not .pyx) files. Properly tuning it (i.e. avoiding to inject
the GIL release-acquire cycle in the wrong spots) may take a while, but
that can be improved over time.

(It's not required in .pyx files because users should rather explicitly
write "with nogil: pass" there to manually enable thread switches in safe
and desirable places.)

Stefan


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano

On Wed, Jun 04, 2014 at 03:32:25PM +, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
> 
> We're discussing the behaviour under a different (hypothetical) design 
> decision than a 1:1 relationship between code points and indexes, so 
> arguing from that stance doesn't make much sense.

I'm open to different implementations. I earlier even suggested that the 
choice of O(1) indexing versus O(N) indexing was a quality of 
implementation issue, not a make-or-break issue for whether something 
can call itself Python (or even 99% compatible with Python").

But I don't believe that exposing that implementation at the Python 
level is valid: regardless of whether it is efficient or not, I should 
be able to write code like this:

a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

That is not the case if you expose the underlying byte-level 
implementation at the Python level, and treat strings as an array of 
*bytes*. Paul seems to want to do this, or at least he wants Python 4 
to do this. I think it is *completely* inappropriate to do so.

I *think* you may agree with me, (correct me if I'm wrong) because you 
go on to agree with me:

> > e.g.
> > 
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> > 
> > That cannot fail with Python's semantics.
> 
> Agreed, and it shouldn't 

but I'm not actually sure.

> (I was actually referring to the optimization 
> being incorrect for the goal, not the language semantics). What you'd 
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may 
> be surprising, but is also correct.

You don't seem to be taking about sys.getsizeof, so I guess you're 
talking about something at the C level (or other underlying 
implementation), ignoring the object overhead. I don't know why you 
think I'd find that surprising -- one cannot fit 0x10 Unicode code 
points in a single byte, so whether you use UTF-32, UTF-16, UTF-8, 
Python 3.3's FSR or some other implementation, at least some code points 
are going to use more than one byte.

> But what are you trying to achieve (why are you writing this code)? 
> All this example really shows is that you're only using indexing for 
> trivial purposes.

I'm trying to understand what point you are trying to make, because I'm 
afraid I don't quite get it.

[...]
> If copying into a separate list is a problem (memory-wise), 
> re.finditer('\\S+', string) also provides the same behaviour and gives 
> me the sliced string, so there's no need to index for anything.

finditer returns a bunch of MatchObjects, which give you the indexes 
of the found substring. Whether you do it yourself, or get the re 
module to do it, you're indexing somewhere.

> The downside is that it isn't as easy to teach as the 1:1 
> relationship, and currently it doesn't perform as well *in CPython*. 
> But if MicroPython is focusing on size over speed, I don't see any 
> reason why they shouldn't permit different performance characteristics 
> and require a slightly different approach to highly-optimized coding.

I don't have a problem with different implementations, so long as that 
implementation isn't exposed at the Python level with changes of 
semantics such as breaking the promise that a string is an array of code 
points, not of bytes.

> In any case, this is an interesting discussion with a genuine effect 
> on the Python interpreter ecosystem. Jython and IronPython already 
> have different string implementations from CPython - having official 
> (and hopefully flexible) guidance on deviations from the reference 
> implementation would I think help other implementations provide even 
> more value, which is only a good thing for Python.

Yes, agreed.

-- 
Steven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman


On 6/4/2014 6:14 AM, Steve Dower wrote:
I'm agree with Daniel. Directly indexing into text suggests an 
attempted optimization that is likely to be incorrect for a set of 
strings. Splitting, regex, concatenation and formatting are really the 
main operations that matter, and MicroPython can optimize their 
implementation of these easily enough for O(N) indexing.


Cheers,
Steve

Top-posted from my Windows Phone

From: Daniel Holth 
Sent: ‎6/‎4/‎2014 5:17
To: Paul Sokolovsky 
Cc: python-dev 
Subject: Re: [Python-Dev] Internal representation of strings and 
Micropython


If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.


I think native UTF-8 support is the most promising route for a 
micropython Unicode support.


It would be an interesting proof-of-concept to implement an alternative 
CPython with PEP-393 replaced by UTF-8 internally... doing conversions 
for APIs that require a different encoding, but always maintaining and 
computing with the UTF-8 representation.


1) The first proof-of-concept implementation should implement codepoint 
indexing as a O(N) operation, searching from the beginning of the string 
for the Nth codepoint.


Other Proof-of-concept implementation could implement a codepoint 
boundary cache, there could be a variety of caching algorithms.


2) (Least space efficient) An array that could be indexed by codepoint 
position and result in byte position. (This would use more space than a 
UTF-32 representation!)


3) (Most space efficient) One cached entry, that caches the last 
codepoint/byte position referenced. UTF-8 is able to be traversed in 
either direction, so "next/previous" codepoint access would be 
relatively fast (and such are very common operations, even when indexing 
notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)


4) (Fixed size caches)  N entries, one for the last codepoint, and 
others at Codepoint_Length/N intervals.  N could be tunable.


5) (Fixed size caches)  Like 4, plus an extra entry like 3.

6) (Variable size caches)  Like 2, but only indexing every  Nth code 
point.  N could be tunable.


7) (Variable size caches)  Like 6, plus an extra entry like 3.

8) (Content specific variable size caches)  Index each codepoint that is 
a different byte size than the previous codepoint, allowing indexing to 
be used in the intervals. Worst case size is like 2, best case size is a 
single entry for the end, when all code points are represented by the 
same number of bytes.


9) (Content specific variable size caches)  Like 8, only cache entries 
could indicate fixed or variable size characters in the next interval, 
with a scheme like 4 or 6 used to prevent one interval from covering the 
whole string.


Other hybrid schemes may present themselves as useful once experience is 
gained with some of these. It might be surprising how few algorithms 
need more than algorithm 3 to get reasonable performance.


Glenn
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 11:25:51 -0700
Guido van Rossum  wrote:

> This thread has devolved into a flame war. I think we should trust the
> Micropython implementers (whoever they are -- are they participating
> here?) 

I'm a regular contributor. I'm not sure if the author, Damien George,
is on the list. In either case, he's a nice guy who prefer to do
development rather than participate in flame wars ;-). And for the
record, all opinions expressed are solely mine, and not official
position of MicroPython project.

> to know their users and let them do what feels right to them.
> We should just ask them not to claim full compatibility with any
> particular Python version -- that seems the most contentious point.

"Full" compatibility is never claimed, and understanding it as such is
optimistic, "between the lines" reading of some users. All of:
announcement posted on python-list (which prompted current inflow of
MicroPython-related discussions), README at
https://github.com/micropython/micropython , and detailed differences
doc https://github.com/micropython/micropython/wiki/Differences make it
clear there's no talk about "full" compatibility, and only specific
compatibility (and incompatibility) points are claimed.

That said, and unlike previous attempts to develop a small Python
implementations (which of course existed), we're striving to be exactly
a Python language implementation, not a Python-like language
implementation. As there's no formal, implementation-independent
language spec, what constitutes a compatible language implementation is
subject to opinions, and we welcome and appreciate independent review,
like this thread did.

> Realistically, most Python code that works on Python 3.4 won't work
> on Micropython (for various reasons, not just the string behavior)
> and neither does it need to.

That's true. However, as was said, we're striving to provide a
compatible implementation, and compatibility claims must be validated.
While we have simple "in-house" testsuite, more serious compatibility
validation requires running a testsuite for reference implementation
(CPython), and that's gradually being approached.

> 
> -- 
> --Guido van Rossum (python.org/~guido)

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy


On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0x) characters.

I've toyed with making this O(1) universally. Like Steven, I understand
this to be a freedom afforded to implementers, rather than an issue of
conformity.

Jeff Allen

On 04/06/2014 02:17, Steven D'Aprano wrote:

There is a discussion over at MicroPython about the internal
representation of Unicode strings.

...

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.






--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy


On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0x) characters.


Indexing can be made O(log(k)) where k is the number of astral chars, 
and is usually small.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman  wrote:
> 8) (Content specific variable size caches)  Index each codepoint that is a
> different byte size than the previous codepoint, allowing indexing to be
> used in the intervals. Worst case size is like 2, best case size is a single
> entry for the end, when all code points are represented by the same number
> of bytes.

Conceptually interesting, and I'd love to know how well that'd perform
in real-world usage. Would do very nicely on blocks of text that are
all from the same range of codepoints, but if you intersperse high and
low codepoints it'll be like 2 but with significantly more complicated
lookups (imagine a "name=value\nname=value\n" stream where the names
and values are all in the same language - you'll have a lot of
transitions).

Chrisa
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread R. David Murray

On Thu, 05 Jun 2014 00:14:32 +0300, Paul Sokolovsky  wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

The language reference is also the language specification.  I don't
know what you mean by 'formal', so presumably it doesn't qualify
:)  That said, if there are places that are not correctly marked as
implementation specific, those are bugs in the reference and should
be fixed.  There almost certainly are still such bugs, and I suspect
MicroPython can help us fix them, just as PyPy did/does.

--David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman


On 6/4/2014 2:28 PM, Chris Angelico wrote:

On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman  wrote:

8) (Content specific variable size caches)  Index each codepoint that is a
different byte size than the previous codepoint, allowing indexing to be
used in the intervals. Worst case size is like 2, best case size is a single
entry for the end, when all code points are represented by the same number
of bytes.

Conceptually interesting, and I'd love to know how well that'd perform
in real-world usage.


So would I :)


Would do very nicely on blocks of text that are
all from the same range of codepoints, but if you intersperse high and
low codepoints it'll be like 2 but with significantly more complicated
lookups (imagine a "name=value\nname=value\n" stream where the names
and values are all in the same language - you'll have a lot of
transitions).


Lookup is binary search on code point index or a search for same in some 
tree structure, I would think.


"like 2 but ..." well, the data structure would be bigger than for 2, 
but your example shows 4-5 high codepoints per low codepoint (for some 
languages).


I did just think of another refinement to this technique (my list was 
not intended to be all-inclusive... just a bunch of variations I thought 
of then).


10) (Content specific variable size caches) Like 8, but the last 
character in a run is allowed (but not required) to be a different 
number of bytes than prior characters, because the offset calculation 
will still work for the first character of a different size.


So #10 would halve the size of your imagined stream that intersperses 
one low-byte charater with each sequence of high-byte characters.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy


On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:


That said, and unlike previous attempts to develop a small Python
implementations (which of course existed), we're striving to be exactly
a Python language implementation, not a Python-like language
implementation. As there's no formal, implementation-independent
language spec, what constitutes a compatible language implementation is
subject to opinions, and we welcome and appreciate independent review,
like this thread did.


Realistically, most Python code that works on Python 3.4 won't work
on Micropython (for various reasons, not just the string behavior)
and neither does it need to.


That's true. However, as was said, we're striving to provide a
compatible implementation, and compatibility claims must be validated.
While we have simple "in-house" testsuite, more serious compatibility
validation requires running a testsuite for reference implementation
(CPython), and that's gradually being approached.


I would call what you are doing a 'Python 3.n subset, with limitations', 
where n should be a specific number, which I would urge should be at 
least 3, if not 4 ('yield from'). To me, that would mean that every 
Micropython program (that does not use a clearly non-Python addon like 
inline assembly) would run the same* on CPython 3.n. Conversely, a 
Python 3.n program should either run the same* on MicroPython as 
CPython, or raise. What most to avoid is giving different* answers.


*'same' does not include timing differences or normal float variations 
or bug fixes in MicroPython not in CPython.


As for unicode: I would see ascii-only (very limited codepoints) or bare 
utf-8 (limited speed == expanded time) as possibly fitting the 
definition above. Just be clear what the limitations are. And accept 
that there will be people who do not bother to read the limitations and 
then complain when they bang into them.


PS. You do not seem to be aware of how well the current PEP393 
implementation works. If you are going to write any more about it, I 
suggest you run Tools/Stringbench/stringbench.py for timings.


--
Terry Jan Reedy


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Eric Snow

On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky  wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

Actually, there is a "formal, implementation-independent language spec":

https://docs.python.org/3/reference/

>
>> Realistically, most Python code that works on Python 3.4 won't work
>> on Micropython (for various reasons, not just the string behavior)
>> and neither does it need to.
>
> That's true. However, as was said, we're striving to provide a
> compatible implementation, and compatibility claims must be validated.
> While we have simple "in-house" testsuite, more serious compatibility
> validation requires running a testsuite for reference implementation
> (CPython), and that's gradually being approached.

To a large extent the test suite in
http://hg.python.org/cpython/file/default/Lib/test effectively
validates (full) compliance with the corresponding release (change
"default" to the release branch of your choice).  With that goal, no
small effort has been made to mark implementation-specific tests as
such.  So uPy could consider using the test suite (and explicitly skip
the tests for features that uPy doesn't support).

-eric
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 04 Jun 2014 18:04:52 -0400
Terry Reedy  wrote:

> On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:
> 
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
> >
> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
> 
> I would call what you are doing a 'Python 3.n subset, with

Thanks, that's what we call it ourselves in the docs linked in the
original message, and use n=4. Note that being a subset is not a design
requirement, but there's higher-priority requirement of staying lean,
so realistically uPy will always stay a subset.

> limitations', where n should be a specific number, which I would urge
> should be at least 3, if not 4 ('yield from'). To me, that would mean
> that every Micropython program (that does not use a clearly
> non-Python addon like inline assembly) would run the same* on CPython
> 3.n. Conversely, a Python 3.n program should either run the same* on
> MicroPython as CPython, or raise. What most to avoid is giving
> different* answers.

That's nice aim, to implement which we don't have enough resources, so
would appreciate any help from interested parties.

> *'same' does not include timing differences or normal float
> variations or bug fixes in MicroPython not in CPython.
> 
> As for unicode: I would see ascii-only (very limited codepoints) or
> bare utf-8 (limited speed == expanded time) as possibly fitting the 
> definition above. Just be clear what the limitations are. And accept 
> that there will be people who do not bother to read the limitations
> and then complain when they bang into them.
> 
> PS. You do not seem to be aware of how well the current PEP393 
> implementation works. If you are going to write any more about it, I 
> suggest you run Tools/Stringbench/stringbench.py for timings.

"Well" is subjective (or should be defined formally based on the
requirements). With my MicroPython hat on, an implementation which
receives a string, transcodes it, leading to bigger size, just to
immediately transcode back and send out - is awful, environment
unfriendly implementation ;-).


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


05.06.14 01:04, Terry Reedy написав(ла):

PS. You do not seem to be aware of how well the current PEP393
implementation works. If you are going to write any more about it, I
suggest you run Tools/Stringbench/stringbench.py for timings.


AFAIK stringbench is ASCII-only, so it likely is compatible with current 
and any future MicroPython implementations, but unlikely will expose 
non-ASCII limitations or performance.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Thu, Jun 5, 2014 at 8:52 AM, Paul Sokolovsky  wrote:
> "Well" is subjective (or should be defined formally based on the
> requirements). With my MicroPython hat on, an implementation which
> receives a string, transcodes it, leading to bigger size, just to
> immediately transcode back and send out - is awful, environment
> unfriendly implementation ;-).

Be careful of confusing correctness and performance, though. The
transcoding you describe is inefficient, but (presumably) correct;
something that's fast but wrong is straight-up buggy. You can always
fix inefficiency in a later release, but buggy behaviour sometimes is
relied on (which is why ECMAScript still exposes UTF-16 to scripts,
and why Windows window messages have a WPARAM and an LPARAM, and why
Python's threading module has duplicate names for a lot of functions,
because it's just not worth changing). I'd be much more comfortable
releasing something where "everything works fine, but if you use
astral characters in your strings, memory usage blows out by a factor
of four" (or "... the len() function takes O(N) time") than one where
"everything works fine as long as you use BMP only, but SMP characters
result in tests failing".

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Wed, 4 Jun 2014 16:12:23 -0600
Eric Snow  wrote:

> On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky 
> wrote:
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
> 
> Actually, there is a "formal, implementation-independent language
> spec":
> 
> https://docs.python.org/3/reference/

Opening that link in browser, pressing Ctrl+F and pasting your quote
gives zero hits, so it's not exactly what you claim it to be. It's also
pretty far from being formal (unambiguous, covering all choices, etc.)
and comprehensive. Also, please point me at "conformance" section.

That said, all of us Pythoneers treat it as the best formal reference
available, no news here.

> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
> 
> To a large extent the test suite in
> http://hg.python.org/cpython/file/default/Lib/test effectively
> validates (full) compliance with the corresponding release (change
> "default" to the release branch of your choice).  With that goal, no
> small effort has been made to mark implementation-specific tests as
> such.  So uPy could consider using the test suite (and explicitly skip
> the tests for features that uPy doesn't support).

That's exactly what we do, per the previous paragraph. And we face a
lot of questionable tests, just like you say. Shameless plug: if anyone
interested to run existing code on MicroPython, please help us with
CPython testsuite! ;-)

> 
> -eric



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka


05.06.14 00:21, Terry Reedy написав(ла):

On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0x) characters.


Indexing can be made O(log(k)) where k is the number of astral chars,
and is usually small.


I like your idea and think it would be great if Jython will implement 
it. Unfortunately it is too late to do this in CPython.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Eric Snow

On Wed, Jun 4, 2014 at 5:11 PM, Paul Sokolovsky  wrote:
> On Wed, 4 Jun 2014 16:12:23 -0600
> Eric Snow  wrote:
>> Actually, there is a "formal, implementation-independent language
>> spec":
>>
>> https://docs.python.org/3/reference/
>
> Opening that link in browser, pressing Ctrl+F and pasting your quote
> gives zero hits, so it's not exactly what you claim it to be. It's also
> pretty far from being formal (unambiguous, covering all choices, etc.)
> and comprehensive. Also, please point me at "conformance" section.
>
> That said, all of us Pythoneers treat it as the best formal reference
> available, no news here.

It's not just the best formal reference.  It's the official
specification.  I agree it is not so "formal" as other language
specifications and it does not enumerate every facet of the language.
However, underspecified parts are worth improving (as we've done with
the import system portion in the last few years).  Incidentally, the
efforts of other Python implementors have often resulted in such
improvements to the language reference.  Those improvements typically
come as a result of questions to this very list. :)  That's
essentially what this email thread is!

-eric
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing


Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't 
use iterators. They use indices, str.find and/or regular expressions. 
Common use case is quickly find substring starting from current position 
using str.find or re.search, process found token, advance position and 
repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.

Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing


Serhiy Storchaka wrote:
A language which doesn't support O(1) indexing is not Python, it is only 
Python-like language.


That's debatable, but even if it's true, I don't think
there's anything wrong with MicroPython being only a
"Python-like language". As has been pointed out, fitting
Python onto a small device is always going to necessitate
some compromises.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman


On 6/4/2014 5:03 PM, Greg Ewing wrote:

Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize 
don't use iterators. They use indices, str.find and/or regular 
expressions. Common use case is quickly find substring starting from 
current position using str.find or re.search, process found token, 
advance position and repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.



Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

This starts to diverge from Python codepoint indexing via integers. 
Calculating or caching the codepoint index to byte offset as part of the 
str implementation stays compatible with Python. Introducing 
StringPosition makes a Python-like language. Or so it seems to me.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman


On 6/4/2014 5:08 PM, Glenn Linderman wrote:

On 6/4/2014 5:03 PM, Greg Ewing wrote:

Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize 
don't use iterators. They use indices, str.find and/or regular 
expressions. Common use case is quickly find substring starting from 
current position using str.find or re.search, process found token, 
advance position and repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.



Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

This starts to diverge from Python codepoint indexing via integers. 
Calculating or caching the codepoint index to byte offset as part of 
the str implementation stays compatible with Python. Introducing 
StringPosition makes a Python-like language. Or so it seems to me.


Another thought is that StringPosition only works (quickly, at least), 
as you point out, for the string that they were derived from... so 
algorithms that walk two strings at a time cannot use the same 
StringPosition to do so... yep, this is quite divergent from CPython and 
Python.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing


Glenn Linderman wrote:



For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.


Probably, but what I said is true either way.


This starts to diverge from Python codepoint indexing via integers.


That's true, although most programs would have to go
out of their way to tell the difference, especially if
StringPosition were a subclass of int.

I agree that cacheing indexes would be more transparent,
though.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing


Glenn Linderman wrote:


so algorithms that walk two strings at a time cannot use the same 
StringPosition to do so... yep, this is quite divergent from CPython and 
Python.


They can, it's just that at most one of the indexing
operations would be fast; the StringPosition would
devolve into an int for the other one.

Such an algorithm would be of dubious correctness
anyway, since as you pointed out, codepoints and
characters are not quite the same thing. A codepoint
index in one string doesn't necessarily count off
the same number of characters in another string.
So to be safe, you should really walk each string
individually.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Thu, 05 Jun 2014 12:03:17 +1200
Greg Ewing  wrote:

> Serhiy Storchaka wrote:
> > html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> > don't use iterators. They use indices, str.find and/or regular
> > expressions. Common use case is quickly find substring starting
> > from current position using str.find or re.search, process found
> > token, advance position and repeat.
> 
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.
> 
> Instead of an integer, str.find() etc. could return a
> StringPosition, 

That's more brave then I had in mind, but definitely shows what
alternative implementation have in store to fight back if some
perfomance problems are actually detected. My own thoughts were, for
example, as response to people who (quoting) "slice strings for living"
is some form of "extended slicing" like str[(0, 4, 6, 8, 15)].

But I really think that providing iterator interface for common string
operations would cover most of real-world cases, and will be actually
beneficial for Python language in general.

> 
> -- 
> Greg

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico

On Thu, Jun 5, 2014 at 10:03 AM, Greg Ewing  wrote:
> StringPositions could support the following operations:
>
>StringPosition + int --> StringPosition
>StringPosition - int --> StringPosition
>StringPosition - StringPosition --> int
>
> These would be computed by counting characters forwards
> or backwards in the string, which would be slower than
> int arithmetic but still faster than counting from the
> beginning of the string every time.

The SP would have to keep track of which string it's associated with,
which might make for some surprising retentions of large strings.
(Imagine returning what you think is an integer, but actually turns
out to be a SP, and you're trying to work out why your program is
eating up so much more memory than it should. This int-like object is
so much more than an int.)

ChrisA
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky

Hello,

On Thu, 05 Jun 2014 12:08:21 +1200
Greg Ewing  wrote:

> Serhiy Storchaka wrote:
> > A language which doesn't support O(1) indexing is not Python, it is
> > only Python-like language.
> 
> That's debatable, but even if it's true, I don't think
> there's anything wrong with MicroPython being only a
> "Python-like language". As has been pointed out, fitting
> Python onto a small device is always going to necessitate
> some compromises.

Thanks. I mentioned in another mail that we exactly trying to develop a
minimalistic, but Python implementation, not Python-like language.

What is "Python-like" for me. The other most well-know, and mature (as
in "started quite some time ago") "small Python" implementation is
PyMite aka Python-on-a-chip
https://code.google.com/p/python-on-a-chip/ . It implements good deal
of Python2 language. It doesn't implement exception handling
(try/except). Can a Python be without exception handling? For me,
the clear answer is "no".

Please put that in perspective when alarming over O(1) indexing of
inherently problematic niche datatype. (Again, it's not my or
MicroPython's fault that it was forced as standard string type. Maybe
if CPython seriously considered now-standard UTF-8 encoding, results
of what is "str" type might be different. But CPython has gigabytes of
heap to spare, and for MicroPython, every half-bit is precious).

> 
> -- 
> Greg
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/pmiscml%40gmail.com

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy


On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:


"Well" is subjective (or should be defined formally based on the
requirements). With my MicroPython hat on, an implementation which
receives a string, transcodes it, leading to bigger size, just to
immediately transcode back and send out - is awful, environment
unfriendly implementation ;-).


I am not sure what you concretely mean by 'receive a string', but I 
think you are again batting at a strawman. If you mean 'read from a 
file', and all you want to do is read bytes from and write bytes to 
external 'files', then there is obviously no need to transcode and 
neither Python 2 or 3 make you do so.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy


On 6/4/2014 6:54 PM, Serhiy Storchaka wrote:

05.06.14 00:21, Terry Reedy написав(ла):

On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0x) characters.


Indexing can be made O(log(k)) where k is the number of astral chars,
and is usually small.


I like your idea and think it would be great if Jython will implement
it.


A proof of concept implementation in Python that handles both indexing 
and slicing is on the tracker. It is simpler than I initially expected.


> Unfortunately it is too late to do this in CPython.

I mentioned it as an alternative during the '393 discussion. I more than 
half agree that the FSR is the better choice for CPython, which had no 
particular attachment to UTF-16 in the way that I think Jython, for 
instance, does.


--
Terry Jan Reedy



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

76 matches

Mail list logo