Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Paul Rubin
Chris Angelico  writes:
> You can't do a look-ahead with a vanilla string iterator. That's
> necessary for a lot of parsers.

For JSON?  For other parsers you usually have a tokenizer that reads
characters with maybe 1 char of lookahead.

> Yes, which gives a two-level indexing (first find the strand, then the
> character), and that's going to play pretty badly with CPU caches.

If you're jumping around at random all over the string, you probably
really want a bytearray rather than a unicode string.  If you're
scanning sequentially you won't have to look at the outer table very
often.
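
(A toy sketch of that two-level scheme, purely to show the access
pattern -- not how any real interpreter lays out strings:)

import bisect

class Strands:
    # outer table of starting offsets, then a plain index into
    # the right strand
    def __init__(self, strands):
        self.strands = list(strands)
        self.starts = []
        pos = 0
        for s in self.strands:
            self.starts.append(pos)
            pos += len(s)

    def __getitem__(self, i):
        k = bisect.bisect_right(self.starts, i) - 1   # outer lookup
        return self.strands[k][i - self.starts[k]]    # inner index

A sequential scan keeps landing in the same strand, so the outer
lookup stays cache-friendly; random access pays for it every time.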
-- 
https://mail.python.org/mailman/listinfo/python-list


Let ipython3 use the latest python3

2017-01-21 Thread Cecil Westerhof
I built python3.6, but ipython3 is still using the old one (3.4.5).
How can I make ipython3 use 3.6?

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Let ipython3 use the latest python3

2017-01-21 Thread Chris Warrick
On 21 January 2017 at 12:30, Cecil Westerhof  wrote:
> I built python3.6, but ipython3 is still using the old one (3.4.5).
> How can I make ipython3 use 3.6?

All packages you have installed are tied to a specific Python version.
If you want to use IPython with Python 3.6, you need to install it for
that version (most likely, with pip) and make sure there is an
ipython3 executable in your $PATH pointing at 3.6. You don’t need to
remove IPython for 3.4 (but you can if you want to get rid of it).
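
For example, something along these lines should work (assuming your
new build is on $PATH as python3.6 and was built with ensurepip):

python3.6 -m ensurepip --user            # only if pip for 3.6 is missing
python3.6 -m pip install --user ipython
python3.6 -m IPython                     # runs IPython under 3.6

Using "python3.6 -m pip" rather than a bare pip command is the easiest
way to be sure you are installing into the right interpreter.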

-- 
Chris Warrick 
PGP: 5EAAEA16
-- 
https://mail.python.org/mailman/listinfo/python-list


Problems with python3.6 on one system, but OK on another

2017-01-21 Thread Cecil Westerhof
I built python3.6 on two systems. On one system everything is OK:
Python 3.6.0 (default, Jan 21 2017, 11:19:56) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.


But on another I get:
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Python 3.6.0 (default, Jan 21 2017, 12:20:38) 
[GCC 4.8.5] on linux
Type "help", "copyright", "credits" or "license" for more information.

Probably not a big problem, but just wondering what is happening here.
On both systems PYTHONHOME is not set and with the old version (3.4.5)
I did/do not get this message.


Another issue is that I used PYTHONSTARTUP to point to the following script:
# startup script for python to enable saving of interpreter history and
# enabling name completion

# import needed modules
import atexit
import os
import readline
import rlcompleter

# where is history saved
historyPath = os.path.expanduser("~/.pyhistory")

# handler for saving history
def save_history(historyPath=historyPath):
    import readline
    try:
        readline.write_history_file(historyPath)
    except:
        pass

# read history, if it exists
if os.path.exists(historyPath):
    readline.set_history_length(1)
    readline.read_history_file(historyPath)

# register saving handler
atexit.register(save_history)

# enable completion
readline.parse_and_bind('tab: complete')

# cleanup
del os, atexit, readline, rlcompleter, save_history, historyPath


This works with 3.4.5, but with 3.6 it gives:
Traceback (most recent call last):
  File "/etc/pythonstart", line 7, in 
import readline
ModuleNotFoundError: No module named 'readline'


Probably not a big problem because I will mostly work with ipython3
once I get it working with 3.6, but just wondering.

By the way, all other imports (including rlcompleter) do work in 3.6.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
On 2017-01-21 11:58, Chris Angelico wrote:
> So, how could you implement this function? The current
> implementation maintains an index - an integer position through the
> string. It repeatedly requests the next character as string[idx],
> and can also slice the string (to check for keywords like "true")
> or use a regex (to check for numbers). Everything's clean, but it's
> lots of indexing.

But in these parsing cases, the indexes all originate from stepping
through the string from the beginning and processing it
codepointwise.  Even this is a bit of an oddity, especially once you
start taking combining characters into consideration and need to
process them with the preceding character(s).  So while you may be
doing indexing, those indexes usually stem from having walked to that
point, not arbitrarily picking some offset.

You allude to it in your:

> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string
> parse operation would have to be reimplemented this way, lest it
> perform abysmally on large strings. It'd mean some sort of magic
> "thing" that probably has a reference to the original string, so
> you don't get the progressive RAM refunds that slicing gives, and
> you'd still have to deal with lots of the other consequences. It's
> probably doable, but it would be a lot of pain.

but I'm hard-pressed to come up with any use case where direct
indexing into a (non-byte)string makes sense unless you've already
processed/searched up to that point and can use a recorded index
from that processing/search.

Can you provide real-world examples of "I need character 2832 from
this string of unicode text, but I never had to scan to that point
linearly from the beginning/end of the string"?

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:

> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?

I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest that the
participants were aware of it as an alternative, and simply didn't think it
was worth considering. I don't know why.

You can read the PEP and the mailing list discussion here:

The PEP:

https://www.python.org/dev/peps/pep-0393/

Mailing list discussion starts here:

https://mail.python.org/pipermail/python-dev/2011-January/107641.html

Stefan Behnel (author of Cython) states that UTF-8 is much harder to use:

https://mail.python.org/pipermail/python-dev/2011-January/107739.html

I see nobody challenging that claim, so perhaps there was simply enough
broad agreement that UTF-8 would have been more work and so nobody wanted
to propose it. I'm just guessing though.

Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
implementations?

(I know that UTF-16 is actually variable-width, but Python prior to PEP 393
treated it as if it were fixed.)

There was a much earlier discussion about the internal implementation of
Unicode strings:

https://mail.python.org/pipermail/python-3000/2006-September/003795.html

including some discussion of UTF-8:

https://mail.python.org/pipermail/python-3000/2006-September/003816.html

It too proposed using a three-way internal implementation, and made it clear
that O(1) indexing was a requirement.

Here's a comment explicitly pointing out that constant-time indexing is
wanted, and that using UTF-8 with a two-level table destroys any space
advantage UTF-8 might have:

https://mail.python.org/pipermail/python-3000/2006-September/003822.html

Ironically, Martin v. Löwis, the author of PEP 393 originally started off
opposing a three-way internal representation, calling it "terrible":

https://mail.python.org/pipermail/python-3000/2006-September/003891.html

Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\uDC37'.encode('utf-8')
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed


but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.


[...]
> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. 

Slicing is not.

start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]

But even with iteration, we still would expect that indexes be consecutive:

for i, c in enumerate(text):
    assert c == text[i]


The complexity of those functions will be greatly increased with UTF-8. Of
course you can make it work, and you can even hide the fact that UTF-8 has
variable-width code points. But you can't have all three of:

- simplicity;
- memory efficiency;
- O(1) operations

with UTF-8.

But of course, I'd be happy for a competing Python implementation to use
UTF-8 and prove me wrong!




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:

> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.


Let's take a simple example: you do a find to get an offset, and then slice
from that offset.

py> text = "αβγдлфxx"
py> offset = text.find("ф")
py> stuff = text[offset:]
py> assert stuff == "фxx"


That works fine whether indexing refers to code points or bytes.

py> "αβγдлфxx".find("ф")
5
py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
10

Either way, you get the expected result. However:

py> stuff = text[offset + 1:]
py> assert stuff == "xx"


That requires indexes to point to the beginning of *code points*, not bytes:
taking byte 11 of "αβγдлфxx".encode('utf-8') drops you into the middle of
the ф representation:

py> "αβγдлфxx".encode('utf-8')[11:]
b'\x84xx'

and it isn't a valid UTF-8 substring. Slicing would generate an exception
unless you happened to slice right at the start of a code point.

It's like seek() and tell() on text files: you cannot seek to arbitrary
positions, but only to the opaque positions returned by tell. That's
unacceptable for strings.
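
To illustrate with a sketch (assuming a hypothetical UTF-8 text file):

f = open('sample.txt', encoding='utf-8')  # hypothetical file
f.read(3)
pos = f.tell()   # an opaque cookie, not a character count
f.seek(pos)      # fine: a value previously returned by tell()
f.seek(pos + 1)  # undefined behaviour, according to the io docs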

You could avoid that error by increasing the offset by the right amount:

stuff = text[offset + len("ф".encode('utf-8')):]

which is awful. I believe that's what Go and Julia expect you to do.

Another solution would be to have the string slicing method automatically
scan forward to the start of the next valid UTF-8 code point. That would be
the "Do What I Mean" solution.

The problem with the DWIM solution is that not only is it adding complexity,
but it's frankly *weird*. It would mean:


- if the character at position `offset` fits in 2 bytes:
  text[offset+1:] == text[offset+2:]

- if it fits in 3 bytes:
  text[offset+1:] == text[offset+2:] == text[offset+3:]

- and if it fits in 4 bytes:
  text[offset+1:] == text[offset+2:] == text[offset+3:] == text[offset+4:]


Having the string slicing method Do The Right Thing would actually be The
Wrong Thing. It would make it awful to reason about slicing.

You can avoid this by having the interpreter treat the Python-level indexes
as opaque "code point offsets", and converting them to and from "byte
offsets" as needed. That's not even very hard. But it either turns every
indexing into O(N) (since you have to walk the string to count which byte
represents the nth code point), or you have to keep an auxiliary table with
every string, letting you convert from byte indexes to code point indexes
quickly, but that will significantly increase the memory size of every
string, blowing out the advantage of using UTF-8 in the first place.
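
To make the trade-off concrete, here is a minimal sketch of such an
auxiliary table -- one entry per code point, which is exactly the
memory cost at issue:

def codepoint_starts(data):
    # byte offsets where code points begin in a UTF-8 buffer;
    # continuation bytes always have the bit pattern 0b10xxxxxx
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

data = "αβγдлфxx".encode('utf-8')
table = codepoint_starts(data)
assert data[table[5]:].decode('utf-8') == "фxx"  # O(1) with the table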



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Steve D'Aprano  writes:

> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
>   File "", line 1, in 
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
>
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.

Thanks for a very thorough reply, most useful. I'm going to pick you up
on the above, though.

Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
3629 (2003). There is CESU-8 if you really need a naive encoding of
UTF-16 to UTF-8-alike.

py> low = '\uDC37'

is only meaningful on narrow builds pre Python 3.3, where the user must
do extra work to correctly handle characters outside the BMP.

-- 
Pete Forman
https://payg-petef.rhcloud.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Steve D'Aprano writes:

[snip]

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

Julia provides a method to get the next index.

let text = "ἐπὶ οἴνοπα πόντον", offset = 1
    while offset <= endof(text)
        print(text[offset], ".")
        offset = nextind(text, offset)
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Chris Angelico
On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen
 wrote:
> Steve D'Aprano writes:
>
> [snip]
>
>> You could avoid that error by increasing the offset by the right
>> amount:
>>
>> stuff = text[offset + len("ф".encode('utf-8')):]
>>
>> which is awful. I believe that's what Go and Julia expect you to do.
>
> Julia provides a method to get the next index.
>
> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>     while offset <= endof(text)
>         print(text[offset], ".")
>         offset = nextind(text, offset)
>     end
>     println()
> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

This implies that regular iteration isn't good enough, though.

Here's a function that creates a numbered list:

def print_list(items):
    width = len(str(len(items)))
    for idx, item in enumerate(items, 1):
        print("%*d: %s" % (width, idx, item))

In Python, this will happily accept anything that is iterable and has
a known length. Could be a list or tuple, obviously, but can also just
as easily be a dict view (keys or items), a range object, or a
string. It's perfectly acceptable to enumerate the characters of a
string. And enumerate() itself is implemented entirely generically. If
you have to call nextind() to get the next character, you've made it
impossible to do any kind of generic operation on the text. You can't
do a windowed view by slicing while iterating, you can't have a "lag"
or "lead" value, you can't do any of those kinds of simple and obvious
index-based operations.
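
For instance, a "lag" is trivial when indexing is cheap (a sketch):

def lagged(text):
    # pair each character with its predecessor; relies on text[i - 1]
    # being O(1), which opaque position markers would take away
    for i, c in enumerate(text):
        yield (text[i - 1] if i else None), c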

Oh, and Python 3.3 wasn't the first programming language to use this
flexible string representation. Pike introduced an extremely similar
string representation back in 1998:

https://github.com/pikelang/Pike/commit/db4a4

So yes, UTF-8 has its advantages. But it also has its costs, and for a
text processing language like Pike or Python, they significantly
outweigh the benefits.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Jussi Piitulainen
Chris Angelico writes:

> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8')):]
>>>
>>> which is awful. I believe that's what Go and Julia expect you to do.
>>
>> Julia provides a method to get the next index.
>>
>> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>>     while offset <= endof(text)
>>         print(text[offset], ".")
>>         offset = nextind(text, offset)
>>     end
>>     println()
>> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
>
> This implies that regular iteration isn't good enough, though.

It doesn't. Here's the straightforward iteration over the whole string:

let text = "ἐπὶ οἴνοπα πόντον"
    for c in text
        print(c, ".")
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

One can also join any iterable whose elements can be converted to
strings, and characters can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(text, "."), ".")
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

And strings, trivially, can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(split(text), "."), ".")
end # prints: ἐπὶ.οἴνοπα.πόντον.

> Here's a function that creates a numbered list:
>
> def print_list(items):
>     width = len(str(len(items)))
>     for idx, item in enumerate(items, 1):
>         print("%*d: %s" % (width, idx, item))
>
> In Python, this will happily accept anything that is iterable and has
> a known length. Could be a list or tuple, obviously, but can also just
> as easily be a dict view (keys or items), a range object, or a
> string. It's perfectly acceptable to enumerate the characters of a
> string. And enumerate() itself is implemented entirely generically.

I'll skip the formatting - I don't know off-hand how to do it - but keep
the width calculation, and I cut the character iterator short at 10
items to save some space. There, it's much the same in Julia:

let text = "ἐπὶ οἴνοπα πόντον"

    function print_list(items)
        width = endof(string(length(items)))
        println("width = ", width)
        for (idx, item) in enumerate(items)
            println(idx, '\t', item)
        end
    end

    print_list(take(text, 10))
    print_list([text, text, text])
    print_list(split(text))
end

That prints this:

width = 2
1   ἐ
2   π
3   ὶ
4
5   ο
6   ἴ
7   ν
8   ο
9   π
10  α
width = 1
1   ἐπὶ οἴνοπα πόντον
2   ἐπὶ οἴνοπα πόντον
3   ἐπὶ οἴνοπα πόντον
width = 1
1   ἐπὶ
2   οἴνοπα
3   πόντον

> If you have to call nextind() to get the next character, you've made
> it impossible to do any kind of generic operation on the text. You
> can't do a windowed view by slicing while iterating, you can't have a
> "lag" or "lead" value, you can't do any of those kinds of simple and
> obvious index-based operations.

Yet Julia does with ease many things that you seem to think it cannot
possibly do at all. The iteration system works on types that have
methods for certain generic functions. For strings, the default is to
iterate over something like its characters; I think another iterator
over valid indexes is available, or wouldn't be hard to write; it could
be forward or backward, and in Julia many of these things are often
peekable by default (because the iteration protocol itself does not have
state - see below at "more magic").

The usual things work fine:

let text = "ἐπὶ οἴνοπα πόντον"
    foreach(print, enumerate(zip(text, split(text))))
end # prints: (1,('ἐ',"ἐπὶ"))(2,('π',"οἴνοπα"))(3,('ὶ',"πόντον"))

How is that bad?

More magic:

let text = "ἐπὶ οἴνοπα πόντον"
    let ever = cycle(split(text))
        println(first(ever))
        println(first(ever))
        for n in 2:6
            println(join(take(ever, n), " "))
        end
    end
end

This prints the following. The cycle iterator, ever, produces an
endless repetition of the three words, but it doesn't have state like
Python iterators do, so it's possible to look at the first word twice
(and then five more times).

ἐπὶ
ἐπὶ
ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον
ἐπὶ οἴνοπα πόντον ἐπὶ
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα πόντον

> Oh, and Python 3.3 wasn't the first programming language to use this
> flexible string representation. Pike introduced an extremely similar
> string representation back in 1998:
>
> https://github.com/pikelang/Pike/commit/db4a4

Ok. Is GitHub that old?

> So yes, UTF-8 has its advantages. But it also has its costs, and for a
> text processing language like Pike or Python, they significantly
> outweigh the benefits.

I process text in my work but I really don't use character indexes much
at all. Rather split, join, startswith, endswith, that kind of thing,
and whether a string contains some character or substring anywhere.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Marko Rauhamaa
Pete Forman :

> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.

Also, they don't exist as Unicode code points. Python shouldn't allow
surrogate characters in strings.

   Thus the range of code points that are available for use as
   characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
   points).

   https://en.wikipedia.org/wiki/Unicode


   The Unicode Character Database is basically a table of characters
   indexed using integers called ’code points’. Valid code points are in
   the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
   which is about 1.1 million code points.

   https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html

Guile does the right thing:

   scheme@(guile-user)> #\xd7ff
   $1 = #\153777
   scheme@(guile-user)> #\xe000
   $2 = #\160000
   scheme@(guile-user)> #\xd812
   While reading expression:
   ERROR: In procedure scm_lreadr: #:5:8: out-of-range hex
   character escape: xd812

> py> low = '\uDC37'

That should raise a SyntaxError exception.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Pete Forman
Marko Rauhamaa  writes:

>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.

Quite. My point was that with older Python on a narrow build (Windows
and Mac) you need to understand that you are using UTF-16 rather than
Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
I'm tempted to put in a winky emoji but that might push the internal
representation into UCS-4.)

-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Adding colormaps?

2017-01-21 Thread Martin Schöön
Den 2017-01-21 skrev Gilmeh Serda :
> On Wed, 18 Jan 2017 21:41:34 +0000, Martin Schöön wrote:
>
>> What I would like to do is to add the perceptually uniform sequential
>> colormaps introduced in version 1.5.something. I would like to do this
>> without breaking my Debian system in which Matplotlib version 1.4.2 is
>> the newest version available in the repo.
>
> Haven't checked, but I assume you can get the source. Compile it but 
> don't install it and then use the result in virtualenv, maybe?
>
I am hoping for directions to a config file to download and place 
somewhere...

/Martin
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread eryk sun
On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman  wrote:
> Marko Rauhamaa  writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
> I'm tempted to put in a winky emoji but that might push the internal
> representation into UCS-4.)

CPython allows surrogate codes for use with the "surrogateescape" and
"surrogatepass" error handlers, which are used for POSIX and Windows
file-system encoding, respectively. Maybe MicroPython goes about the
file-system round-trip problem differently, or maybe it just require
using bytes for file-system and environment-variable names on POSIX
and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:

>>> b'\x81'.decode('ascii', 'surrogateescape')
'\udc81'
>>> '\udc81'.encode('ascii', 'surrogateescape')
b'\x81'

This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:

>>> sys.getfilesystemencoding()
'ascii'
>>> os.listdir(b'.')
[b'\x81']
>>> os.listdir('.')
['\udc81']

"surrogatepass" allows encoding surrogates:

>>> '\udc81'.encode('utf-8', 'surrogatepass')
b'\xed\xb2\x81'
>>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
'\udc81'

This error handler is used by CPython 3.6+ to encode Windows UCS-2
file-system paths as WTF-8 (Wobbly). For example:

>>> os.listdir('.')
['\udc81']
>>> os.listdir(b'.')
[b'\xed\xb2\x81']
-- 
https://mail.python.org/mailman/listinfo/python-list


How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Grant Edwards
Given a Unix file descriptor for an open TCP socket, I can't figure
out how to create a python 2.7 socket object like those returned by
socket.socket() 

Based on the docs, one might think that socket.fromfd() would do that
(since the docs say that's what it does):

Quoting https://docs.python.org/2/library/socket.html

 the socket() function returns a socket object

 [...]  
 
 socket.socket([family[, type[, proto]]])
Create a new socket using the given address family[...]
 
 [...]

 socket.fromfd(fd, family, type[, proto])
Duplicate the file descriptor fd (an integer as returned by a file
object’s fileno() method) and build a socket object from the
result. 

But it doesn't work as described -- the docs apparently use the term
"socket object" to refer to two different things.

IOW, a "socket object" is a different type than a "socket object".

Here's what a "socket object" returned by socket.socket() looks like:

  repr(osock):
  <socket._socketobject object at 0x...>
  
  dir(osock):
  ['__class__', '__delattr__', '__doc__', '__format__',
  '__getattribute__', '__hash__', '__init__', '__module__', '__new__',
  '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
  '__sizeof__', '__slots__', '__str__', '__subclasshook__',
  '__weakref__', '_sock', 'accept', 'bind', 'close', 'connect',
  'connect_ex', 'dup', 'family', 'fileno', 'getpeername', 'getsockname',
  'getsockopt', 'gettimeout', 'listen', 'makefile', 'proto', 'recv',
  'recv_into', 'recvfrom', 'recvfrom_into', 'send', 'sendall', 'sendto',
  'setblocking', 'setsockopt', 'settimeout', 'shutdown', 'type']

[If asked, I would call that a "socket._socketobject object".]

Here's what a "socket object" returned from socket.fromfd() object looks like:

  repr(sock):
  <socket object, fd=6, family=2, type=1, protocol=0>
  
  dir(sock):
  ['__class__', '__delattr__', '__doc__', '__format__',
  '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__',
  '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
  '__subclasshook__', 'accept', 'bind', 'close', 'connect',
  'connect_ex', 'dup', 'family', 'fileno', 'getpeername', 'getsockname',
  'getsockopt', 'gettimeout', 'listen', 'makefile', 'proto', 'recv',
  'recv_into', 'recvfrom', 'recvfrom_into', 'send', 'sendall', 'sendto',
  'setblocking', 'setsockopt', 'settimeout', 'shutdown', 'timeout',
  'type']

They're different types.

[That one I would call a "socket object".]

Note that the socket.socket() object has a '_sock' attribute, which
appears to contain an object like that returned by socket.fromfd():

  repr(osock._sock):
  <socket object, fd=..., family=2, type=1, protocol=0>

That prompts a question: given a "socket object" as returned by
socket.fromfd(), how does one create a "socket._socketobject object"
as returned by socket.socket()? 

It also makes one wonder why socket.fromfd() doesn't return the same
type of object as socket.socket().  In what context would you want the
type of object returned from socket.fromfd() instead of that returned
by socket.socket()?

[An ssl context will wrap the latter, but not the former, in case
you're wondering why I care about the difference.]

Also, I passed socket.fromfd fd==5, and the resulting "socket object"
is using fd==6. Why does socket.fromfd() duplicate the fd?  If I
wanted the file descriptor duplicated, I would have done it myself!
[Yes, I know the documentation _says_ it will duplicate the fd, I just
don't understand why, and I think it a bad idea.]

-- 
Grant

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Chris Angelico
On Sun, Jan 22, 2017 at 9:28 AM, Grant Edwards
 wrote:
> Given a Unix file discriptor for an open TCP socket, I can't figure
> out how to create a python 2.7 socket object like those returned by
> socket.socket()

I suspect you can't easily do it. In more recent Pythons, you can
socket.socket(fileno=N), but that doesn't exist in 2.7. But maybe.

> Here's what a "socket object" returned by socket.socket() looks like:
>
>   repr(osock):
>   <socket._socketobject object at 0x...>
>
> Here's what a "socket object" returned from socket.fromfd() object looks like:
>
>   repr(sock):
>   <socket object, fd=6, family=2, type=1, protocol=0>
>
> Note that the socket.socket() object has a '_sock' attribute, which
> appears to contain an object like that returned by socket.fromfd():
>
>   repr(osock._sock):
>   <socket object, fd=..., family=2, type=1, protocol=0>

... maybe you could construct a socket.socket(), then monkeypatch its _sock??

> [An ssl context will wrap the latter, but not the former, in case
> you're wondering why I care about the difference.]
>
> Also, I passed socket.fromfd fd==5, and the resulting "socket object"
> is using fd==6. Why does socket.fromfd() duplicate the fd?  If I
> wanted the file descriptor duplicated, I would have done it myself!
> [Yes, I know the documentation _says_ it will duplicate the fd, I just
> don't understand why, and I think it a bad idea.]

Agreed. It's much cleaner in Py3, but I don't know of a backport.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Grant Edwards
On 2017-01-21, Grant Edwards  wrote:

> Given a Unix file descriptor for an open TCP socket, I can't figure
> out how to create a python 2.7 socket object like those returned by
> socket.socket()
>
> Based on the docs, one might think that socket.fromfd() would do that
> (since the docs say that's what it does):
[...]
> That prompts a question: given a "socket object" as returned by
> socket.fromfd(), how does one create a "socket._socketobject object"
> as returned by socket.socket()?

Of course I figured it out immediately after spending 15 minutes
whinging about it.

help(socket.socket) gives you a hint:

class _socketobject(__builtin__.object)
 |  socket([family[, type[, proto]]]) -> socket object
 |  
[...] 
 |  
 |  Methods defined here:
 |  
 |  __init__(self, family=2, type=1, proto=0, _sock=None)
 |  

Ah! There's a keyword argument that doesn't appear in the docs, so
let's try that...

  context = ssl.create_default_context()
  [...]  
  sock = socket.fromfd(fd2,socket.AF_INET,socket.SOCK_STREAM)
  nsock = socket.socket(socket.AF_INET,socket.SOCK_STREAM,_sock=sock)
  conn = context.wrap_socket(nsock, server_hostname="whatever.invalid")

That works.

-- 
Grant






-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Chris Angelico
On Sun, Jan 22, 2017 at 9:41 AM, Grant Edwards
 wrote:
>  |  __init__(self, family=2, type=1, proto=0, _sock=None)
>  |
>
> Ah! There's a keyword argument that doesn't appear in the docs, so
> let's try that...

That's marginally better than my monkeypatch-after-creation
suggestion, but still broadly the same. Your code may well break in
other Python implementations, but within CPython 2.7, you should be
pretty safe.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Christian Heimes
On 2017-01-21 23:41, Grant Edwards wrote:
> On 2017-01-21, Grant Edwards  wrote:
> 
>> Given a Unix file descriptor for an open TCP socket, I can't figure
>> out how to create a python 2.7 socket object like those returned by
>> socket.socket()
>>
>> Based on the docs, one might think that socket.fromfd() would do that
>> (since the docs say that's what it does):
> [...]
>> That prompts a question: given a "socket object" as returned by
>> socket.fromfd(), how does one create a "socket._socketobject object"
>> as returned by socket.socket()?
> 
> Of course I figured it out immediately after spending 15 minutes
> whinging about it.
> 
> help(socket.socket) gives you a hint:
> 
> class _socketobject(__builtin__.object)
>  |  socket([family[, type[, proto]]]) -> socket object
>  |  
> [...] 
>  |  
>  |  Methods defined here:
>  |  
>  |  __init__(self, family=2, type=1, proto=0, _sock=None)
>  |  
> 
> Ah! There's a keyword argument that doesn't appear in the docs, so
> let's try that...
> 
>   context = ssl.create_default_context()
>   [...]  
>   sock = socket.fromfd(fd2,socket.AF_INET,socket.SOCK_STREAM)
>   nsock = socket.socket(socket.AF_INET,socket.SOCK_STREAM,_sock=sock)
>   conn = context.wrap_socket(nsock, server_hostname="whatever.invalid")
> 
> That works.

You might be interested in my small module
https://pypi.python.org/pypi/socketfromfd/ . I just released a new
version with a fix for Python 2. Thanks for the hint! :)

The module correctly detects address family, socket type and proto from
a fd. It works correctly with e.g. IPv6 or Unix sockets. Ticket
https://bugs.python.org/issue28134 has additional background information
on the matter.

Christian

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Peter Otten
Grant Edwards wrote:

> Given a Unix file descriptor for an open TCP socket, I can't figure
> out how to create a python 2.7 socket object like those returned by
> socket.socket()
> 
> Based on the docs, one might think that socket.fromfd() would do that
> (since the docs say that's what it does):

[...]

> Note that the socket.socket() object has a '_sock' attribute, which
> appears to contain an object like that returned by socket.fromfd():
> 
>   repr(osock._sock):
>   
> 
> That prompts a question: given a "socket object" as returned by
> socket.fromfd(), how does one create a "socket._socketobject object"
> as returned by socket.socket()?

The socket.py source reveals that you can do it like this:

mysocket = socket.socket(_sock=socket.fromfd(...))

The code has

from _socket import socket  # actually a star import
...
_realsocket = socket
...
socket = _socketobject

which looks very much like an ad-hoc hack gone permanent.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Grant Edwards
On 2017-01-21, Chris Angelico  wrote:
> On Sun, Jan 22, 2017 at 9:41 AM, Grant Edwards
> wrote:
>>  |  __init__(self, family=2, type=1, proto=0, _sock=None)
>>  |
>>
>> Ah! There's a keyword argument that doesn't appear in the docs, so
>> let's try that...
>
> That's marginally better than my monkeypatch-after-creation
> suggestion, but still broadly the same. Your code may well break in
> other Python implementations, but within CPython 2.7, you should be
> pretty safe.

For those of you still paying attention...

It looks like CPython 3.4 socket.fromfd() does return the same type of
object as socket.socket().  And, using the fileno= argument to
socket.socket() avoids duplicating the descriptor.

Looks like somebody with a time machine read my original post...

-- 
Grant


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Grant Edwards
On 2017-01-21, Christian Heimes  wrote:

> You might be interested in my small module
> https://pypi.python.org/pypi/socketfromfd/ . I just releases a new
> version with a fix for Python 2. Thanks for the hint! :)
>
> The module correctly detects address family, socket type and proto from
> a fd. It works correctly with e.g. IPv6 or Unix sockets. Ticket
> https://bugs.python.org/issue28134 has additional background information
> on the matter.

Yes, thanks!

Just a few minutes ago I stumbled across that issue.  For Python3, I
was using:

  sock = socket.socket(fileno=fd)

But as you point out in that issue, the Python3 docs are wrong: when
using socket.socket(fileno=fd) you _do_ have to specify the correct
family and type parameters that correspond to the socket file
descriptor. So, I started looking for os.getsockopt(), which doesn't
exist.

I see you use ctypes to call getsockopt (that was going to be my next
step).

I suspect the code I'm working on will end up being re-written in C for
the real product (so that it can run in-process in a thread rather
than as an external helper process).  If not, I'll have to use your
module (or something like it) so that the solution will work on both
IPv4 and IPv6 TCP sockets (I'd also like it to work with Unix domain
sockets, but the piece at the other end of the socket connection
currently only supports TCP).

-- 
Grant




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to create a socket.socket() object from a socket fd?

2017-01-21 Thread Grant Edwards

I'm still baffled why the standard library fromfd() code dup()s the
descriptor.

According to the comment in the CPython sources, the author of
fromfd() is guessing that the user wants to be able to close the
descriptor separately from the socket.

If the user wanted the socket object to use a duplicate descriptor for
some reason, the caller should call os.dup() -- it's only _eight_
keystrokes.  Eight keystrokes that makes it obvious to anybody reading
the code that there are now two descriptors and you have to close both
the original descriptor and the socket.
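
Concretely (a sketch; assume fd is an already-open TCP socket
descriptor):

import os, socket

sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)  # dups fd
# Two descriptors now refer to the same connection...
sock.close()   # ...so closing the socket object is not enough:
os.close(fd)   # the original descriptor must be closed separately.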

When you create a Python file object from a file descriptor using
os.fdopen(), does it dup the descriptor?  No.  Would a reasonable
person expect socket.fromfd() to duplicate the descriptor?  No.

Should it?

No.

I know... that particular mistake is set in stone now, and it's not
going to change.  But I feel better.  :)


 $ python
 Python 2.7.12 (default, Dec  6 2016, 23:41:51) 
 [GCC 4.9.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import this
 The Zen of Python, by Tim Peters
 
 Beautiful is better than ugly.
 Explicit is better than implicit.
 Simple is better than complex.
 Complex is better than complicated.
 Flat is better than nested.
 Sparse is better than dense.
 Readability counts.
 Special cases aren't special enough to break the rules.
 Although practicality beats purity.
 Errors should never pass silently.
 Unless explicitly silenced.
 In the face of ambiguity, refuse the temptation to guess.
 There should be one-- and preferably only one --obvious way to do it.
 Although that way may not be obvious at first unless you're Dutch.
 Now is better than never.
 Although never is often better than *right* now.
 If the implementation is hard to explain, it's a bad idea.
 If the implementation is easy to explain, it may be a good idea.
 Namespaces are one honking great idea -- let's do more of those!



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Matt Ruffalo
On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
>
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003). There is CESU-8 if you really need a naive encoding of
> UTF-16 to UTF-8-alike.
>
> py> low = '\uDC37'
>
> is only meaningful on narrow builds pre Python 3.3 where the user must
> do extra to correctly handle characters outside the BMP.

Hi Pete-

Lone surrogate characters have a standardized use in Python, not just in
narrow builds of Python <= 3.2. Unpaired high surrogate characters are
used to store any bytes that couldn't be decoded with a given character
encoding scheme, for use in OS/filesystem interfaces that use arbitrary
byte strings:

"""
Python 3.6.0 (default, Dec 23 2016, 08:25:24)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'héllo'
>>> b = s.encode('latin-1')
>>> b
b'h\xe9llo'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(b)
>>> decoded
'h\udce9llo'
>>> fsencode(decoded)
b'h\xe9llo'
"""

This provides a mechanism for lossless round-trip decoding and encoding
of arbitrary byte strings which aren't valid under the user's locale.
This is absolutely necessary in POSIX systems in which filenames can
contain any sequence of bytes despite the user's locale, and is even
necessary in Windows, where filenames are stored as opaque
not-quite-UCS2 strings:

"""
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64
bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pathlib import Path
>>> import os
>>> os.chdir(Path('~/Desktop').expanduser())
>>> filename = '\udcf9'
>>> with open(filename, 'w'): pass

>>> os.listdir('.')
['desktop.ini', '\udcf9']
"""

MMR...
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Tim Chase
On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
> 
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded index
> > from that processing/search.
> 
> 
> Let's take a simple example: you do a find to get an offset, and
> then slice from that offset.
> 
> py> text = "αβγдлфxx"
> py> offset = text.find("ф")

Right, so here, you've done a (likely linear, but however you get
here) search, which then makes sense to use this opaque "offset"
token for slicing purposes:

> py> stuff = text[offset:]
> py> assert stuff == "фxx"

> That works fine whether indexing refers to code points or bytes.
> 
> py> "αβγдлфxx".find("ф")
> 5
> py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
> 10
> 
> Either way, you get the expected result. However:
> 
> py> stuff = text[offset + 1:]
> py> assert stuff == "xx"
>
> That requires indexes to point to the beginning of *code points*,
> not bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you
> into the middle of the ф representation:
> 
> py> "αβγдлфxx".encode('utf-8')[11:]
> b'\x84xx'
> 
> and it isn't a valid UTF-8 substring. Slicing would generate an
> exception unless you happened to slice right at the start of a code
> point.

Right.  It gets even weirder (edge-case'ier) when dealing with
combining characters:


>>> s = "man\N{COMBINING TILDE}ana"
>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
... 
0: m
1: a
2: n
3: ˜
4: a
5: n
6: a
>>> ''.join(reversed(s))
'anãnam'

Offsetting s[3:] produces a (sub)string that begins with a combining
character that doesn't have anything preceding it to combine with.

> It's like seek() and tell() on text files: you cannot seek to
> arbitrary positions, but only to the opaque positions returned by
> tell. That's unacceptable for strings.

I'm still unclear on *why* this would be considered unacceptable for
strings.  It makes sense when dealing with byte-strings, since they
contain binary data that may need to get sliced at arbitrary
offsets.  But for strings, slicing only makes sense (for every
use-case I've been able to come up with) in the context of known
offsets like you describe with tell().  The cost of not using opaque
tell()-like offsets is, as you describe, slicing in the middle of
characters.

> You could avoid that error by increasing the offset by the right
> amount:
> 
> stuff = text[offset + len("ф".encode('utf-8')):]
> 
> which is awful. I believe that's what Go and Julia expect you to do.

It may be awful, but only because it hasn't been pythonified.  If the
result from calling .find() on a string were a "StringOffset"
object, then it would make sense that its __add__/__radd__ methods
would accept an integer and do such translation for you.

> You can avoid this by having the interpreter treat the Python-level
> indexes as opaque "code point offsets", and converting them to and
> from "byte offsets" as needed. That's not even very hard. But it
> either turns every indexing into O(N) (since you have to walk the
> string to count which byte represents the nth code point)

The O(N) cost has to be paid at some point, but I'd put forth that
other operations like .find() already pay that O(N) cost and can
return an opaque "offset token" that can be subsequently used for O(1)
indexing (multiple times if needed).
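
Such an offset token could stay quite small.  A sketch of the idea
(a hypothetical class, not an actual proposal):

class StringOffset:
    # remembers a byte position in a UTF-8 buffer; adding an integer
    # walks forward that many code points from the remembered position
    def __init__(self, data, byte_pos=0):
        self.data = data          # the underlying UTF-8 bytes
        self.byte_pos = byte_pos

    def __add__(self, n):
        pos = self.byte_pos
        for _ in range(n):
            pos += 1
            # skip continuation bytes (bit pattern 0b10xxxxxx)
            while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
                pos += 1
        return StringOffset(self.data, pos)

so the walk is bounded by the distance added, not by the length of
the string.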

-tkc












-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:

> Pete Forman :
> 
>> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
>> and UTF-32.
> 
> Also, they don't exist as Unicode code points. Python shouldn't allow
> surrogate characters in strings.

Not quite. This is where it gets a bit messy and confusing. The bottom line
is: surrogates *are* code points, but they aren't *characters*. Strings
which contain surrogates are strictly speaking illegal, although some
programming languages (including Python) allow them.

The Unicode standard defines surrogates as follows:

http://www.unicode.org/glossary/

- Surrogate Character. A misnomer. It would be an encoded character 
  having a surrogate code point, which is impossible. Do not use 
  this term.

- Surrogate Code Point. A Unicode code point in the range 
  U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of 
  surrogate code units (a high surrogate followed by a low surrogate)
  “stand in” for a supplementary code point.

- Surrogate Pair. A representation for a single abstract character
  that consists of a sequence of two 16-bit code units, where the
  first value of the pair is a high-surrogate code unit, and the
  second is a low-surrogate code unit. (See definition D75 in 
  Section 3.8, Surrogates.)


http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2630


So far so good, this is clear: surrogates are not characters. Surrogate
pairs are only used by UTF-16 (since that's the only UTF which uses 16-bit
code units).

Suppose you read two code units (four bytes) in UTF-16 (big endian):

b'\xd8\x02\xdd\x00'

That could be ambiguous, as it could mean:

- a single code point, U+10900 PHOENICIAN LETTER ALF, encoded as
  the surrogate pair, U+D802 U+DD00;

- two surrogate code points, U+D802 followed by U+DD00.

UTF-16 definitely rejects the second alternative and categorically makes
only the first valid. To ensure that is the only valid interpretation, the
second is explicitly disallowed: conforming Unicode strings are not allowed
to include surrogate code points, regardless of whether you are intending
to encode them in UTF-32 (where there is no ambiguity) or UTF-16 (where
there is). Only UTF-16 encoded bytes are allowed to include surrogate code
*units*, and only in pairs.
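
You can see the first interpretation directly in Python ('utf-16-be'
here simply pins down the byte order):

py> '\U00010900'.encode('utf-16-be')
b'\xd8\x02\xdd\x00'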

However, Python (and other languages?) take a less strict approach. In
Python 3.3 and better, the code point U+10900 (Phoenician Alf) is encoded
using a single four-byte code unit: 0x00010900, so there's no ambiguity.
That allows Python to encode surrogates as double-byte code units with no
ambiguity: the interpreter can distinguish the single 32-bit number
0x00010900 from the pair of 16-bit numbers 0xD802 0xDD00, and treat the
first as Alf and the second as two surrogates.

By the letter of the Unicode standard, it should not do this, but
nevertheless it does and it appears to do no real harm and have some
benefit.


The glossary also says:


- High-Surrogate Code Point. A Unicode code point in the range 
  U+D800 to U+DBFF. (See definition D71 in Section 3.8, Surrogates.)

- High-Surrogate Code Unit. A 16-bit code unit in the range D800_16 
  to DBFF_16, used in UTF-16 as the leading code unit of a surrogate
  pair. Also known as a leading surrogate. (See definition D72 in 
  Section 3.8, Surrogates.)

- Low-Surrogate Code Point. A Unicode code point in the range 
  U+DC00 to U+DFFF. (See definition D73 in Section 3.8, Surrogates.)

- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00_16
  to DFFF_16, used in UTF-16 as the trailing code unit of a surrogate
  pair. Also known as a trailing surrogate. (See definition D74 in
  Section 3.8, Surrogates.)


So we can certainly talk about surrogates being code points: the code points
U+D800 through U+DFFF inclusive are surrogate code points, but not
characters.

They're not "non-characters" either. Unicode includes exactly 66 code points
formally defined as "non-characters":

- Noncharacter. A code point that is permanently reserved for internal
  use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where
  n is from 0 to 10₁₆), and the values U+FDD0..U+FDEF. See the FAQ on
  Private-Use Characters, Noncharacters and Sentinels.

http://www.unicode.org/faq/private_use.html#noncharacters

So even though noncharacters (with or without the hyphen) are code points
reserved for internal use, and surrogates are code points reserved for the
internal use of the UTF-16 encoder, and surrogates are not characters,
surrogates are not noncharacters.

Naming things is hard.


>Thus the range of code points that are available for use as
>characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
>points).
> 
>https://en.wikipedia.org/wiki/Unicode

That is correct for a strictly conforming Unicode implementation.


>> py> low = '\uDC37'
> 
> That should raise a SyntaxError exception.

If Python was strictly conforming, that is correct, but it turns out there
are some useful things you can do with them.

Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steve D'Aprano
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:

> Marko Rauhamaa  writes:
> 
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
> 
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode.

But you're *not* using UTF-16, at least not proper UTF-16, in older narrow
builds. If you were, then Unicode strings u'...' containing surrogate pairs
would be treated as single supplementary code points, but they aren't.

unichr() doesn't support supplementary code points in narrow builds:

[steve@ando ~]$ python2.7 -c "print len(unichr(0x10900))"
Traceback (most recent call last):
  File "", line 1, in 
ValueError: unichr() arg not in range(0x1) (narrow Python build)


and even if you sneak a supplementary code point in, it is treated wrongly:

[steve@ando ~]$ python2.7 -c "print len(u'\U00010900')"
2


So Python narrow builds are more like a bastard hybrid of UCS-2 and UTF-16.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: PEP 393 vs UTF-8 Everywhere

2017-01-21 Thread Steven D'Aprano
On Sunday 22 January 2017 06:58, Tim Chase wrote:

> Right.  It gets even weirder (edge-case'ier) when dealing with
> combining characters:
> 
> 
 s = "man\N{COMBINING TILDE}ana"
 for i, c in enumerate(s): print("%i: %s" % (i, c))
> ...
> 0: m
> 1: a
> 2: n
> 3: ˜
> 4: a
> 5: n
> 6: a
> >>> ''.join(reversed(s))
> 'anãnam'
> 
> Offsetting s[3:] produces a (sub)string that begins with a combining
> character that doesn't have anything preceding it to combine with.

That doesn't matter. Unicode is a universal character set, not a universal 
*grapheme* set. But even speaking about characters is misleading: Unicode's 
"characters" (note the scare quotes) are abstract code points which can 
represent at least:

- letters of alphabets
- digits
- punctuation marks
- ideographs
- line drawing symbols
- emoji
- noncharacters

Since it doesn't promise to only provide graphemes (I can write "$\N{COMBINING 
TILDE}" which is not a valid grapheme in any human language) it doesn't matter 
if you end up with lone combining characters. Or rather, it does matter, but 
fixing that is not Unicode's responsibility. That should become a layer built 
on top of Unicode.


>> It's like seek() and tell() on text files: you cannot seek to
>> arbitrary positions, but only to the opaque positions returned by
>> tell. That's unacceptable for strings.
> 
> I'm still unclear on *why* this would be considered unacceptable for
> strings.

Sometimes you want to slice at a particular index which is *not* an opaque 
position returned by find().

text[offset + 1:]

Or for that matter:

middle_character = text[len(text)//2]


Forbidding those sorts of operations is simply too big a break with previous
versions.


> It makes sense when dealing with byte-strings, since they
> contain binary data that may need to get sliced at arbitrary
> offsets.  But for strings, slicing only makes sense (for every
> use-case I've been able to come up with) in the context of known
> offsets like you describe with tell().

I'm sorry, I find it hard to believe that you've never needed to add or 
subtract 1 from a given offset returned by find() or equivalent.


> The cost of not using opaque
> tell()like offsets is, as you describe, slicing in the middle of
> characters.

>> You could avoid that error by increasing the offset by the right
>> amount:
>> 
>> stuff = text[offset + len("ф".encode('utf-8')):]
>> 
>> which is awful. I believe that's what Go and Julia expect you to do.
> 
> It may be awful, but only because it hasn't been pythonified.

No, it's awful no matter what. It makes it painful to reason about which code 
points will be picked up by a slice. What's the length of...?

text[offset:offset+5]

In current Python, that's got to be five code points (excluding the edge cases 
of slicing past the end of the string). But with opaque indexes, that could be 
anything from 1 to 5 code points.


> If the
> result from calling .find() on a string were a "StringOffset"
> object, then it would make sense that its __add__/__radd__ methods
> would accept an integer and do such translation for you.

At cost of predictability.


>> You can avoid this by having the interpreter treat the Python-level
>> indexes as opaque "code point offsets", and converting them to and
>> from "byte offsets" as needed. That's not even very hard. But it
>> either turns every indexing into O(N) (since you have to walk the
>> string to count which byte represents the nth code point)
> 
> The O(N) cost has to be paid at some point, but I'd put forth that
> other operations like .find() already pay that O(N) cost and can
> return an opaque "offset token" that can be subsequently used for O(1)
> indexing (multiple times if needed).

Sure -- but only at the cost of blowing out the complexity and memory 
requirements of the string, which completely negates the point in using UTF-8 
in the first place.



-- 
Steven
"Ever since I learned about confirmation bias, I've been seeing 
it everywhere." - Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list