Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>If all you're interested in is the lexical class of the code points
>>in a string, you could use such a codec to map each code point
>>to a code point representing the lexical class.
>
>
> How can I efficiently implement such a codec? The whole p
On May 10, 2005, at 7:34 PM, James Y Knight wrote:
> If you're going to call python's implementation UTF-16, I'd consider
> all these very serious deficiencies:
The --enable-unicode option declares a character encoding form (CEF),
not a character encoding scheme (CES). It is unfortunate that U
On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:
> On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>
>
>>> Wow, what an inane way of looking at it. I don't know what world you
>>> live in, but in my world, users read the configure options and suppose
>>> that they mean somethin
Nicholas Bastin wrote:
> I'm perfectly happy to continue supporting --enable-unicode=ucs2, but
> not displaying it as an option. Is that acceptable to you?
It is. Somewhere, the code should say that this is for backwards
compatibility, of course (so people won't remove it too easily;
if there is
M.-A. Lemburg wrote:
> If all you're interested in is the lexical class of the code points
> in a string, you could use such a codec to map each code point
> to a code point representing the lexical class.
How can I efficiently implement such a codec? The whole point is doing
that in pure Python (
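
The idea quoted above, mapping each code point to a code point that stands for its lexical class, can be sketched in pure Python with a translation table. This is only an illustration written for Python 3: the class codes and the names `lexical_class`, `ClassTable` and `classify` are hypothetical, and the categories used are coarse Unicode general categories, not the XML lexical classes the thread is about.

```python
import unicodedata

# Hypothetical one-character class codes; the thread does not define any.
LETTER, DIGIT, SPACE, OTHER = "L", "D", "S", "O"


def lexical_class(ch):
    """Return a one-character class code for a single code point."""
    if ch.isspace():
        return SPACE
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return LETTER
    if category == "Nd":
        return DIGIT
    return OTHER


class ClassTable(dict):
    """Translation table keyed by ordinal, filled lazily on first lookup."""

    def __missing__(self, ordinal):
        code = lexical_class(chr(ordinal))
        self[ordinal] = code  # cache so later lookups are plain dict hits
        return code


_TABLE = ClassTable()


def classify(text):
    # str.translate looks each ordinal up via __getitem__, so __missing__
    # computes and caches the class the first time a code point is seen.
    return text.translate(_TABLE)
```

With this table, `classify("ab 12!")` gives one class code per input code point, so a tokenizer could run a simple regular expression over the class string instead of over the raw text.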
On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>> Wow, what an inane way of looking at it. I don't know what world you
>> live in, but in my world, users read the configure options and suppose
>> that they mean something. In fact, they *have* to go off on their own
>> to assume something,
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>On sre character classes: I don't think that these provide
>>a good approach to XML lexical classes - custom functions
>>or methods or maybe even a codec mapping the characters
>>to their XML lexical class are much more efficient in
>>practice.
>
M.-A. Lemburg wrote:
> On sre character classes: I don't think that these provide
> a good approach to XML lexical classes - custom functions
> or methods or maybe even a codec mapping the characters
> to their XML lexical class are much more efficient in
> practice.
That isn't my experience: func
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>Unicode has many code points that are meant only for composition
>>and don't have any standalone meaning, e.g. a combining acute
>>accent (U+0301), yet they are perfectly valid code points -
>>regardless of UCS-2 or UCS-4. It is easily possible to
Nicholas Bastin wrote:
>> Again, patches are welcome. I was opposed to Nick's proposed changes,
>> since they explicitly said that you are not supposed to know what
>> is in a Py_UNICODE. Integrating the essence of PEP 261 into the
>> main documentation would be a worthwhile task.
>
>
> You can't
Nicholas Bastin wrote:
> It's not always 2 bytes on Windows. Users can alter the config options
> (and not unreasonably so, btw, on 64-bit windows platforms).
Did you try that? I'm not sure it even builds when you do so, but if it
does, you will lose the "mbcs" codec, and the ability to use Unico
Nicholas Bastin wrote:
>> Changing the documentation that goes along with the option
>> would be fine.
>
>
> That is exactly what I proposed originally, which you shot down. Please
> actually read the contents of my messages. What I said was "change the
> configure option and related documentat
On May 8, 2005, at 1:44 PM, Martin v. Löwis wrote:
> Shane Hathaway wrote:
>> Fair enough. The original point is that the documentation is unclear
>> about what a Py_UNICODE[] contains. I deduced that it contains either
>> UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong,
>
On May 8, 2005, at 5:28 AM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> All of my proposals for what to change the documentation to have been
>> shot down by Martin. If someone has better verbiage that they'd like
>> to see, I'd be perfectly happy to patch the doc.
>
> I don't look into the
On May 8, 2005, at 5:15 AM, Martin v. Löwis wrote:
> 'configure takes an option --enable-unicode, with the possible
> values "ucs2", "ucs4", "yes" (equivalent to no argument),
> and "no" (equivalent to --disable-unicode)'
>
> *THIS* documentation would break. This documentation is factually
> co
Shane Hathaway wrote:
> Fair enough. The original point is that the documentation is unclear
> about what a Py_UNICODE[] contains. I deduced that it contains either
> UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong,
> but others will probably guess wrong too. Something in t
M.-A. Lemburg wrote:
> All this talk about UTF-16 vs. UCS-2 is not very useful
> and strikes me as purely academic.
>
> The reference to possible breakage by slicing a Unicode and
> breaking a surrogate pair is valid, the idea of UCS-4 being
> less prone to breakage is a myth:
Fair enough. The or
Nicholas Bastin wrote:
> All of my proposals for what to change the documentation to have been
> shot down by Martin. If someone has better verbiage that they'd like
> to see, I'd be perfectly happy to patch the doc.
I don't look into the specific wording - you speak English much better
than I do
Nicholas Bastin wrote:
>> -1. This breaks existing documentation and usage, and provides only
>> minimum value.
>
>
> Have you been missing this conversation? UTF-16 is *WHAT PYTHON
> CURRENTLY IMPLEMENTS*. The current documentation is flat out wrong.
> Breaking that isn't a big problem in my
Nicholas Bastin wrote:
> I don't consider either alternative useless (well, I consider UCS-2 to
> be largely useless in the general case, but as we've already discussed
> here, Python isn't really UCS-2). However, I would be a lot happier if
> we just chose *one*, and all Pythons used that one.
M.-A. Lemburg wrote:
> I believe that it would be more appropriate to adjust the _tkinter
> module to adapt to the TCL Unicode size rather than
> forcing the complete Python system to adapt to TCL - I don't
> really see the point in an optional extension module
> defining the default for the interp
M.-A. Lemburg wrote:
> Unicode has many code points that are meant only for composition
> and don't have any standalone meaning, e.g. a combining acute
> accent (U+0301), yet they are perfectly valid code points -
> regardless of UCS-2 or UCS-4. It is easily possible to break
> such a combining seq
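
The point about combining sequences can be shown with a short sketch. It is written for Python 3, where indexing is always by code point, and the helper name `split_breaks_sequence` is made up for this illustration. Even with full code points, a slice can separate a base character from its combining mark.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one
# user-perceived character.
text = "e\u0301"


def split_breaks_sequence(s, index):
    """Return True if slicing s at index would cut a combining mark
    off from the character before it."""
    return 0 < index < len(s) and unicodedata.combining(s[index]) != 0


head = text[:1]  # just 'e'; the accent is left behind in text[1:]
```

Here `split_breaks_sequence(text, 1)` is true, so code that wants to slice safely would have to move the cut point past the combining mark, regardless of how wide the storage type is.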
Nicholas Bastin wrote:
> On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
>
>
>>However, I don't understand all the excitement
>>about Py_UNICODE: if you don't like the way this Python
>>typedef works, you are free to interface to Python using
>>any of the supported encodings using PyUnicode_Enco
On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
> Please upload your doc-patch to SF.
All of my proposals for what to change the documentation to have been
shot down by Martin. If someone has better verbiage that they'd like
to see, I'd be perfectly happy to patch the doc.
My last suggestion
On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
> However, I don't understand all the excitement
> about Py_UNICODE: if you don't like the way this Python
> typedef works, you are free to interface to Python using
> any of the supported encodings using PyUnicode_Encode()
> and PyUnicode_Decode()
Nicholas Bastin wrote:
> On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:
>>With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start
>>supporting the full Unicode ccs the same way it supports UCS-2.
>>Individual surrogate values remain accessible, and supporting
>>non-BMP characters is le
On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> --enable-unicode=ucs2
>>
>> be replaced with:
>>
>> --enable-unicode=utf16
>>
>> and the docs be updated to reflect more accurately the variance of the
>> internal storage type.
>
> -1. This breaks existing documentati
On May 7, 2005, at 9:24 AM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> Yes, but the important question here is why would we want that? Why
>> doesn't Python just have *one* internal representation of a Unicode
>> character? Having more than one possible definition just creates
>> proble
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>Hmm, looking at the configure.in script, it seems you're right.
>>I wonder why this weird dependency on TCL was added.
>
>
> If Python is configured for UCS-2, and Tcl for UCS-4, then
> Tkinter would not work out of the box. Hence the weird depen
Shane Hathaway wrote:
> Martin v. Löwis wrote:
>
>>Shane Hathaway wrote:
>>
>>
>>>I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
>>>widely used and takes less space, while UCS4 is easier to treat as an
>>>array of characters. Maybe we can have both: unicode objects start w
Shane Hathaway wrote:
> Py_UNICODE would always be 32 bits wide.
This would break PythonWin, which relies on Py_UNICODE being
the same as WCHAR_T. PythonWin is not broken, it just hasn't
been ported to UCS-4, yet (and porting this is difficult and
will cause a performance loss).
Regards,
Martin
Martin v. Löwis wrote:
> Shane Hathaway wrote:
>
>>I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
>>widely used and takes less space, while UCS4 is easier to treat as an
>>array of characters. Maybe we can have both: unicode objects start with
>>an internal representation
> Yes, but the first few steps are the same for nearly everyone, and
> people need more help taking the first few steps.
Contributions to the documentation are certainly welcome.
Regards,
Martin
Shane Hathaway wrote:
> I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
> widely used and takes less space, while UCS4 is easier to treat as an
> array of characters. Maybe we can have both: unicode objects start with
> an internal representation in UTF-16, but get promoted
Nicholas Bastin wrote:
> --enable-unicode=ucs2
>
> be replaced with:
>
> --enable-unicode=utf16
>
> and the docs be updated to reflect more accurately the variance of the
> internal storage type.
-1. This breaks existing documentation and usage, and provides only
minimum value.
With --enable-u
Nicholas Bastin wrote:
> Yes, but the important question here is why would we want that? Why
> doesn't Python just have *one* internal representation of a Unicode
> character? Having more than one possible definition just creates
> problems, and provides no value.
It does provide value, there ar
Martin v. Löwis wrote:
> Shane Hathaway wrote:
>>More generally, how should a non-unicode-expert writing Python extension
>>code find out the minimum they need to know about unicode to use the
>>Python unicode API? The API reference [1] ought to at least have a list
>>of background links. I had t
Martin v. Löwis wrote:
> Define correctly. Python, in ucs2 mode, will allow you to address individual
> surrogate codes, e.g. in indexing. So you get
>
>
> u"\U00012345"[0]
When Python encodes characters internally in UCS-2, I would expect
u"\U00012345" to produce a UnicodeError("character can not
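
For reference, the value that indexing would expose in a 16-bit build can be computed from the UTF-16 encoding of the character. The sketch below is written for Python 3, which no longer has the ucs2/ucs4 build distinction, so it derives the code units by encoding instead of by indexing. The helper name `utf16_code_units` is made up for this illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


units = utf16_code_units("\U00012345")
# units == [0xD808, 0xDF45]: a high surrogate followed by a low surrogate.
```

Under the behaviour described in the quoted message, `u"\U00012345"[0]` in a ucs2 build would be the first of these two units, the lone high surrogate 0xD808.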
On May 6, 2005, at 8:11 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> Well, this is a completely separate issue/problem. The internal
>> representation is UTF-16, and should be stated as such. If the
>> built-in methods actually don't work with surrogate pairs, then that
>> should be fi
On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> Yes. Not only in my mind, but in the Python source code. If
>> Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
>> otherwise the encoding is UTF-16 (*not* UCS-2).
>
> I see. Some people equate "encodi
Nicholas Bastin wrote:
> Well, this is a completely separate issue/problem. The internal
> representation is UTF-16, and should be stated as such. If the
> built-in methods actually don't work with surrogate pairs, then that
> should be fixed.
Yes to the former, no to the latter. PEP 261 speci
Nicholas Bastin wrote:
> Yes. Not only in my mind, but in the Python source code. If
> Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
> otherwise the encoding is UTF-16 (*not* UCS-2).
I see. Some people equate "encoding" with "encoding scheme";
neither UTF-32 nor UTF-16 is an
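
The distinction between a sequence of code units and a byte serialization can be made concrete. The sketch below, written for Python 3 with a made-up helper name `serialize`, encodes one non-BMP character with four standard codecs. The 16-bit code units are the same in both UTF-16 variants; only the byte order differs.

```python
def serialize(text):
    """Encode text with the explicit-endian UTF-16 and UTF-32 codecs."""
    names = ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
    return {name: text.encode(name) for name in names}


forms = serialize("\U00012345")
# forms["utf-16-le"] == b"\x08\xd8\x45\xdf"   (units 0xD808, 0xDF45)
# forms["utf-16-be"] == b"\xd8\x08\xdf\x45"
# forms["utf-32-le"] == b"\x45\x23\x01\x00"   (one unit, 0x00012345)
# forms["utf-32-be"] == b"\x00\x01\x23\x45"
```

An in-memory array of 16-bit or 32-bit integers corresponds to the code-unit view; a byte order only enters once those units are written out as bytes.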
Nicholas Bastin wrote:
> What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs.
> If it did, it would be called UTF-16. If Python really supported
> UCS-2, then surrogate pairs from UTF-16 inputs would either get turned
> into two garbage characters, or the "I couldn't transc
Shane Hathaway wrote:
> Ok. Thanks for helping me understand where Python is WRT unicode. I
> can work around the issues (or maybe try to help solve them) now that I
> know the current state of affairs. If Python correctly handled UTF-16
> strings internally, we wouldn't need the UCS-4 configura
On May 6, 2005, at 7:45 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> Because the encoding of that buffer appears to be different depending on
>> the configure options.
>
> What makes it appear so? sizeof(Py_UNICODE) changes when you change
> the option - does that, in your mind, mea
On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users. If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't s
M.-A. Lemburg wrote:
> Hmm, looking at the configure.in script, it seems you're right.
> I wonder why this weird dependency on TCL was added.
If Python is configured for UCS-2, and Tcl for UCS-4, then
Tkinter would not work out of the box. Hence the weird dependency.
Regards,
Martin
Nicholas Bastin wrote:
> No, that's not true. Python lets you choose UCS-4 or UCS-2. What the
> default is depends on your platform.
The truth is more complicated. If your Tcl is built for UCS-4, then
Python will also be built for UCS-4 (unless overridden by command line).
Otherwise, Python will
Nicholas Bastin wrote:
> Because the encoding of that buffer appears to be different depending on
> the configure options.
What makes it appear so? sizeof(Py_UNICODE) changes when you change
the option - does that, in your mind, mean that the encoding changes?
> If that isn't true, then someone n
Nicholas Bastin wrote:
> If this is the case, then we're clearly misleading users. If the
> configure script says UCS-2, then as a user I would assume that
> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
> doesn't support that.
What do you mean by that? That the interprete
Shane Hathaway wrote:
> Then something in the Python docs ought to say why UCS-2 is not what you
> want. I still don't know; I've heard differing opinions on the subject.
> Some say you'll never need more than what UCS-2 provides. Is that
> incorrect?
That clearly depends on who "you" is.
> Mo
On May 6, 2005, at 7:05 PM, Shane Hathaway wrote:
> Nicholas Bastin wrote:
>
>> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>>
>>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>>> UTF-32 characters, but never UCS-2? That's a big surprise to me. I may
>>> need t
Nicholas Bastin wrote:
> I'm not sure the Python documentation is the place to teach someone
> about unicode. The ISO 10646 pretty clearly defines UCS-2 as only
> containing characters in the BMP (plane zero). On the other hand, I
> don't know why python lets you choose UCS-2 anyhow, since it's a
Nicholas Bastin wrote:
> The important piece of information is that it is not guaranteed to be a
> particular one of those sizes. Once you can't guarantee the size, no
> one really cares what size it is.
Please trust many years of experience: This is just not true. People
do care, and they want t
Nicholas Bastin wrote:
>
> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>> UTF-32 characters, but never UCS-2? That's a big surprise to me. I may
>> need to change my PyXPCOM patch to fit this new understanding. I tried
On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
> Nicholas Bastin wrote:
>> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>> It means all the string operations treat strings as if they were
>>> UCS-2, but that in actuality, they are UTF-16. Same as the case in the
>>> windows APIs and
Nicholas Bastin wrote:
> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>It means all the string operations treat strings as if they were
>>UCS-2, but that in actuality, they are UTF-16. Same as the case in the
>>windows APIs and Java. That is, all string operations are essentially
>>broken,
On May 6, 2005, at 3:42 PM, James Y Knight wrote:
> On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users. If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose
After reading through the code and the comments in this thread, I
propose the following in the documentation as the definition of
Py_UNICODE:
"This type represents the storage type which is used by Python
internally as the basis for holding Unicode ordinals. Extension module
developers should
Nicholas Bastin wrote:
> On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
>
>
>>You've got that wrong: Python lets you choose UCS-4 -
>>UCS-2 is the default.
>
>
> No, that's not true. Python lets you choose UCS-4 or UCS-2. What the
> default is depends on your platform. If you run raw con
On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
> If this is the case, then we're clearly misleading users. If the
> configure script says UCS-2, then as a user I would assume that
> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
> doesn't support that. I would assume th
On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
> You've got that wrong: Python lets you choose UCS-4 -
> UCS-2 is the default.
No, that's not true. Python lets you choose UCS-4 or UCS-2. What the
default is depends on your platform. If you run raw configure, some
systems will choose UCS-
On May 6, 2005, at 3:25 AM, M.-A. Lemburg wrote:
> I don't see why you shouldn't use Py_UNICODE buffer directly.
> After all, the reason why we have that typedef is to make it
> possible to program against an abstract type - regardless of
> its size on the given platform.
Because the encoding of
On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
> You've got that wrong: Python lets you choose UCS-4 -
> UCS-2 is the default.
>
> Note that Python's Unicode codecs UTF-8 and UTF-16
> are surrogate aware and thus support non-BMP code points
> regardless of the build type: A UCS2-build of Pytho
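
The claim that the UTF-8 and UTF-16 codecs are surrogate aware can be checked with a round trip. The sketch is written for Python 3, so it demonstrates the codec behaviour only and not the 16-bit build the message refers to. The helper name `utf16le_to_utf8` is made up for this illustration.

```python
def utf16le_to_utf8(data):
    """Transcode UTF-16-LE bytes to UTF-8 by way of a Python string."""
    return data.decode("utf-16-le").encode("utf-8")


# The surrogate pair 0xD808 0xDF45 (U+12345) as little-endian bytes.
pair = b"\x08\xd8\x45\xdf"
utf8 = utf16le_to_utf8(pair)
# utf8 == b"\xf0\x92\x8d\x85": a single four-byte UTF-8 sequence, not two
# three-byte sequences for the individual surrogates.
```

The decoder combines the two surrogate code units into one code point, and the encoder then emits that code point as one UTF-8 sequence.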
Nicholas Bastin wrote:
> On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:
>
>
>>Nicholas Bastin wrote:
>>
>>>"This type represents the storage type which is used by Python
>>>internally as the basis for holding Unicode ordinals. Extension module
>>>developers should make no assumptions abo
Fredrik Lundh wrote:
> Thomas Heller wrote:
>
>
>>AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars,
>>independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro
>>can be used by extension writers to determine if Py_UNICODE is the same as
>>wchar_t.
>
>
> note th
Nicholas Bastin wrote:
> On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
>
>>>Nicholas Bastin wrote:
>>>
>>>
>>>>"This type represents the storage type which is used by Python
>>>>internally as the basis for holding Unicode ordinals. Extension module
>>>>developers should make no assumptio
Nicholas Bastin wrote:
>
> On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
>> On a related note, it would be helpful if the documentation provided a
>> little more background on unicode encoding. Specifically, that UCS-2 is
>> not the same as UTF-16, even though they're both two bytes wide and mos
On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
> Martin v. Löwis wrote:
>> Nicholas Bastin wrote:
>>
>>> "This type represents the storage type which is used by Python
>>> internally as the basis for holding Unicode ordinals. Extension module
>>> developers should make no assumptions abo
On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>> "This type represents the storage type which is used by Python
>> internally as the basis for holding Unicode ordinals. Extension module
>> developers should make no assumptions about the size of this type on a
Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>
>>"This type represents the storage type which is used by Python
>>internally as the basis for holding Unicode ordinals. Extension module
>>developers should make no assumptions about the size of this type on
>>any given platform."
>
>
> But
Nicholas Bastin wrote:
> "This type represents the storage type which is used by Python
> internally as the basis for holding Unicode ordinals. Extension module
> developers should make no assumptions about the size of this type on
> any given platform."
But people want to know "Is Python's Un
"Fredrik Lundh" <[EMAIL PROTECTED]> writes:
> Thomas Heller wrote:
>
>> AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars,
>> independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro
>> can be used by extension writers to determine if Py_UNICODE is the same as
>>
Thomas Heller wrote:
> AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars,
> independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro
> can be used by extension writers to determine if Py_UNICODE is the same as
> wchar_t.
note that "usable" is more than just "same
On May 4, 2005, at 1:02 PM, Michael Hudson wrote:
> Nicholas Bastin <[EMAIL PROTECTED]> writes:
>
>> The current documentation for Py_UNICODE states:
>>
>> "This type represents a 16-bit unsigned storage type which is used by
>> Python internally as basis for holding Unicode ordinals. On platfor
Nicholas Bastin <[EMAIL PROTECTED]> writes:
> The current documentation for Py_UNICODE states:
>
> "This type represents a 16-bit unsigned storage type which is used by
> Python internally as basis for holding Unicode ordinals. On platforms
> where wchar_t is available and also has 16-bits, P