Nick Coghlan writes:
> GvR writes:
> > Let's just define a Unicode string to be a sequence of code points and
> > let libraries deal with the rest. Ok, methods like lower() should
> > consider characters, but indexing/slicing should refer to code points.
> > Same for '=='; we can have a libra
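(Illustration, not part of the quoted message: a minimal sketch of the distinction drawn above, assuming a Python 3 interpreter. Indexing, len() and '==' work on code points, while "same character" comparison is left to a library, here the stdlib unicodedata module. The helper name canonically_equal is hypothetical.)

```python
import unicodedata

# Two spellings of "é": one precomposed code point vs. "e" + combining acute.
composed = "\u00e9"
decomposed = "e\u0301"


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# len() and '==' see code points, so the two spellings differ.
code_point_counts = (len(composed), len(decomposed))
plain_equal = composed == decomposed

# A library-level comparison treats them as the same text.
normalized_equal = canonically_equal(composed, decomposed)

print(code_point_counts, plain_equal, normalized_equal)
```

The design choice sketched here is the one proposed in the message: keep the core string type simple and put normalization-aware comparison in a library function.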
On 8/24/2011 7:29 PM, Guido van Rossum wrote:
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin'
standards." http://en.wikipedia.org/wiki/Stinking_badges :-)
Which deserves an appropriate, follow-on, misquote:
Guido says the Unicode standard stinks.
˚͜˚ <- and a Unicode smiley
"Martin v. Löwis", 24.08.2011 20:15:
Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voices supporting the PEP in principle
Absolutely.
- conditions you would like to pose on the implementation before
Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
PyO
Guido van Rossum writes:
> On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
> wrote:
> > Strings contain Unicode code units, which for most purposes can be
> > treated as Unicode characters. However, even as "simple" an
> > operation as "s1[0] == s2[0]" cannot be relied upon t
On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum wrote:
>> With narrow builds, code units can currently come into play
>> internally, but with PEP 393 everything internal will be working
>> directly with code points. Normalisation, combining characters and
>> bidi issues may still affect the corr
On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan wrote:
> On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote:
>> Now I am happy to admit that for many Unicode issues the level at
>> which we have currently defined things (code units, I think -- the
>> thingies that encodings are made of) is co
On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote:
> Now I am happy to admit that for many Unicode issues the level at
> which we have currently defined things (code units, I think -- the
> thingies that encodings are made of) is confusing, and it would be
> better to switch to the others (
On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull wrote:
> Guido van Rossum writes:
>
> > I see nothing wrong with having the language's fundamental data types
> > (i.e., the unicode object, and even the re module) to be defined in
> > terms of codepoints, not characters, and I see nothing w
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
wrote:
> Terry Reedy writes:
>
> > Please suggest a re-wording then, as it is a bug for doc and behavior to
> > disagree.
>
> Strings contain Unicode code units, which for most purposes can be
> treated as Unicode characters. However, e
Guido van Rossum writes:
> I see nothing wrong with having the language's fundamental data types
> (i.e., the unicode object, and even the re module) to be defined in
> terms of codepoints, not characters, and I see nothing wrong with
> len() returning the number of codepoints (as long as it i
Terry Reedy writes:
> Please suggest a re-wording then, as it is a bug for doc and behavior to
> disagree.
Strings contain Unicode code units, which for most purposes can be
treated as Unicode characters. However, even as "simple" an
operation as "s1[0] == s2[0]" cannot be relied
Antoine Pitrou writes:
> Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
> > Antoine Pitrou writes:
> > > On Thu, 25 Aug 2011 01:34:17 +0900
> > > "Stephen J. Turnbull" wrote:
> > > >
> > > > Martin has long claimed that the fact that I/O is done in terms of
> > > >
On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman wrote:
> It would seem helpful if the stdlib could have some support for efficient
> handling of Unicode characters in some representation. It would help
> address the class of applications that does care.
I claim that we have insufficient underst
On 25 August 2011 07:10, Victor Stinner wrote:
>
> I used stringbench and "./python -m test test_unicode". I plan to try
> iobench.
>
> Which other benchmark tool should be used? Should we write a new one?
I think that the PyPy benchmarks (or at least selected tests such as
slowspitfire) would p
On 8/24/2011 12:34 PM, Guido van Rossum wrote:
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote:
On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I
> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length
> For Windows users, I believe it will nearly double the memory footprint
> if there are any non-BMP chars. On my new machine, I should not mind
> that in exchange for correct behavior.
In addition, strings with non-BMP chars are much more rare than strings
with all Latin-1, for which memory usage
Le mercredi 24 août 2011 20:52:51, Glenn Linderman a écrit :
> Given the required variability of character size in all presently
> Unicode defined encodings, I tend to agree with Tom that UTF-8, together
> with some technique of translating character index to code unit offset,
> may provide the bes
Terry Reedy wrote:
PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint
if there are any non-BMP chars. On my new machine, I should
On 8/24/2011 1:45 PM, Victor Stinner wrote:
Le 24/08/2011 02:46, Terry Reedy a écrit :
I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
E
In article <20110824205047.6be49...@pitrou.net>,
Antoine Pitrou wrote:
> On Wed, 24 Aug 2011 11:37:20 -0700
> Ned Deily wrote:
> > In article <20110824184927.2697b...@pitrou.net>,
> > Antoine Pitrou wrote:
> > > On Wed, 24 Aug 2011 15:31:50 +0200
> > > Charles-François Natali wrote:
> > > > >
In article
,
Charles-Francois Natali wrote:
> > But Snow Leopard, where these failures occur, is OS X 10.6.
>
> *sighs*
> It still looks like a kernel/libc bug to me: AFAICT, both the code and
> the tests are correct.
> And apparently, there are still issues pertaining to FD passing on
> 10.5 (
On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
Terry Reedy writes:
> Excuse me for believing the fine 3.2 manual that says
> "Strings contain Unicode characters."
The manual is wrong, then, subject to a pronouncement to the contrary,
Please suggest a re-wording then, as it is a bug f
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote:
> On 8/24/2011 9:00 AM, Stefan Behnel wrote:
>
> Nick Coghlan, 24.08.2011 15:06:
>
> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototyp
> But Snow Leopard, where these failures occur, is OS X 10.6.
*sighs*
It still looks like a kernel/libc bug to me: AFAICT, both the code and
the tests are correct.
And apparently, there are still issues pertaining to FD passing on
10.5 (and maybe later, I couldn't find a public access to their bug
On 8/24/2011 9:00 AM, Stefan Behnel wrote:
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of a different solution to the
'mostly
BMP chars, few non-BMP chars'
On Wed, 24 Aug 2011 11:37:20 -0700
Ned Deily wrote:
> In article <20110824184927.2697b...@pitrou.net>,
> Antoine Pitrou wrote:
> > On Wed, 24 Aug 2011 15:31:50 +0200
> > Charles-François Natali wrote:
> > > > The buildbots are complaining about some of the tests for the new
> > > > socket.sendmsg/
On 8/24/2011 1:50 PM, "Martin v. Löwis" wrote:
I'd like to point out that the improved compatibility is only a side
effect, not the primary objective of the PEP.
Then why does the Rationale start with "on systems only supporting
UTF-16, users complain that non-BMP characters are not properly
In article <20110824184927.2697b...@pitrou.net>,
Antoine Pitrou wrote:
> On Wed, 24 Aug 2011 15:31:50 +0200
> Charles-François Natali wrote:
> > > The buildbots are complaining about some of the tests for the new
> > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > > provide
On Wed, 24 Aug 2011 20:15:24 +0200
"Martin v. Löwis" wrote:
> - issues to be considered (unclarities, bugs, limitations, ...)
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects
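(Illustration, not part of the quoted message: a minimal sketch of the arithmetic behind the figure above. The count of 10 pointer-sized words is taken from the message; the function name header_overhead is hypothetical, and the result says nothing about what a given CPython build actually allocates.)

```python
import struct

POINTER_WORDS = 10  # figure quoted in the message above


def header_overhead(pointer_size: int, words: int = POINTER_WORDS) -> int:
    """Bytes used by `words` pointer-sized fields, for a given pointer size."""
    return pointer_size * words


overhead_64bit = header_overhead(8)  # 8-byte pointers
overhead_32bit = header_overhead(4)  # 4-byte pointers
# struct.calcsize("P") is the size of a C pointer on the running interpreter.
overhead_here = header_overhead(struct.calcsize("P"))

print(overhead_64bit, overhead_32bit, overhead_here)
```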
Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voices supporting the PEP in principle, so I'm now interested in
comments in the following areas:
- principle objection. I'll list them in the PEP.
- issues t
Le 24/08/2011 11:22, Glenn Linderman a écrit :
c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
I see neither a need nor a means to consider these.
The discussion about "mostly ASCII" strings seems convincing that there
c
>> Eg, display of characters in the interpreter.
>
> I don't know why you say it's "done in terms of UTF-16", then. Unicode
> strings are simply encoded to whatever character set is detected as the
> terminal's character set.
I think what he means (and what I meant when I said something similar):
> > PEP 393 abolishes narrow builds as we now know them and changes
> > semantics. I was answering a complaint about that change. If you do
> > not like the PEP, fine.
>
> No, I do like the PEP. However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Uni
Le 24/08/2011 02:46, Terry Reedy a écrit :
On 8/23/2011 9:21 AM, Victor Stinner wrote:
Le 23/08/2011 15:06, "Martin v. Löwis" a écrit :
Well, things have to be done in order:
1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied.
Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
> Antoine Pitrou writes:
> > On Thu, 25 Aug 2011 01:34:17 +0900
> > "Stephen J. Turnbull" wrote:
> > >
> > > Martin has long claimed that the fact that I/O is done in terms of
> > > UTF-16 means that the internal representati
Antoine Pitrou writes:
> On Thu, 25 Aug 2011 01:34:17 +0900
> "Stephen J. Turnbull" wrote:
> >
> > Martin has long claimed that the fact that I/O is done in terms of
> > UTF-16 means that the internal representation is UTF-16
>
> Which I/O?
Eg, display of characters in the interpreter.
+1 for FileSystemError. I see myself misspelling it as FileSystemError if we
go with the alternate spelling. I probably won't be the only one.
Thank you,
Vlad
On Wed, Aug 24, 2011 at 4:09 AM, Eli Bendersky wrote:
>
> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSys
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali wrote:
> > The buildbots are complaining about some of the tests for the new
> > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > provide CMSG_LEN.
>
> Looks like kernel bugs:
> http://developer.apple.com/library/mac/#q
On Thu, 25 Aug 2011 01:34:17 +0900
"Stephen J. Turnbull" wrote:
>
> Martin has long claimed that the fact that I/O is done in terms of
> UTF-16 means that the internal representation is UTF-16
Which I/O?
___
Python-Dev mailing list
Python-Dev@python
Terry Reedy writes:
> Excuse me for believing the fine 3.2 manual that says
> "Strings contain Unicode characters."
The manual is wrong, then, subject to a pronouncement to the contrary,
of course. I was on your side of the fence when this was discussed,
pre-release. I was wrong then. My bet
Nick Coghlan, 24.08.2011 15:06:
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of a different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character from
> The buildbots are complaining about some of the tests for the new
> socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> provide CMSG_LEN.
Looks like kernel bugs:
http://developer.apple.com/library/mac/#qa/qa1541/_index.html
"""
Yes. Mac OS X 10.5 fixes a number of kernel bugs rela
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of a different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an a
The buildbots are complaining about some of the tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
provide CMSG_LEN.
http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/831/steps/test/logs/stdio
Before I start trying to figure thi
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?
>
> (*) http://bugs.python.org/issue12555
>
+1 for FileSystemError
Eli
>> I think the value for wstr/uninitialized/reserved should not be
>> removed. The wstr representation is still used in the error case in
>> the utf8 decoder because these strings can be resized.
>
> In Python, you can resize an object if it has only one reference. Why is
> it not possible in you
On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:
Terry Reedy writes:
> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may ha
Am 24.08.2011 10:17, schrieb Victor Stinner:
> Le 24/08/2011 04:41, Torsten Becker a écrit :
>> On Tue, Aug 23, 2011 at 18:27, Victor Stinner
>> wrote:
>>> I posted a patch to re-add it:
>>> http://bugs.python.org/issue12819#msg142867
>>
>> Thank you for the patch! Note that this patch adds the
On 8/24/2011 4:11 AM, Victor Stinner wrote:
> Le 24/08/2011 06:59, Scott Dial a écrit :
>> On 8/23/2011 6:38 PM, Victor Stinner wrote:
>>> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
- You could try to run stringbench, which can be found at
http://svn.python.org/projects/s
On 8/24/2011 1:18 AM, "Martin v. Löwis" wrote:
So am I correctly reading between the lines when, after reading this
thread so far, and the complete issue discussion so far, that I see a
PEP 393 revision or replacement that has the following characteristics:
1) Narrow builds are dropped.
PEP 393
On 24Aug2011 12:31, Nick Coghlan wrote:
| On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote:
| > Antoine Pitrou wrote:
| >> When reviewing the PEP 3151 implementation (*), Ezio commented that
| >> "FileSystemError" looks a bit strange and that "FilesystemError" would
| >> be a better spellin
Le 24/08/2011 04:56, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:56, Victor Stinner
wrote:
kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().
If it can be removed, it would be nice to have kind in
Terry Reedy writes:
> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may have
> been excusable when there essentially were no su
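(Illustration, not part of the quoted message: a minimal sketch of the mismatch described above, assuming Python 3.3 or later, where len() counts code points. A narrow UCS2 build would have reported the UTF-16 code unit count instead; that count is reproduced here by encoding. The helper name utf16_code_units is hypothetical.)

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to store `s` as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# One non-BMP code point (U+1F600) between two ASCII letters.
text = "a\U0001F600b"

code_points = len(text)              # what a code-point-based string reports
code_units = utf16_code_units(text)  # what a UTF-16 code unit view reports

# The non-BMP character needs a surrogate pair, so the two counts differ by one.
print(code_points, code_units)
```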
> So am I correctly reading between the lines when, after reading this
> thread so far, and the complete issue discussion so far, that I see a
> PEP 393 revision or replacement that has the following characteristics:
>
> 1) Narrow builds are dropped.
PEP 393 already drops narrow builds.
> 2) The
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:27, Victor Stinner
wrote:
I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this patch adds the fast path only
to the helper function which determines the len
Le 24/08/2011 06:59, Scott Dial a écrit :
On 8/23/2011 6:38 PM, Victor Stinner wrote:
Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
- You could try to run stringbench, which can be found at
http://svn.python.org/projects/sandbox/trunk/stringbench (*)
and there's iobench (the te
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote:
Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future version of Python (say e.g
On 8/23/2011 5:46 PM, Terry Reedy wrote:
On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
Am 23.08.2011 11:46, schrieb Xavier Morel:
Mostly ascii is pretty common for western-european languages
(French, for
instance, is probably 90 to 95% ascii). It's also a risk in english,
when
the writer "co
Nick Coghlan writes:
> Since I tend to use the one word 'filesystem' form myself (ditto for
> 'filename'), I'm +1 for FilesystemError, but I'm only -0 for
> FileSystemError (so I expect that will be the option chosen, given
> other responses).
I slightly prefer FilesystemError because it pars
Torsten Becker, 24.08.2011 04:41:
Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").
You could just add yet another field "any", i.e.
union {
unsigned char* latin1;
Py_UCS2* ucs2;
Py_UCS4*