[issue12266] str.capitalize contradicts oneself

2011-08-15 Thread Roundup Robot

Roundup Robot  added the comment:

New changeset d3816fa1bcdf by Ezio Melotti in branch '2.7':
#12266: move the tests in test_unicode.
http://hg.python.org/cpython/rev/d3816fa1bcdf

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12711] Explain tracker components in devguide

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

Fixed in http://hg.python.org/devguide/rev/c9dd231b0940

--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed




[issue12746] normalization is affected by unicode width

2011-08-15 Thread STINNER Victor

STINNER Victor  added the comment:

See also #12737.

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-15 Thread STINNER Victor

STINNER Victor  added the comment:

See also #12746.

--




[issue12746] normalization is affected by unicode width

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen :


--
nosy: +tchrist




[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

> Keep in mind that we should be able to access and use lone surrogates too, 
> therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?
> s[0]  # error here too?
> list(s)  # here too?
> 
> p = s + '\udc00'
> len(p)  # 1?
> s[0]  # '\U00010000' ?
> s[1]  # IndexError?
> list(p + 'a')  # ['\ud800\udc00', 'a']?
> 
> We can still decide that strings with lone surrogates work only with a 
> limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
> 
> Another thing I noticed is that (at least on wide builds) surrogate pairs are 
> not joined "on the fly":
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1

Hi Tom,

welcome to Python land :-) Here's some more background information
on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in
Unicode objects and ones encoded in transfer formats by codecs.

We generally do allow lone surrogates, unassigned code
points, lone combining code points, etc. in Unicode objects
since Python needs to be able to work on all Unicode code points
and build strings with them.

The transfer format codecs do try to combine surrogates
on decoding data on UCS4 builds. On UCS2 builds they create
surrogate pairs as necessary. On output, those pairs will again
be joined to get round-trip safety.
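
The distinction between code points held in a string and the transfer formats handled by codecs can be sketched as follows. This is only an illustration and assumes a current Python 3 interpreter, where the UTF-16 encoder rejects surrogates unless the `surrogatepass` error handler is given (the 2011 builds discussed in this thread behaved differently). The variable names are made up for the example.

```python
# A string may hold lone surrogates as ordinary code points.
high = '\ud800'
pair = high + '\udc00'      # two separate code points, not joined on the fly
pair_len = len(pair)

# Passing the pair through the UTF-16 transfer format joins it into a single
# supplementary code point. 'surrogatepass' lets the encoder accept surrogates.
joined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')
joined_len = len(joined)

print(pair_len, joined_len, hex(ord(joined)))
```

The string object keeps whatever code points it was given; only the round trip through the codec combines the pair.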

It helps if you think of Python's Unicode objects using UCS2
and UCS4 instead of UTF-16/32. Python does try to make working
with UCS2 easy and in many cases behaves as if it were using
UTF-16 internally, but there are, of course, limits to this. In
practice, you only rarely get to see any of these special cases,
since non-BMP code points are usually not found in everyday
use. If they do become a problem for you, you have the option
of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started
Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the
terminology of those versions, some of which has changed in
more recent versions of Unicode.

For more background information, you might want to take a look
at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened: you'll also find that
collation and compression were already on the plate back then,
but since no one stepped forward, they weren't implemented.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



--
nosy: +lemburg
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> 
Python lib re cannot handle Unicode properly due to narrow/wide bug




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread STINNER Victor

New submission from STINNER Victor :

A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") 
surrogates. Each function has from one to three different implementations. The 
new decode_ucs4() function adds a new implementation. Attached patch replaces 
this code by macros.

I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE 
are important for speed. ((ch & 0xFC00UL) == 0xD800) (from decode_ucs4) is 
*a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 
@ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

These 3 macros have to be checked, I wrote the first one:

#define IS_SURROGATE(ch) (((ch) & 0xF800UL) == 0xD800)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFC00UL) == 0xD800)
#define IS_LOW_SURROGATE(ch) (((ch) & 0xFC00UL) == 0xDC00)

I added cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if 
Py_UNICODE is 16 bits (narrow build). It's maybe useless.

#define COMBINE_SURROGATES(ch1, ch2) \
 (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been 
preprocessed to fit in [0; 0x]. I added this requirement in the comment of 
these macros. It would be better to have only one macro to do the two 
operations, but because "*p++" (dereference and increment) is usually used, I 
prefer to avoid one unique macro (I don't like passing *p++ in a macro using 
its argument more than once).

Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid a useless test on 
ch2 on narrow build.

I also added an IS_NONBMP macro just because I prefer macros over hardcoded 
constants.
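
For readers who want to check the bit arithmetic, here is a rough Python transcription of the surrogate tests and of the pair combination. It is a sketch, not the patch: the function names are hypothetical, the mask tests are only verified for 16-bit code units (a 32-bit ordinal would need its upper bits covered by the mask as well), and `split_surrogates` assumes the standard UTF-16 rule of subtracting 0x10000 first, which leaves a 20-bit offset.

```python
def is_surrogate(ch):
    return (ch & 0xF800) == 0xD800

def is_high_surrogate(ch):
    return (ch & 0xFC00) == 0xD800

def is_low_surrogate(ch):
    return (ch & 0xFC00) == 0xDC00

def combine_surrogates(high, low):
    # Ten payload bits from each surrogate, then the 0x10000 offset.
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000

def split_surrogates(code_point):
    # Assumes 0x10000 <= code_point <= 0x10FFFF.
    offset = code_point - 0x10000          # now a 20-bit value
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)

# The mask tests agree with the range comparisons for every 16-bit code unit.
masks_match_ranges = all(
    is_high_surrogate(ch) == (0xD800 <= ch <= 0xDBFF)
    and is_low_surrogate(ch) == (0xDC00 <= ch <= 0xDFFF)
    and is_surrogate(ch) == (0xD800 <= ch <= 0xDFFF)
    for ch in range(0x10000)
)

print(masks_match_ranges, hex(combine_surrogates(0xD800, 0xDC00)))
```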

--
files: unicode_macros.patch
keywords: patch
messages: 142108
nosy: benjamin.peterson, ezio.melotti, haypo, lemburg, loewis, pitrou, tchrist, 
terry.reedy
priority: normal
severity: normal
status: open
title: Use macros for surrogates in unicodeobject.c
versions: Python 3.3
Added file: http://bugs.python.org/file22901/unicode_macros.patch




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread STINNER Victor

STINNER Victor  added the comment:

We may use the following unlikely macro for IS_SURROGATE, IS_HIGH_SURROGATE and 
IS_LOW_SURROGATE:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

I suppose that we should use microbenchmarks to validate these macros?

Should I open a new issue for this idea?

--




[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

So the issue here is that when combining chars are used, str.title() fails to 
titlecase the string properly.

The algorithm implemented by str.title() [0] is quite simple: it loops through 
the code units, and uppercases all the chars that follow a char that is not 
lower/upper/titlecased.
This means that if Déme doesn't use combining accents, the char before the 'm' 
is 'é', 'é' is a lowercase char, so 'm' is not capitalized.
If the 'é' is represented as 'e' + '´', the char before the 'm' is '´', '´' is 
not a lower/upper/titlecase char, so the 'm' is capitalized.

I guess we could normalize the string before doing the title casing, and then 
normalize it back.
Also the str methods don't claim to follow Unicode afaik, so unless we decide 
that they should, we could implement whatever algorithm we want.

[0]: Objects/unicodeobject.c:6752
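
The loop described above, and the normalize-first idea, can be sketched in Python. This is a simplified re-implementation for illustration only, not the code in unicodeobject.c: `naive_title` and `title_via_nfc` are hypothetical names, the example uses the combining acute U+0301 obtained through NFD, and `title_via_nfc` assumes the caller wants the result back in decomposed form.

```python
import unicodedata

def naive_title(text):
    """Uppercase every char that follows a char that is not cased."""
    out = []
    previous_is_cased = False
    for ch in text:
        out.append(ch.lower() if previous_is_cased else ch.upper())
        previous_is_cased = ch.islower() or ch.isupper() or ch.istitle()
    return ''.join(out)

def title_via_nfc(text):
    """Compose first, titlecase, then decompose again."""
    composed = unicodedata.normalize('NFC', text)
    return unicodedata.normalize('NFD', naive_title(composed))

precomposed = 'd\u00e9me'                                  # 'é' as one code point
decomposed = unicodedata.normalize('NFD', precomposed)     # 'e' + U+0301

print(ascii(naive_title(precomposed)))
print(ascii(naive_title(decomposed)))
print(ascii(title_via_nfc(decomposed)))
```

In the decomposed string the combining mark is not cased, so the naive loop treats the following 'm' as the start of a new word.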

--




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

This has been proposed already in #10542 (the issue also has patches).

--




[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

If the regex module works fine here, I think it's better to leave the re module 
alone and include the regex module in 3.3.

--




[issue12734] Request for property support in Python re lib

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

This indeed should be "fixed" by replacing 're' with 'regex'.  So I would 
suggest focusing your tests on 'regex' and reporting them there, so that possible 
bugs get fixed and tested before we include the module in the stdlib.

--




[issue12733] Request for grapheme support in Python re lib

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

As I said on #12734 and #12731, if the 'regex' module addresses this issue, we 
should just wait until we include it in the stdlib.

--




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

This is actually a duplicate of #9200.

@Terry

> Besides which, all I see (on Windows) in Firefox is things like
> "Γ°ΒΒΒΌΓ°ΒΒΒ―Γ°Ββ€˜β€¦Γ°ΒΒΒ¨Γ°Ββ€˜β€°Γ°ΒΒΒ―Γ°ΒΒΒ»".

Encoding problem.  Firefox thinks this is some iso-8859-*.  You can fix this 
selecting 'Unicode (UTF-8)' from "View -> Character Encoding".

> IDLE just has empty boxes.

This is most likely because it doesn't use a font able to display those chars.
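
The kind of garbling caused by a wrong encoding guess can be reproduced in a few lines. This is a sketch under the assumption that the page's UTF-8 bytes were read as ISO-8859-1; the sample text (two Deseret letters) and the variable names are chosen for the example.

```python
original = '\U0001043C\U0001042F'      # two Deseret letters outside the BMP
utf8_bytes = original.encode('utf-8')  # four bytes per character

# A viewer that guesses ISO-8859-1 shows one character per byte.
misread = utf8_bytes.decode('iso-8859-1')

# Undoing the wrong guess recovers the text, since no bytes were lost.
repaired = misread.encode('iso-8859-1').decode('utf-8')

print(len(original), len(misread), repaired == original)
```

Switching the browser to UTF-8 has the same effect as the last step: the bytes are intact and are simply decoded with the right codec.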

--
resolution:  -> duplicate
stage: needs patch -> committed/rejected
status: open -> closed
superseder:  -> str.isprintable() is always False for large code points




[issue9200] Make str methods work with non-BMP chars on narrow builds

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

I closed #12730 as a duplicate of this and updated the title of this issue.

--
title: str.isprintable() is always False for large code points -> Make str 
methods work with non-BMP chars on narrow builds




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

See also #12751.

--
nosy: +tchrist




[issue9200] Make str methods work with non-BMP chars on narrow builds

2011-08-15 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +tchrist




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Julian Taylor

New submission from Julian Taylor :

using unicode strings for locale.normalize gives following traceback with 
python2.7:

~$ python2.7 -c 'import locale; locale.normalize(u"en_US")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/locale.py", line 358, in normalize
    fullname = localename.translate(_ascii_lower_map)
TypeError: character mapping must return integer, None or unicode

with python2.6 it works and it also works with non-unicode strings in 2.7

--
components: Unicode
messages: 142118
nosy: jtaylor
priority: normal
severity: normal
status: open
title: locale.normalize does not take unicode strings
versions: Python 2.7




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +ezio.melotti
stage:  -> test needed




[issue12204] str.upper converts to title

2011-08-15 Thread Roundup Robot

Roundup Robot  added the comment:

New changeset 16edc5cf4a79 by Ezio Melotti in branch '3.2':
#12204: document that str.upper().isupper() might be False and add a note about 
cased characters.
http://hg.python.org/cpython/rev/16edc5cf4a79

New changeset fb49394f75ed by Ezio Melotti in branch '2.7':
#12204: document that str.upper().isupper() might be False and add a note about 
cased characters.
http://hg.python.org/cpython/rev/fb49394f75ed

New changeset c821e3a54930 by Ezio Melotti in branch 'default':
#12204: merge with 3.2.
http://hg.python.org/cpython/rev/c821e3a54930

--
nosy: +python-dev




[issue12204] str.upper converts to title

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

Fixed, thanks for the report!

--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed




[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-15 Thread Matthew Barnett

Matthew Barnett  added the comment:

For what it's worth, I've had an idea about string storage, roughly based on how 
*nix stores data on disk.

If a string is small, point to a block of codepoints.

If a string is medium-sized, point to a block of pointers to codepoint blocks.

If a string is large, point to a block of pointers to pointer blocks.

This means that a large string doesn't need a single large allocation.

The level of indirection can be increased as necessary.

For simplicity, all codepoint blocks contain the same number of codepoints, 
except the final codepoint block, which may contain fewer.

A codepoint block may use the minimum width necessary (1, 2 or 4 bytes) to 
store all of its codepoints.

This means that there are no surrogates and that different sections of the 
string can be stored in different widths to reduce memory usage.
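
A toy version of the idea, limited to a single level of indirection, might look like the following. Everything here is hypothetical: the `BlockString` class, the block size of 4 and the use of the `array` module are choices made for the sketch, and the deeper pointer-block levels described above are left out.

```python
from array import array

def _narrowest_typecode(code_points):
    """Pick the smallest unsigned array type that holds every code point."""
    highest = max(code_points)
    if highest < 0x100:
        return 'B'      # 1 byte per code point
    if highest < 0x10000:
        return 'H'      # 2 bytes per code point
    return 'I'          # 4 bytes on common platforms

class BlockString:
    """A list of fixed-size code point blocks, each with its own width."""

    def __init__(self, text, block_size=4):
        self.block_size = block_size
        self.length = len(text)
        self.blocks = []
        for start in range(0, len(text), block_size):
            code_points = [ord(ch) for ch in text[start:start + block_size]]
            typecode = _narrowest_typecode(code_points)
            self.blocks.append(array(typecode, code_points))

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError('BlockString index out of range')
        block, offset = divmod(index, self.block_size)
        return chr(self.blocks[block][offset])

text = 'abcd\u00e9\u20acxy\U00010400z'
sample = BlockString(text)
print([block.typecode for block in sample.blocks])
```

Because every block except the last holds the same number of code points, indexing stays a constant-time `divmod` even though the blocks differ in width.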

--




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Julian Taylor

Julian Taylor  added the comment:

this is a regression introduced by fixing http://bugs.python.org/issue1813

This breaks some user code, e.g. wx.Locale.GetCanonicalName returns unicode.
Example bugs:
https://bugs.launchpad.net/ubuntu/+source/update-manager/+bug/824734
https://bugs.launchpad.net/ubuntu/+source/playonlinux/+bug/825421

--




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

Julian Taylor wrote:
> 
> New submission from Julian Taylor :
> 
> using unicode strings for locale.normalize gives following traceback with 
> python2.7:
> 
> ~$ python2.7 -c 'import locale; locale.normalize(u"en_US")'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/locale.py", line 358, in normalize
>     fullname = localename.translate(_ascii_lower_map)
> TypeError: character mapping must return integer, None or unicode
> 
> with python2.6 it works and it also works with non-unicode strings in 2.7

This looks like a side-effect of the change Antoine made to the locale
module when trying to make the case mapping work in a non-locale
dependent way.
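
As an illustration of locale-independent case mapping, here is a small Python 3 sketch. It is not the code in locale.py: `ASCII_LOWER_MAP` and `ascii_lower` are hypothetical names, and the sketch only shows the mapping form that a text string's translate() accepts (ordinals mapped to ordinals). The 2.7 traceback suggests, presumably, that a table built for byte strings was handed to a unicode string.

```python
# Map only the ordinals of 'A'..'Z' to 'a'..'z'; every other character,
# including non-ASCII letters, is left untouched.
ASCII_LOWER_MAP = {upper: upper + 32 for upper in range(ord('A'), ord('Z') + 1)}

def ascii_lower(name):
    return name.translate(ASCII_LOWER_MAP)

lowered = ascii_lower('en_US.UTF-8')
untouched = ascii_lower('\u0130STANBUL')   # dotted capital I is not ASCII

print(lowered, ascii(untouched))
```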

--
nosy: +lemburg




[issue12204] str.upper converts to title

2011-08-15 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

Are you sure this should have been backported?  Are there any apps that may be 
working now but won't be after the next point release?

--
nosy: +rhettinger




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

> HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument
> has been preprocessed to fit in [0; 0x]. I added this requirement in
> the comment of these macros.

The macros should preprocess the argument themselves. It will make the
code even simpler.
Otherwise +1.

--




[issue12204] str.upper converts to title

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

This is only a doc patch, maybe you are confusing this issue with #12266?

--




[issue12204] str.upper converts to title

2011-08-15 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

Right.  I was looking at the other patches that went in in the last 24 hours.

--




[issue12204] str.upper converts to title

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

It's unlikely that #12266 might break apps.  The behavior changed only for 
fairly unusual characters, and the old behavior was clearly wrong.
FWIW the str.capitalize() implementation of PyPy doesn't have the bug, and 
after the fix both CPython and PyPy have the same behavior.

--




[issue12750] datetime.datetime timezone problems

2011-08-15 Thread R. David Murray

R. David Murray  added the comment:

In what way does 'replace' not satisfy your need to set the tzinfo?

As for utcnow, we can't change what it returns for backward compatibility 
reasons, but you can get a non-naive utc datetime by doing 
datetime.now(timezone.utc).  (I must admit, however, that at least this morning 
I can't wrap my head around how that works based on the docs :(.

--
nosy: +r.david.murray




[issue12750] datetime.datetime timezone problems

2011-08-15 Thread Daniel O'Connor

Daniel O'Connor  added the comment:

On 15/08/2011, at 23:39, R. David Murray wrote:
> R. David Murray  added the comment:
> 
> In what way does 'replace' not satisfy your need to set the tzinfo?

Ahh that would work, although it is pretty clumsy since you have to specify 
everything else as well.

In the end I used calendar.timegm (which I only found out about after this).

> As for utcnow, we can't change what it returns for backward compatibility 
> reasons, but you can get a non-naive utc datetime by doing

That is a pity :(

> datetime.now(timezone.utc).  (I must admit, however, that at least this 
> morning I can't wrap my head around how that works based on the docs :(.

OK.. I am only using 2.7 so I can't try that :)


--




[issue12750] datetime.datetime timezone problems

2011-08-15 Thread R. David Murray

R. David Murray  added the comment:

Ah.  Well, pre-3.2 datetime itself did not generate *any* non-naive datetimes.

Nor do you need to specify everything for replace.  dt.replace(tzinfo=tz) 
should work just fine.
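
A short sketch of the calls mentioned in this thread, assuming Python 3.2 or later (where `datetime.timezone` exists). The sample date and the variable names are arbitrary.

```python
import calendar
from datetime import datetime, timezone

naive = datetime(2011, 8, 15, 12, 0, 0)       # no tzinfo attached

# replace() only needs the field that changes; the rest is copied over.
aware = naive.replace(tzinfo=timezone.utc)

# An aware "now" in UTC, as opposed to the naive value utcnow() returns.
now_utc = datetime.now(timezone.utc)

# calendar.timegm() interprets the time tuple as UTC.
epoch_seconds = calendar.timegm(naive.timetuple())

print(aware.isoformat(), epoch_seconds)
```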

--
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed




[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "ignore")'

With patch:
1000 loops, best of 3: 854 usec per loop
1000 loops, best of 3: 509 usec per loop
1000 loops, best of 3: 415 usec per loop

Without patch:
1000 loops, best of 3: 670 usec per loop
1000 loops, best of 3: 470 usec per loop
1000 loops, best of 3: 382 usec per loop

Commands (from the interactive interpreter):
# all valid codepoints
import timeit
b = "".join(chr(c) for c in range(0x110000) if c not in range(0xD800, 0xE000)).encode("utf-8")
b_dec = b.decode
timeit.Timer('b_dec("utf-8")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "surrogateescape")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "replace")', 'from __main__ import b_dec').timeit(100)/100
timeit.Timer('b_dec("utf-8", "ignore")', 'from __main__ import b_dec').timeit(100)/100

With patch:
0.03830226898193359
0.03849360942840576
0.03835036039352417
0.03821949005126953

Without patch:
0.03750091791152954
0.037977190017700196
0.04067679166793823
0.038579678535461424

Commands:
# near-worst case scenario, 1 byte dropped every 5 from a valid utf-8 string
b2 = bytes(c for k,c in enumerate(b) if k%5)
b2_dec = b2.decode
timeit.Timer('b2_dec("utf-8", "surrogateescape")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "replace")', 'from __main__ import b2_dec').timeit(10)/10
timeit.Timer('b2_dec("utf-8", "ignore")', 'from __main__ import b2_dec').timeit(10)/10

With patch:
9.645482301712036
6.602735090255737
5.338080596923828

Without patch:
8.124328684806823
5.804249691963196
4.851014900207519

All tests done on wide 3.2.

Since the changes are about errors, decoding of valid utf-8 strings is not 
affected.  Decoding invalid strings with non-strict error handlers is 
slower, but I don't think the difference is significant.
If the patch is fine I will commit it.
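
For reference, the three error handlers timed above differ in what they do with an undecodable byte. The sketch below uses a made-up byte string with a single invalid byte; it says nothing about timing and does not cover the truncated-sequence cases the patch is about.

```python
data = b'caf\xc3\xa9 \xff end'    # valid UTF-8 except for the byte 0xFF

replaced = data.decode('utf-8', 'replace')          # U+FFFD marks the bad byte
ignored = data.decode('utf-8', 'ignore')            # bad byte silently dropped
escaped = data.decode('utf-8', 'surrogateescape')   # bad byte kept as U+DCFF

# surrogateescape is reversible: encoding restores the original bytes.
round_trip = escaped.encode('utf-8', 'surrogateescape')

print(ascii(replaced), ascii(ignored), ascii(escaped))
```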

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Martin v. Löwis

Martin v. Löwis  added the comment:

A PEP 393 draft implementation is available at 
https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, 
this issue will be outdated: there won't be "narrow" builds of Python anymore 
(nor will there be "wide" builds).

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

That's really good news.
Some Unicode issues can still be fixed on 2.7 and 3.2 though.
FWIW I was planning to look at this and #9200 in the following days and see if 
I can fix them.

--




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

My Firefox is already set at utf-8. More likely a font limitation. I will look 
again after installing one of the fonts Tom suggested.

The pair of boxes on IDLE are for the surrogate pairs. Perhaps tk does not even 
try to display a single char. I will experiment more when I have a more 
complete font.

--




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

New submission from Tom Christiansen :

Unicode character names share a common namespace with formal aliases and with 
named sequences, but Python recognizes only the original name. That means not 
everything in the namespace is accessible from Python.  (If this is construed 
to be an extant bug rather than an absent feature, you probably want to change 
this from a wish to a bug in the ticket.)

This is a problem because aliases correct errors in the original names, and are 
the preferred versions.  For example, ISO screwed up when they called U+01A2 
LATIN CAPITAL LETTER OI.  It is actually LATIN CAPITAL LETTER GHA according to 
the file NameAliases.txt in the Unicode Character Database.  However, Python 
blows up when you try to use this:

% env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
Ƣ

% env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
  File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
Exit 1

This is unfortunate, because the formal aliases correct egregious blunders, such 
as the Standard reading "BRAKCET" instead of "BRACKET":

$ uninames '^\s+%'
 Ƣ  01A2    LATIN CAPITAL LETTER OI
        % LATIN CAPITAL LETTER GHA
 ƣ  01A3    LATIN SMALL LETTER OI
        % LATIN SMALL LETTER GHA
        * Pan-Turkic Latin alphabets
 ೞ  0CDE    KANNADA LETTER FA
        % KANNADA LETTER LLLA
        * obsolete historic letter
        * name is a mistake for LLLA
 ຝ  0E9D    LAO LETTER FO TAM
        % LAO LETTER FO FON
        = fo fa
        * name is a mistake for fo sung
 ຟ  0E9F    LAO LETTER FO SUNG
        % LAO LETTER FO FAY
        * name is a mistake for fo tam
 ຣ  0EA3    LAO LETTER LO LING
        % LAO LETTER RO
        = ro rot
        * name is a mistake, lo ling is the mnemonic for 0EA5
 ລ  0EA5    LAO LETTER LO LOOT
        % LAO LETTER LO
        = lo ling
        * name is a mistake, lo loot is the mnemonic for 0EA3
 ࿐  0FD0    TIBETAN MARK BSKA- SHOG GI MGO RGYAN
        % TIBETAN MARK BKA- SHOG GI MGO RGYAN
        * used in Bhutan
 ꀕ  A015    YI SYLLABLE WU
        % YI SYLLABLE ITERATION MARK
        * name is a misnomer
 ︘  FE18    PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
        % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
        * misspelling of "BRACKET" in character name is a known defect
        #  3017
 𝃅  1D0C5   BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        * misspelling of "FTHORA" in character name is a known defect

There are only 

In Perl, \N{...} grants access to the single, shared, common namespace of 
Unicode character names, formal aliases, and named sequences without 
distinction:

% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
Ƣ
% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
Ƣ

% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'  | uniquote -x
\x{1A2}
% env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
\x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 
of these.  

The third element in this shared namespace, named sequences, consists of 
multiple code points masquerading under one name.  They come from the 
NamedSequences.txt file in the Unicode Character Database.  An example entry is:

LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

There are 418 of these named sequences as of Unicode 6.0.0.  This shows that 
Perl can also access named sequences:

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
  Ā̀

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
  \x{100}\x{300}

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'
  ㇷ゚

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
  \x{31F7}\x{309A}


Since it is a single namespace, it makes sense that all members of that 
namespace should be accessible using \N{...} as a sort of equal-opportunity 
accessor mechanism, and it does not make sense that they not be.

Just make sure you take only the approved named sequences from the 
NamedSequences.txt file. It would be unwise to give users access to the 
provisional sequences located in a neighboring file I shall not name :) because 
those are not guaranteed never to be withdrawn the way the others are, and so 
you would risk introducing an incompatibility.
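
For comparison, the three kinds of names can be tried from Python through `unicodedata.lookup`. This sketch assumes a later interpreter than the 3.2 build shown above: going by the `unicodedata` documentation, support for name aliases and named sequences in `lookup` arrived in 3.3. The variable names are only illustrative.

```python
import unicodedata

# Original character name, formal alias, and named sequence, in that order.
original_name = unicodedata.lookup('LATIN CAPITAL LETTER OI')
formal_alias = unicodedata.lookup('LATIN CAPITAL LETTER GHA')
named_sequence = unicodedata.lookup(
    'LATIN CAPITAL LETTER A WITH MACRON AND GRAVE')

# name() still reports the original (uncorrected) character name.
reported_name = unicodedata.name(formal_alias)

print(ascii(original_name), ascii(formal_alias), ascii(named_sequence))
```

A named sequence comes back as a string of more than one code point, so callers cannot assume that a looked-up name yields a single character.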

If you look at the ICU UCharacter class, you can see that they provide a more

--
component

[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Ezio Melotti

Ezio Melotti  added the comment:

> My Firefox is already set at utf-8.

Every page can specify the encoding it uses (in HTTP headers, <meta> tag and/or 
xml prologue).  If none of these are specified, afaik Firefox tries to detect 
the encoding, and sometimes fails.  What encoding does it show for you in the 
menu when you open the patch?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

>Terry J. Reedy  added the comment:

> My Firefox is already set at utf-8. More likely a font limitation. I
> will look again after installing one of the fonts Tom suggested.

Symbola is best for exotic glyphs, especially astral ones.

Alfios just looks nice as a normal default roman.

--tom

--




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Ezio Melotti

Changes by Ezio Melotti :


--
components: +Unicode
nosy: +ezio.melotti
stage:  -> test needed
versions:  -Python 2.7, Python 3.1, Python 3.2




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

You are right, FF switched on me without notice. Bad FF.
Thank you! What I now see makes much more sense.
[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐑𐐇𐐓"  ],
and I now know to check on other pages (although Tom's Unicode talk slides 
still have boxes even in utf-8, so that must be a font lack).

--




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

>Terry J. Reedy  added the comment:

> You are right, FF switched on me without notice. Bad FF. Thank you! What
> I now see makes much more sense.

>[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐑𐐇𐐓"  ],

> and I now know to check on other pages (although Tom's Unicode talk
> slides still have boxes even in utf-8, so that must be a font lack).

Do you have Symbola installed?  Here's Appendix I on Fonts for things that
should look right for the presentation to look right.  

* I recommend two free fonts from George Douros at users.teilar.gr/~g1951d/
  known to work with this presentation: his Alfios font for regular text,
  and his Symbola font for fancy emoji. If any of these don't look right to
  you, you probably need to supplement your system fonts:

Ligatures: fi ffi ff ffl fl β ẞ ﬅ st
Math letters: 𝒜 𝒟 𝔅 𝔎 𝔼 𝔽
Gothic & Deseret: 𐌸𐌼𐌽𐍂, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
Symbols: ✔ ✅ 🐪 📖 🛂 🐍
Emoticons: 😇 😈 😉 😨 😭 😱
Upside‐down: ¡pɐəɥ ɹnoʎ uo ƃuᴉpuɐʇs ʎq sᴉɥʇ pɐəᴚ
Combining characters: ◌̂,◌̃,◌⃞,◌̲,◌︀,◌̵,◌̷

* The last line with combining characters is especially hard to get to look
  right. You may find that the shareware font Everson Mono works when all
  else fails.

You do need Unicode 5.1 support for the LATIN CAPITAL LETTER SHARP S, and
you need Unicode 6.0 support for most of the emoji (I think Snow Leopard
has colorized versions of these).  The Ligature line above looks good in Alfios.

It turns out it may not always be the font used with combining chars so much
as whether and how well your browser supports true combining characters
dynamically generated, or whether it runs stuff through NFC and looks for
substitution glyphs.  I am not a GUI person, so am mostly just guessing.

But this I find interesting:  If you look at slide 33 of my first talk or
slide 5 of my second talk, which are duplicates entitled Canonical Conundra,
the second column, which is labelled Glyphs, explicitly uses Times New Roman
because of this issue.  Even so you can tell it is doing the NFC trick,
because lines 1+2 have the same NFC of \x{F5} or õ, as do 3+4+5 with \x{22D}
or ȭ, and 6+7 with ō̃.

The glyphs from the first group are both identical, and so are all three of
those in the second group, as both of the first two groups have a single
precomposed character available for their NFC.  In contrast, there is no
single precomposed glyph available for 6+7, and you can tell that it's
stacking it on the fly using slightly less tight grouping rules than the font
has in the precomposed versions above it.
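
The three NFC groups described above can be checked with a short sketch using the standard unicodedata module. The base letter and combining marks are inferred from the code points quoted in the text (\x{F5}, \x{22D}, and the unprecomposed third form); the exact input sequences on the slides are an assumption.

```python
import unicodedata


def nfc(s):
    """Return the NFC (canonical composition) form of s."""
    return unicodedata.normalize("NFC", s)


o_tilde = nfc("o\u0303")               # o + COMBINING TILDE
o_tilde_macron = nfc("o\u0303\u0304")  # o + tilde + COMBINING MACRON
o_macron_tilde = nfc("o\u0304\u0303")  # o + macron + tilde

print(["U+%04X" % ord(c) for c in o_tilde])         # ['U+00F5']
print(["U+%04X" % ord(c) for c in o_tilde_macron])  # ['U+022D']
print(["U+%04X" % ord(c) for c in o_macron_tilde])  # ['U+014D', 'U+0303']
```

The last case stays two code points long because no single precomposed character exists for it, which is why a renderer has to stack the tilde dynamically.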

I use Safari, but I am told Firefox looks ok, too.  Opera is my normal browser
but it does the copout I just described on combining chars without ever being
able to dynamically stack them if the copout fails, so I can't use it for this
presentation.

--tom

  $ uniprops -a 'LATIN CAPITAL LETTER SHARP S' 'DESERET CAPITAL LETTER DEE' 
'GOTHIC LETTER MANNA' 'SNAKE' 'FACE SCREAMING IN FEAR'

U+1E9E <ẞ> \N{LATIN CAPITAL LETTER SHARP S}
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InLatinExtendedAdditional Cased 
Cased_Letter LC Changes_When_Casefolded CWCF
   Changes_When_Casemapped CWCM Changes_When_Lowercased CWL 
Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base
   Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn 
Latin_Extended_Additional Uppercase_Letter Print Upper
   Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum 
X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper
   X_POSIX_Word
Age=5.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L 
Block=Latin_Extended_Additional Canonical_Combining_Class=0
   Canonical_Combining_Class=Not_Reordered CCC=NR 
Canonical_Combining_Class=NR Decomposition_Type=None DT=None
   East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
   Hangul_Syllable_Type=Not_Applicable HST=NA 
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining
   JT=U Joining_Type=U Script=Latin Line_Break=AL Line_Break=Alphabetic 
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN
   NV=NaN Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 
IN=6.0 SC=Latn Script=Latn Sentence_Break=UP
   Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE 
_X_Begin

U+10414 <𐐔> \N{DESERET CAPITAL LETTER DEE}
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC 
Changes_When_Casefolded CWCF
   Changes_When_Casemapped CWCM Changes_When_Lowercased CWL 
Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base
   Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ 
Uppercase_Letter Print Upper Uppercase Word
   XID_Continue XIDC XID_Start XIDS X_POS

[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

Sorry I didn't include a test case. Hope this makes up for it.  If not, please 
tell me how to write better test cases. :(

Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical 
alignment, but it really helps to make what is different from one line to 
the next stand out if the parts that are the same from line to line are at the 
same column every time.

--
Added file: http://bugs.python.org/file22902/nametests.py




[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

Oh whoops, that was the long ticket.  Shall I reupload to the right number?

--




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

Adding Symbola filled in the symbols and emoticons lines.
The gothic chars are still missing even with Alfios.

--




[issue12730] Python's casemapping functions are untrustworthy due to narrow/wide build issues

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

>Terry J. Reedy  added the comment:

>Adding Symbola filled in the symbols and emoticons lines.
>The gothic chars are still missing even with Alfios.

That's too bad, as the Gothic paternoster is kinda cute. :)

Hm, I wonder where I got them from.  I think there must 
be a way to figure that out using the Mac FontBook program,
but I don't know what it is other than pasting them in
the sample screen and scrolling through the fonts to see
how those get rendered.

--tom

--




[issue12746] normalization is affected by unicode width

2011-08-15 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever




[issue9200] Make str methods work with non-BMP chars on narrow builds

2011-08-15 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever




[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

2011-08-15 Thread Tom Christiansen

Tom Christiansen  added the comment:

Here's the right test file for the right ticket.

--
Added file: http://bugs.python.org/file22903/nametests.py




[issue12734] Request for property support in Python re lib

2011-08-15 Thread Tom Christiansen

Changes by Tom Christiansen :


Removed file: http://bugs.python.org/file22902/nametests.py




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Barry A. Warsaw

Barry A. Warsaw  added the comment:

A cheap way of fixing this would be to test for str-ness of localename and if 
it's a unicode, just localename.encode('ascii')

Or is that completely insane?

--
nosy: +barry




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Barry A. Warsaw

Barry A. Warsaw  added the comment:

For example:


diff -r fb49394f75ed Lib/locale.py
--- a/Lib/locale.py Mon Aug 15 14:24:15 2011 +0300
+++ b/Lib/locale.py Mon Aug 15 16:47:23 2011 -0400
@@ -355,6 +355,8 @@
 
     """
     # Normalize the locale name and extract the encoding
+    if isinstance(localename, unicode):
+        localename = localename.encode('ascii')
     fullname = localename.translate(_ascii_lower_map)
     if ':' in fullname:
         # ':' is sometimes used as encoding delimiter.
diff -r fb49394f75ed Lib/test/test_locale.py
--- a/Lib/test/test_locale.py   Mon Aug 15 14:24:15 2011 +0300
+++ b/Lib/test/test_locale.py   Mon Aug 15 16:47:23 2011 -0400
@@ -412,6 +412,11 @@
             locale.setlocale(locale.LC_CTYPE, loc)
             self.assertEqual(loc, locale.getlocale())
 
+    def test_normalize_issue12752(self):
+        # Issue #1813 caused a regression where locale.normalize() would no
+        # longer accept unicode strings.
+        self.assertEqual(locale.normalize(u'en_US'), 'en_US.ISO8859-1')
+
 
 def test_main():
     tests = [

--




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Barry A. Warsaw

Changes by Barry A. Warsaw :


--
keywords: +patch
Added file: http://bugs.python.org/file22904/issue12752.diff




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread STINNER Victor

Changes by STINNER Victor :


--
nosy: +belopolsky




[issue12751] Use macros for surrogates in unicodeobject.c

2011-08-15 Thread STINNER Victor

STINNER Victor  added the comment:

> This has been proposed already in #10542 (the issue also has patches).

The two issues are different: this issue is only a refactoring, whereas #10542 
adds a new "feature" (function/macro: Py_UNICODE_NEXT).
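
For readers unfamiliar with what such surrogate helpers compute, here is a rough Python sketch of the arithmetic. The function names are hypothetical and are not the C macros proposed in either issue; only the UTF-16 arithmetic itself is standard.

```python
def is_high_surrogate(unit):
    """True if a 16-bit code unit is a high (lead) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True if a 16-bit code unit is a low (trail) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def join_surrogates(high, low):
    """Combine a surrogate pair into one code point above U+FFFF."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a code point above U+FFFF into its (high, low) surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


def next_code_point(units, i):
    """Read one code point from a list of UTF-16 code units.

    Returns (code_point, next_index). A lone surrogate is returned as-is.
    """
    unit = units[i]
    if (is_high_surrogate(unit) and i + 1 < len(units)
            and is_low_surrogate(units[i + 1])):
        return join_surrogates(unit, units[i + 1]), i + 2
    return unit, i + 1


# U+10414 DESERET CAPITAL LETTER DEE as it would be stored on a narrow build.
print([hex(u) for u in split_code_point(0x10414)])  # ['0xd801', '0xdc14']
```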

--




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

The proposed resolution looks ok. Another possibility is simply to use .lower() 
if the string is a unicode string, since that will bypass the C locale.
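
The two options can be compared side by side in a small sketch. It is written for Python 3, where `str` is text and `bytes` is the byte string (the reverse of the 2.7 naming used in this thread), and the function and table names are hypothetical, not the ones in Lib/locale.py.

```python
# 256-entry table that lowercases only the ASCII letters A-Z.
ASCII_LOWER_TABLE = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))


def lower_via_encode(localename):
    """Approach of the patch: force text to ASCII bytes, then translate."""
    if isinstance(localename, str):
        localename = localename.encode("ascii")  # raises on non-ASCII input
    return localename.translate(ASCII_LOWER_TABLE).decode("ascii")


def lower_via_str_lower(localename):
    """Alternative: lowercase the text directly, without the byte table."""
    if isinstance(localename, bytes):
        localename = localename.decode("ascii")
    return localename.lower()


print(lower_via_encode("en_US.ISO8859-1"))     # en_us.iso8859-1
print(lower_via_str_lower("en_US.ISO8859-1"))  # en_us.iso8859-1
```

The two agree on ASCII names. They differ on non-ASCII input: the encode route fails loudly, while .lower() applies Unicode case mapping.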

--
nosy: +pitrou
stage: test needed -> patch review




[issue12750] datetime.datetime timezone problems

2011-08-15 Thread Daniel O'Connor

Daniel O'Connor  added the comment:

On 16/08/2011, at 1:06, R. David Murray wrote:
> R. David Murray  added the comment:
> 
> Ah.  Well, pre-3.2 datetime itself did not generate *any* non-naive datetimes.
> 
> Nor do you need to specify everything for replace.  dt.replace(tzinfo=tz) 
> should work just fine.

OK.

I did try this and it seems broken though..
In [19]: now = datetime.datetime.utcnow()

In [21]: now.replace(tzinfo = pytz.utc)
Out[21]: datetime.datetime(2011, 8, 15, 22, 54, 13, 173110, tzinfo=)

In [22]: datetime.datetime.strftime(now, "%s")
Out[22]: '1313414653'

In [23]: now
Out[23]: datetime.datetime(2011, 8, 15, 22, 54, 13, 173110)

[ur 8:22] ~ >date -ujr 1313414653
Mon 15 Aug 2011 13:24:13 UTC

i.e. it appears that replace() applies the TZ offset to a naive datetime object 
effectively assuming it is local time rather than un-timezoned (which is what 
the docs imply to me)
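
A short sketch of what the session above is doing, assuming Python 3.3 or later so that the standard library's datetime.timezone.utc can stand in for pytz.utc (a substitution; the original session used pytz). Two things seem to be in play: replace() returns a new object and leaves `now` untouched, and strftime("%s") is a platform extension that typically treats a naive value as local time. The sketch avoids "%s" and computes the epoch value explicitly.

```python
import calendar
from datetime import datetime, timezone

# The naive value shown in the session, taken to be UTC wall-clock time.
naive = datetime(2011, 8, 15, 22, 54, 13, 173110)

# replace() does not modify `naive`; it returns a new, aware datetime.
aware = naive.replace(tzinfo=timezone.utc)

# Epoch seconds computed by treating the fields as UTC.
epoch_from_naive = calendar.timegm(naive.timetuple())
epoch_from_aware = int(aware.timestamp())

print(naive.tzinfo is None)                # True
print(epoch_from_naive, epoch_from_aware)  # 1313448853 1313448853
```

The value reported in the session, 1313414653, is 34200 seconds (9.5 hours) earlier, which is consistent with the naive value having been read as local time in a UTC+9:30 zone.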

> --
> resolution:  -> invalid
> stage:  -> committed/rejected
> status: open -> closed
> 
> ___
> Python tracker 
> 
> ___
>

--




[issue12754] Add alternative random number generators

2011-08-15 Thread Raymond Hettinger

New submission from Raymond Hettinger :

While keeping the MT generator as the default, add new alternative random 
number generators as drop-in replacements.  Since MT was first introduced, PRNG 
technology has continued to advance.

I'm opening this feature request to be a centralized place to discuss which 
alternatives should be offered.

Since we already have a mostly-good-enough(tm) prng, any new generators need to 
be of top quality, be well researched, and offer significantly different 
performance characteristics than we have now (i.e. speed, cryptographic 
strength, simplicity, smaller state vectors, etc).

At least one of the new generators should be cryptographically strong (both to 
the left and to the right) while keeping reasonable speed for simulation, 
sampling, and gaming apps.  (The speed requirement precludes the likes of Blum 
Blum Shub for example.)  I believe there are several good candidates based on 
stream ciphers or that use block ciphers in a feedback mode.
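
To make "drop-in replacement" concrete, here is a minimal sketch of how an alternative core generator can be plugged in today by subclassing random.Random and overriding random(), seed(), getstate(), setstate() and getrandbits(). The class name is hypothetical, and the xorshift64* core is only a small illustrative example: it is not one of the candidates discussed here and it is not cryptographically strong.

```python
import os
import random


class XorShift64Star(random.Random):
    """Toy alternative generator with a 64-bit state (illustration only)."""

    _MASK = (1 << 64) - 1

    def seed(self, a=None, version=2):
        if a is None:
            a = int.from_bytes(os.urandom(8), "big")
        # The state must be a nonzero 64-bit integer.
        self._state = (int(a) & self._MASK) or 0x9E3779B97F4A7C15
        self.gauss_next = None

    def _next(self):
        """Advance the state and return 64 pseudo-random bits."""
        x = self._state
        x ^= x >> 12
        x ^= (x << 25) & self._MASK
        x ^= x >> 27
        self._state = x
        return (x * 0x2545F4914F6CDD1D) & self._MASK

    def random(self):
        """Float in [0.0, 1.0) built from the top 53 bits."""
        return (self._next() >> 11) * 2.0 ** -53

    def getrandbits(self, k):
        if k < 0:
            raise ValueError("number of bits must be non-negative")
        value, bits = 0, 0
        while bits < k:
            value = (value << 64) | self._next()
            bits += 64
        return value >> (bits - k)

    def getstate(self):
        return self._state

    def setstate(self, state):
        self._state = state


gen = XorShift64Star(2011)
sample = [gen.randrange(100) for _ in range(5)]  # inherited methods work
```

Because the derived methods (randrange, choice, shuffle, gauss and so on) are inherited, a new generator only has to supply the core bit source.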

--
assignee: rhettinger
components: Library (Lib)
messages: 142151
nosy: rhettinger
priority: low
severity: normal
status: open
title: Add alternative random number generators
type: feature request
versions: Python 3.3




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Barry A. Warsaw

Changes by Barry A. Warsaw :


--
assignee:  -> barry




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Roundup Robot

Roundup Robot  added the comment:

New changeset 0d64fe6c737f by Barry Warsaw in branch '2.7':
The simplest possible fix for the regression in bug 12752 by encoding unicodes
http://hg.python.org/cpython/rev/0d64fe6c737f

--
nosy: +python-dev




[issue12752] locale.normalize does not take unicode strings

2011-08-15 Thread Barry A. Warsaw

Changes by Barry A. Warsaw :


--
resolution:  -> fixed
status: open -> closed




[issue12748] Problems using IDLE accelerators with OS X Dvorak - Qwerty ⌘ input method

2011-08-15 Thread Ned Deily

Ned Deily  added the comment:

Interesting, I didn't know the "Dvorak - Qwerty ⌘" input method existed.  In 
just some casual experimentation with it, it seems pretty clear that the input 
method is not being consistently followed by Tk and there seem to be 
differences between Tk 8.4 and 8.5.  Part of the problem, I think, is that some 
of the keyboard accelerators are implemented in menus provided by Tk itself and 
the unrecognized ones are passed on to IDLE.  Unfortunately, Tk does not fully 
implement Input Method text processing although there does seem to be some 
work-in-progress in the Tk community to help fix that, see for instance:

http://sourceforge.net/tracker/?func=detail&aid=3205153&group_id=12997&atid=112997

But the work for that issue, if it does get released, may still have no impact 
on this problem.

If someone is interested in further investigating the issue, I would suggest 
setting up a small Tcl test case using menu accelerators using Tcl/Tk Wish and, 
if the problem can be demonstrated there, opening an issue on the Tk tracker.  
I'm reasonably certain there is nothing to be done about this in Python itself 
but I'm not interested in spending more time to prove that.  So I'm going to 
close this issue as "wont fix".  If anyone has an interest in it, feel free to 
reopen and reassign.

--
assignee: ned.deily -> 
priority: normal -> low
resolution:  -> wont fix
stage:  -> committed/rejected
status: open -> closed
title: IDLE halts on osx when copy and paste -> Problems using IDLE 
accelerators with OS X Dvorak - Qwerty ⌘ input method
type:  -> behavior
versions: +Python 3.2, Python 3.3




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Sturla Molden  added the comment:

George Marsaglia's latest random number generator KISS4691 is worth 
considering, though I am not sure the performance is that different from 
MT19937. 

Here is a link to Marsaglia's post on comp.lang.c. Marsaglia passed away 
shortly after (Feb. 2011), and to my knowledge a paper on KISS4691 was never 
published:

http://www.rhinocerus.net/forum/lang-c/620168-kiss4691-potentially-top-ranked-rng.html

On my laptop, KISS4691 could produce about 110 million random numbers per 
second (148 million if inlined), whereas MT19937 produced 118 million random 
numbers per second. Another user on comp.lang.c reported that (with my 
benchmark) KISS4691 was about twice as fast as MT19937 on his computer. As for 
quality, I have been told that MT19937 only fails a couple of obscure tests 
for randomness, whereas KISS4691 fails no single-seed test.

The source code I used for this test is available here:

http://folk.uio.no/sturlamo/prngtest.zip

(Requires Windows because that's what I use, sorry, might work with winelib on 
Linux though.)

Marsaglia has previously recommended several PRNGs that are considerably 
simpler and faster than MT19937. These are the ones used in the 3rd edition of 
"Numerical Recipes" (yes I know that is not a sign of good quality). We can look 
at them too, with Marsaglia's comments:

https://groups.google.com/group/sci.stat.math/msg/edcb117233979602?hl=en&pli=1

https://groups.google.com/group/sci.math.num-analysis/msg/eb4ddde782b17051?hl=en&pli=1

There are also SIMD-oriented versions of MT19937, though for licensing and 
portability reasons they might not be suitable for Python's standard library.

High-performance PRNGs are also present in the Intel MKL and AMD ACML 
libraries. These could be used if Python was linked against these libraries at 
build-time.

Regards,
Sturla Molden

--
nosy: +sturlamolden




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Sturla Molden  added the comment:

I'm posting the code for comparison of KISS4691 and MT19937. I do realize 
KISS4691 might not be sufficiently different from MT19937 in characteristics 
for Raymond Hettinger to consider it. But at least here it is for reference 
should it be of value.

--
Added file: http://bugs.python.org/file22905/prngtest.zip




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Sturla Molden  added the comment:

Another (bug fix) post by Marsaglia on KISS4691:

http://www.phwinfo.com/forum/comp-lang-c/460292-ensuring-long-period-kiss4691-rng.html

--




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Changes by Sturla Molden :


Removed file: http://bugs.python.org/file22905/prngtest.zip




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Changes by Sturla Molden :


Added file: http://bugs.python.org/file22906/prngtest.zip

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12740] Add struct.Struct.nmemb

2011-08-15 Thread Meador Inge

Meador Inge  added the comment:

On Sun, Aug 14, 2011 at 1:03 PM, Stefan Krah  wrote:
>
> Stefan Krah  added the comment:
>
> I like random tests in the stdlib, otherwise the same thing gets tested
> over and over again. `make buildbottest` prints the seed, and you can do
> it for a single test as well:
>
>  $ ./python -m test -r test_heapq
> Using random seed 5857004
> [1/1] test_heapq
> 1 test OK.

Ah, I see.  Then you can reproduce a run like:

$ ./python -m test -r --randseed=5857004 test_heapq

Perhaps it might be useful to include the failing output in the
assertion message as well
(just in case the seed printing option is not used):

==
FAIL: test_Struct_nmemb (__main__.StructTest)
--
Traceback (most recent call last):
  File "Lib/test/test_struct.py", line 596, in test_Struct_nmemb
self.assertEqual(s.nmemb, n, "for struct.Struct(%s)" % fmt)
AssertionError: 3658572 != 3658573 : for
struct.Struct(378576l?403320c266165pb992937H198961PiIL529090sfh887898d796871B)
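
A small sketch of the idea, under stated assumptions: struct.Struct.nmemb is only a proposal in this issue, so the sketch counts members with len() of an unpacked all-zero buffer instead, and the helper names are hypothetical. The point is that the seed and the generated format both go into the assertion message, so a failure can be reproduced even if the runner did not print its seed.

```python
import random
import struct

# Format codes that yield exactly one value per repeat count.
CODES = "bBhHiIlLqQfd"


def random_format(rng, max_fields=6):
    """Build a random struct format and return (format, expected member count)."""
    fields = []
    expected = 0
    for _ in range(rng.randint(1, max_fields)):
        count = rng.randint(1, 9)
        fields.append("%d%s" % (count, rng.choice(CODES)))
        expected += count
    return "".join(fields), expected


def check_member_count(seed):
    """Run one randomized check; the failure message carries seed and format."""
    rng = random.Random(seed)
    fmt, expected = random_format(rng)
    s = struct.Struct(fmt)
    actual = len(s.unpack(bytes(s.size)))
    assert actual == expected, "seed=%r fmt=%r: %d != %d" % (
        seed, fmt, actual, expected)
    return fmt


# The same seed always regenerates the same format.
print(check_member_count(5857004) == check_member_count(5857004))  # True
```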

--




[issue12672] Some problems in documentation extending/newtypes.html

2011-08-15 Thread Eli Bendersky

Eli Bendersky  added the comment:

Terry, I'm not 100% sure about what you mean by "Python wrapper objects ... 
visible from Python", but I think I'll disagree.

There's a big difference between "C functions" in general and "type methods" 
this document speaks of. Let's leave list aside for a moment, since its being 
built-in complicates matters a bit, and let's talk about the "Noddy" type this 
documentation page plays with.

You may implement "normal" methods for Noddy, such as the "name" method added 
in the first example, by defining an array of PyMethodDef structures and 
assigning it to tp_methods.

On the other hand, the other tp_ fields implement special type methods (used by 
__new__, __str__, getattr/setattr, and so on). This is the major difference. 
Both are C functions, but some implement special type methods and some 
implement "normal" object methods.

If this is also what you meant, I apologize for disagreeing :-)

I believe my latest rephrasing proposal is reflecting the above understanding.

P.S. as for s/that/than/ further down - good catch, will add it to the patch 
when we decide about the first issue

--




[issue12754] Add alternative random number generators

2011-08-15 Thread Sturla Molden

Sturla Molden  added the comment:

Further suggestions to improve the random module:

** Object-oriented PRNG: Let it be an object which stores the random state 
internally, so we can create independent PRNG objects. I.e. not just one global 
generator.

** Generator for quasi-random Sobol sequences.  These fail most statistical 
tests for randomness. But for many practical uses of random numbers, e.g. 
simulations and numerical integration, convergence can be orders of magnitude 
faster (often 1000x faster convergence) but still numerically correct. That is, 
they are not designed to "look random", but fill the sample space as uniformly 
as possible.

** For cryptographically strong random numbers, os.urandom is a good back-end. 
It will use /dev/urandom on Linux and CryptGenRandom on Windows.  We still e.g. 
need a method to convert random bits from os.urandom to floats with given 
distributions. os.urandom is already in the standard library, so there is no 
reason the random module should not use it.

** Ziggurat method for generating normal, exponential and gamma deviates. This 
avoids most calls to transcendental functions (sin, cos, log) in transforming 
from uniform random deviates, and is thus much faster.

** Option to return a buffer (bytearray?) with random numbers. Sometimes we 
don't need Python ints or floats, but rather the raw bytes.

** Support for more statistical distributions. Provide a much broader coverage 
than today. 

** Markov Chain Monte Carlo (MCMC) generator. Provide simple plug-in factories 
for the Gibbs sampler and the Metropolis-Hastings algorithm.
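
Two of the points above can be sketched with what the standard library already offers: independent generator objects via separate random.Random instances, and floats derived from os.urandom bits (which is, as far as I know, also how random.SystemRandom works). The helper names below are hypothetical, and the inversion method shown for the exponential distribution is the plain textbook one, not the Ziggurat method.

```python
import math
import os
import random

# Independent generators: each Random instance carries its own state.
a = random.Random(12345)
b = random.Random(12345)
same_stream = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]


def urandom_float():
    """Uniform float in [0.0, 1.0) built from 53 bits of os.urandom."""
    return (int.from_bytes(os.urandom(7), "big") >> 3) * 2.0 ** -53


def urandom_expovariate(lambd):
    """Exponential deviate by inversion: -ln(1 - u) / lambda."""
    return -math.log(1.0 - urandom_float()) / lambd


# Raw bytes, for callers that do not need Python ints or floats.
raw = os.urandom(16)

print(same_stream)  # True
```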

--




[issue12754] Add alternative random number generators

2011-08-15 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

Please focus your thoughts.

--




[issue12672] Some problems in documentation extending/newtypes.html

2011-08-15 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

"the type object determines which (C) functions get called when, for instance, 
an attribute gets looked up on an object or it is multiplied by another object. 
These C functions are called “type methods”"

"These C functions" are any of the C functions that are members of the type 
object. But they are C-level methods.

"to distinguish them from things like [].append (which we call “object 
methods”)."

[].append is a Python-level method object that wraps a C function.

My revised suggestion is "... in contrast to PyObject that contain C functions, 
such as list.append or [].append."

The only contrast that makes sense to me in this context is between directly 
callable C functions and Py_Objects (which have just been described) that 
contain a C function. I believe that author is addressing Python programmers 
who are used to 'method' referring to Python objects whereas the author wants 
to use 'method' to refer to C functions, which are not Python objects.

Or the sentence could be deleted.

--




[issue12672] Some problems in documentation extending/newtypes.html

2011-08-15 Thread Eli Bendersky

Eli Bendersky  added the comment:

"[].append is a Python-level method object that wraps a C function."

What makes you think that? There's no Python implementation of .append that I 
know of. Neither is there a Python implementation of the Noddy.name method that 
is discussed in the page. Both are implemented solely in C and exposed as 
methods for their respective classes via the tp_methods array.
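
The objects under discussion can be inspected from Python, which may help pin down the terminology. This sketch is CPython-specific and assumes Python 3.7 or later for the names in the types module.

```python
import types

bound = [].append      # looked up on an instance
unbound = list.append  # looked up on the type; comes from the tp_methods table
slot = list.__len__    # wrapper around a type slot (a "type method" above)

print(type(bound).__name__)    # builtin_function_or_method
print(type(unbound).__name__)  # method_descriptor
print(type(slot).__name__)     # wrapper_descriptor

# The bound form remembers the list it was looked up on.
items = []
append = items.append
append(1)
print(append.__self__ is items, items)  # True [1]
```

So both views in the thread have some support: .append has no Python implementation, yet what Python code sees is an object wrapping the C function.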

"Or the sentence could be deleted."

This could be problematic, because the document does refer to "type methods" on 
several occasions, and it makes sense to define what it means.

--




[issue12755] Service application crash in python25!PyObject_Malloc

2011-08-15 Thread Chandra Sekhar Reddy

New submission from Chandra Sekhar Reddy :

Service application crashed in python25.dll, below are the environment details.

Operating System : Windows server 2008 R2 (Virtual Machine)
Application Type : Service Application

FAULTING_IP: 
python25!PyObject_Malloc+2d
1e09603d 8b30mov esi,dword ptr [eax]

EXCEPTION_RECORD:   -- (.exr 0x)
ExceptionAddress: 1e09603d (python25!PyObject_Malloc+0x002d)
   ExceptionCode: c005 (Access violation)
  ExceptionFlags: 
NumberParameters: 2
   Parameter[0]: 
   Parameter[1]: 
Attempt to read from address 

PROCESS_NAME:  adem.exe

ADDITIONAL_DEBUG_TEXT:  
Use '!findthebuild' command to search for the target build information.
If the build information is available, run '!findthebuild -s ; .reload' to set 
symbol path and load symbols.

FAULTING_MODULE: 76f8 ntdll

DEBUG_FLR_IMAGE_TIMESTAMP:  4625bfe5

ERROR_CODE: (NTSTATUS) 0xc005 - The instruction at 0x%08lx referenced 
memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc005 - The instruction at 0x%08lx referenced 
memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1:  

EXCEPTION_PARAMETER2:  

READ_ADDRESS:   

FOLLOWUP_IP: 
python25!PyObject_Malloc+2d
1e09603d 8b30            mov     esi,dword ptr [eax]

FAULTING_THREAD:  2474

BUGCHECK_STR:  
APPLICATION_FAULT_INVALID_POINTER_WRITE_NULL_POINTER_WRITE_NULL_POINTER_READ_WRONG_SYMBOLS

PRIMARY_PROBLEM_CLASS:  INVALID_POINTER_WRITE_NULL_POINTER_WRITE

DEFAULT_BUCKET_ID:  INVALID_POINTER_WRITE_NULL_POINTER_WRITE

LAST_CONTROL_TRANSFER:  from 1e0c1093 to 1e09603d

STACK_TEXT:  
WARNING: Stack unwind information not available. Following frames may be wrong.
0505f088 1e0c1093 0025 04a128ea 04a128d0 python25!PyObject_Malloc+0x2d
     
python25!PyString_FromStringAndSize+0x43


STACK_COMMAND:  ~4s; .ecxr ; kb

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  python25!PyObject_Malloc+2d

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: python25

IMAGE_NAME:  python25.dll

BUCKET_ID:  WRONG_SYMBOLS

FAILURE_BUCKET_ID:  
INVALID_POINTER_WRITE_NULL_POINTER_WRITE_c005_python25.dll!PyObject_Malloc

--
components: Windows
messages: 142163
nosy: chandra
priority: normal
severity: normal
status: open
title: Service application crash in python25!PyObject_Malloc
type: crash




[issue12750] datetime.datetime timezone problems

2011-08-15 Thread Ben Finney

Changes by Ben Finney :


--
nosy: +bignose




[issue12756] datetime.datetime.utcnow should return a UTC timestamp

2011-08-15 Thread Ben Finney

New submission from Ben Finney :

=
$ date -u +'%F %T %s %z'
2011-08-16 06:42:12 1313476932 +

$ python -c 'import sys, datetime; now = datetime.datetime.utcnow(); 
sys.stdout.write(now.strftime("%F %T %s %z"))'
2011-08-16 06:42:12 1313440932 
=

The documentation for ‘datetime.datetime.utcnow’ says “Return a new datetime 
representing UTC day and time.” The resulting object should be in the UTC 
timezone, not a naive no-timezone value.

This results in programs specifically requesting UTC time with ‘utcnow’, but 
then Python treating the return value as representing local time since it is 
not marked with the UTC timezone.
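
The difference between the naive value and a timezone-aware one can be sketched as follows. This is only an illustration, assuming Python 3.2 or later, where ‘datetime.timezone.utc’ is available; the helper ‘naive_utc_now’ is a hypothetical name, and the warning filter is there only because newer Python versions deprecate ‘utcnow’.

```python
import datetime
import warnings


def naive_utc_now():
    """Hypothetical helper: call utcnow() while ignoring its deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return datetime.datetime.utcnow()


naive = naive_utc_now()
# Assumes Python 3.2+: request the current time with an explicit UTC tzinfo.
aware = datetime.datetime.now(datetime.timezone.utc)

print(naive.tzinfo)                # None: no timezone attached
print(repr(naive.strftime("%z")))  # '' because the value is naive
print(aware.strftime("%z"))        # +0000
```

The naive value carries no offset, so anything that later needs one (such as ‘%s’ above) has to assume a timezone on its own.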

--
components: Library (Lib)
messages: 142164
nosy: Daniel.O'Connor, bignose, r.david.murray
priority: normal
severity: normal
status: open
title: datetime.datetime.utcnow should return a UTC timestamp
type: feature request
versions: Python 2.7, Python 3.2
