[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 00:49, Erlend E. Aasland wrote:
> I see that PEP 249 does not define the semantics of lastrowid for
> executemany(). What's the precedence here, MAL? IMO, it would be nice to be
> able to fetch the last row id after you've done a 1000 inserts using
> executemany().

The DB-API deliberately leaves this undefined, since there are many ways you
could implement this, e.g.

- return the last row id for the last entry in the array passed to .executemany()
- return the last row id of the last actually modified/inserted row after running .executemany()
- return an array of row ids, one for each row modified/inserted
- return a row id of one of the modified/inserted rows, without defining which
- always return None for .executemany()

Note that in some cases, the order of actions taken by the database is not
predefined (e.g. some databases run such inserts in chunks across a cluster),
so even the "last" semantics are not clear.

> So, another option would be to keep "set-lastrowid" in the query loop, and
> just remove the condition; we set it every time (but of course only build a
> PyObject of it when the loop is done).

Since the DB-API leaves this undefined, you are free to add your own meaning
which makes most sense for SQLite, to always return None, or to not implement
it at all. DB-API extensions such as Cursor.lastrowid are optional and don't
need to be implemented if they don't make sense for a particular use case:

https://www.python.org/dev/peps/pep-0249/#optional-db-api-extensions
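To make the practical consequence concrete, here is a minimal sketch using the
stdlib sqlite3 module purely as an example DB-API module: whatever value
lastrowid holds after executemany() is module-specific and should not be
relied upon by portable code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t (x) VALUES (?)", [(1,), (2,), (3,)])
# PEP 249 leaves lastrowid undefined for executemany(); the value seen here
# is specific to the sqlite3 module and may differ in other DB-API modules.
print(cur.lastrowid)
```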
[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 11:02, Erlend E. Aasland wrote:
> Thank you for your input Marc-André.
>
> For SQLite, it's pretty simple: we use an API called
> sqlite3_last_insert_rowid() which takes the database connection as it's
> argument, not a statement pointer. This function returns "the rowid of the
> most recent successful INSERT into a rowid table or virtual table on
> database connection" (quote from SQLite docs). IMO, it would make sense to
> also use this post executemany().

Sounds like a plan.

If possible, it's usually better to have the .executemany() create a cursor
with an output array providing the row ids, e.g. using "INSERT ...
RETURNING ..." (PostgreSQL). That way you can access all row ids and can also
provide the needed detail in case the INSERTs happen out of order to map them
to the input data.

For cases where you don't need sequence IDs, it's often better to not rely on
auto-increment columns for IDs, but instead use random pre-generated IDs.
Saves roundtrips to the database and works nicely with cluster databases as
well.
[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 21:02, Erlend E. Aasland wrote:
>> If possible, it's usually better to have the .executemany() create a
>> cursor with an output array providing the row ids, e.g. using "INSERT ...
>> RETURNING ..." (PostgreSQL). That way you can access all row ids and
>> can also provide the needed detail in case the INSERTs happen out of
>> order to map them to the input data.
>
> Hm, maybe so. But it will add a lot of overhead and complexity to
> executemany(), and there haven't been requests for this feature for sqlite3.
> AFAIK, there hasn't been request for lastrowid for executemany() at all.
> OTOH, my proposal of modifying lastrowid to always show the rowid of the
> actual last inserted row is a very cheap operation, _and_ it simplifies the
> code (=> increased maintainability), so I think I'll go for that.

Sorry, I wasn't suggesting this for SQLite; it's just a better and more
flexible option than using cursor.lastrowid where available. Sadly, the PG
extension is not standards-conforming SQL.

>> For cases where you don't need sequence IDs, it's often better to
>> not rely on auto-increment columns for IDs, but instead use random
>> pre-generated IDs. Saves roundtrips to the database and works nicely
>> with cluster databases as well.
>
> Yes, but in those cases you keep track of the row id yourself, so you
> probably won't need lastrowid ;)

Indeed, and that's the point :-) Those auto-increment ID fields are not
really such a good idea to begin with. Either you know that you will need to
manipulate the rows after inserting them (in which case you can set an ID) or
you don't care about the individual rows and only want to aggregate or search
in them based on other fields.
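As an illustration of the pre-generated ID approach (a sketch, not tied to
any particular schema; the table and column names are made up):

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")

# Generate the IDs client-side; no round-trip or lastrowid needed afterwards.
rows = [(uuid.uuid4().hex, name) for name in ("alpha", "beta", "gamma")]
con.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
con.commit()
```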
[issue12756] datetime.datetime.utcnow should return a UTC timestamp
Marc-Andre Lemburg added the comment:

Hi Tony,

from practical experience, it is a whole lot better to not deal with
timezones in data processing code at all, but instead only use naive UTC
datetime values everywhere, except when you have to prepare reports or output
which has a requirement to show datetime values in local time or some
specific timezone.

You convert all datetime values into UTC upon input, possibly store the
timezone somewhere, if that is relevant for later reporting, and then forget
about timezones. Your code will run faster, become a lot easier to understand
and you avoid many pitfalls that TZs have, esp. when TZs are silently dropped
when interfacing to e.g. numeric code, databases or other external code.

There's a reason why cloud code (and a lot of other code, such as data
science code) has standardized on UTC :-)

Cheers,
--
Marc-Andre Lemburg
eGenix.com

--
nosy: +lemburg
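A minimal sketch of that convention (normalize at the input boundary, keep
naive UTC internally; the helper name is only illustrative):

```python
from datetime import datetime, timedelta, timezone

def to_naive_utc(dt):
    """Normalize an incoming datetime to naive UTC for internal processing."""
    if dt.tzinfo is None:
        return dt  # assumed to already be naive UTC
    return dt.astimezone(timezone.utc).replace(tzinfo=None)

# Input boundary: the source timezone is known here and can be stored
# separately if it is needed for reporting later on.
incoming = datetime(2022, 1, 10, 9, 30, tzinfo=timezone(timedelta(hours=1)))
internal = to_naive_utc(incoming)   # datetime(2022, 1, 10, 8, 30)
```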
[issue46338] libc_ver() runtime error when sys.executable is empty
Marc-Andre Lemburg added the comment:

On 10.01.2022 23:01, Allie Hammond wrote:
> libc_ver() in platform.py (called from platform()) causes a runtime error if
> sys.executable returns null. In my case, FreeRADIUS offers a module
> rlm_python3 which allows you to run python code from the C based FreeRADIUS
> server - since this module doesn't use a python binary to execute
> sys.executable returns null trigering this error.

Interesting. I guess rlm_python3 embeds Python.

Is sys.executable an empty string or None ?
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 08.01.2022 21:56, Erlend E. Aasland wrote:
> Marc-André: since Python 3.6, the sqlite3.Cursor.lastrowid attribute does no
> longer comply with the recommendations of PEP 249:
>
> Previously, lastrowid was set to None for operations other than INSERT or
> REPLACE. This changed with ab994ed8b97e1b0dac151ec827c857f5e7277565 (in
> Python 3.6), so that lastrowid is _unchanged_ for operations other than
> INSERT or REPLACE, and it is set to 0 after the first valid SQL (that is not
> INSERT/REPLACE) is executed on the cursor.
>
> Now, PEP 249 only _recommends_ that lastrowid is set to None for operations
> that do not modify a row, so it's probably not a big deal. No-one has ever
> mentioned this change in behaviour; there have been no bug reports.
>
> FTR, here is the relevant quote from PEP 249:
>
> If the operation does not set a rowid or if the database does not support
> rowids, this attribute should be set to None.
>
> (I interpret "should" as understood by RFC 2119.)

Well, it may be a little stronger than the SHOULD in the RFC, but then again
the whole DB-API is about conventions and if they don't make sense for a
database backend, it is possible to deviate from the spec, esp. for optional
extensions such as .lastrowid.

> So, my follow-up question becomes:
> I see no point in reverting to pre Python 3.6 behaviour. I would rather
> change the default value to be 0 (to get rid of the dirty flag in GH-30380),
> and to make the behaviour more consistent with how the actual SQLite API
> behaves.
>
> Do you have an opinion about such a change (in behaviour)?

Is 0 a valid row ID in SQLite ? If not, then I guess this would be an
alternative to None as suggested by the DB-API.

If it is a valid row ID, I'd suggest to go back to resetting to None, since
otherwise code might get confused: if an UPDATE does not get applied (e.g. a
condition is false), code could then still take .lastrowid as referring to
the UPDATE and not a previous operation, since code will not know whether the
condition was met or not.

--
Marc-Andre Lemburg
eGenix.com

--
title: [sqlite3] lastrowid improvements -> [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 11.01.2022 20:46, Erlend E. Aasland wrote:
> If we are to revert to this behaviour, we'll have to start examining the SQL
> we are given (search for INSERT and REPLACE keywords, determine if they are
> valid (i.e. not a comment, not part of a column or table name, etc.), which
> will lead to a noticeable performance hit for every new statement (not for
> statements reused via the LRU cache though). I'm not sure this is a good
> idea. However I will give it a good thought.
>
> My first thought now, is that it would be better for the sqlite3 module to
> align lastrowid with the behaviour of the C API sqlite3_last_insert_rowid()
> (also available as an SQL function: last_insert_rowid). OTOH, the SQLite API
> is tied to the _connection_ object, so it may not make sense to align it
> with lastrowid which is a _cursor_ attribute.

I've had a look at the API description and find it less than useful, to be
honest:

https://sqlite.org/c3ref/last_insert_rowid.html

You don't know on which cursor the last row was inserted, it's possible that
this was or is done by a trigger, and the last row id is not updated in case
the INSERT does not succeed for some reason, leaving it unchanged - without
the user getting a notification of this failure, since the .execute() call
itself will succeed for e.g. "INSERT INTO table SELECT ...;". It also seems
that the function really only works for INSERTs and not for UPDATEs.

> Perhaps the Right Thing To Do™ is to be conservative and just leave it as it
> is. I still want to apply the optimisation, though. It does not alter the
> behaviour in any kind of way, and it speeds up executemany().

I'd suggest to deprecate the cursor.lastrowid attribute and instead point
people to the much more useful

"INSERT INTO t (name) VALUES ('two'), ('three') RETURNING ROWID;"

https://sqlite.org/lang_insert.html
https://sqlite.org/forum/forumpost/058ac49cc3

(good to know that SQLite has adopted this PostgreSQL variant as well)

RETURNING is also available for UPDATEs:

https://sqlite.org/lang_update.html

If people really want to use the sqlite3_last_insert_rowid() functionality,
they can use the SQL function of the same name:

https://www.sqlite.org/lang_corefunc.html#last_insert_rowid

which then has known semantics and doesn't conflict with the DB-API specs.

But this is your call :-)
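A small sketch of the RETURNING variant through the sqlite3 module; this
assumes the underlying SQLite library is 3.35 or newer, where RETURNING was
introduced:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")

# Fetch the generated rowids directly from the INSERT statement.
cur = con.execute(
    "INSERT INTO t (name) VALUES ('two'), ('three') RETURNING rowid"
)
print(cur.fetchall())        # e.g. [(1,), (2,)]

# The SQL-level last_insert_rowid() function, with its connection-wide semantics.
print(con.execute("SELECT last_insert_rowid()").fetchone())
```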
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 11.01.2022 21:30, Erlend E. Aasland wrote:
>> I'd suggest to deprecate the cursor.lastrowid attribute and
>> instead point people to the much more useful [...]
>
> Yes, I think mentioning the RETURNING ROWID trick in the sqlite3 docs is a
> very nice improvement. Mentioning the last_insert_rowid SQL function is
> probably also worth consideration.
>
> I'm reluctant to deprecate cursor.lastrowid, though. ATM, I'm leaning
> towards just keeping the current behaviour.

Fair enough :-)

Perhaps just documenting that the value is not necessarily what people may
expect, when coming from other databases, due to the different semantics with
SQLite, is enough.

--
Marc-Andre Lemburg
eGenix.com
[issue45382] platform() is not able to detect windows 11
Marc-Andre Lemburg added the comment:

On 26.01.2022 01:29, Eryk Sun wrote:
>> Bit wmic seems nice solution.
>> Is still working for windows lower than 11?
>
> wmic.exe is still included in Windows 10 and 11, but it's officially
> deprecated [1], which means it's no longer being actively developed, and it
> might be removed in a future update. PowerShell is the preferred way to use
> WMI.

All of these require shelling out to the OS, so why not stick to `ver` as
we've done in the past. This has existed for ages and will most likely not
disappear anytime soon.

Is there a good reason to prefer wmic or PowerShell (which are less likely to
be available or reliable) ?

> ---
> [1] https://docs.microsoft.com/en-us/windows/deployment/planning/windows-10-deprecated-features

--
Marc-Andre Lemburg
eGenix.com
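For reference, a rough sketch of what shelling out to ver amounts to
(Windows-only; ver is a cmd.exe builtin, hence shell=True, and the sample
output format is only indicative):

```python
import subprocess

# Windows only: "ver" is built into cmd.exe, so it has to run through a shell.
banner = subprocess.check_output("ver", shell=True, text=True).strip()
print(banner)   # e.g. "Microsoft Windows [Version 10.0.xxxxx.yyy]"
```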
[issue46659] Deprecate locale.getdefaultlocale() function
Marc-Andre Lemburg added the comment:

> For these reasons, I propose to deprecate locale.getdefaultlocale():
> setlocale(), getpreferredencoding() and getlocale() should be used instead.

Please see the discussion on https://bugs.python.org/issue43552:
locale.getpreferredencoding() needs to be deprecated as well. Instead we
should have a single locale.getencoding() as outlined there... perhaps in a
separate ticket ?!

Thanks.

--
nosy: +lemburg
[issue46662] Lib/sqlite3/dbapi2.py: convert_timestamp function failed to correctly parse timestamp
Marc-Andre Lemburg added the comment:

On 08.02.2022 11:54, Erlend E. Aasland wrote:
> The sqlite3 timestamp converter is buggy, as already noted in the docs[^1].
> Adding timezone support is out of the question[^2][^3][^4][^5], but fixing
> it to be able to discard any attached timezone info _may_ be ok; at first
> sight, I don't see how this could break existing applications (like, for
> example adding time zone support could do). I need to think it through.

I think it's better to deprecate these converters and let users implement
their own.
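A sketch of what such a user-defined converter could look like (the converter
name and behaviour here are only illustrative; this one parses ISO 8601 and
normalizes any attached offset to naive UTC):

```python
import sqlite3
from datetime import datetime, timezone

def my_timestamp_converter(raw: bytes) -> datetime:
    # Parse an ISO 8601 value; normalize any attached offset to naive UTC.
    dt = datetime.fromisoformat(raw.decode())
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt

sqlite3.register_converter("timestamp", my_timestamp_converter)
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE log (ts timestamp)")
con.execute("INSERT INTO log VALUES ('2022-02-08 11:54:00+01:00')")
print(con.execute("SELECT ts FROM log").fetchone()[0])  # 2022-02-08 10:54:00
```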
[issue46659] Deprecate locale.getdefaultlocale() function
Marc-Andre Lemburg added the comment: Thanks, Victor.
[issue1771381] bsddb can't use unicode keys
Marc-Andre Lemburg added the comment: Unassigning since I don't know the details of bsddb. -- assignee: lemburg ->
[issue225476] Codec naming scheme and aliasing support
Marc-Andre Lemburg added the comment: Closing this request as the encodings package search function should not be used to import external codecs (this poses a security risk). -- status: open -> closed
[issue880951] "ez" format code for ParseTuple()
Marc-Andre Lemburg added the comment: Closing. There doesn't seem to be much interest in this. -- status: open -> closed
[issue1001895] Adding missing ISO 8859 codecs, especially Thai
Marc-Andre Lemburg added the comment: Not sure why this is still open. The patches were checked in a long time ago. -- status: open -> closed
[issue547537] cStringIO should provide a binary option
Marc-Andre Lemburg added the comment: Unassigning: I've never had a need for this in the past years. -- assignee: lemburg ->
[issue883466] quopri encoding & Unicode
Marc-Andre Lemburg added the comment: Georg: Yes, sure.
[issue1528802] Turkish Character
Marc-Andre Lemburg added the comment:

Unassigning this. Unless someone provides a patch to add context sensitivity
to the Unicode upper/lower conversions, I don't think anything will change.

The mapping you see in Python (for Unicode) is taken straight from the
Unicode database and there's nothing we can or want to do to change those
predefined mappings. The 8-bit string mappings OTOH are taken from the
underlying C library - again nothing we can change.

--
assignee: lemburg ->
[issue1071] unicode.translate() doesn't error out on invalid translation table
Marc-Andre Lemburg added the comment: Nice idea, but why don't you use a dictionary iterator (PyDict_Next()) for the fixup ?
[issue1071] unicode.translate() doesn't error out on invalid translation table
Marc-Andre Lemburg added the comment: Ah, I hadn't noticed that you're actually manipulating the input dictionary. You should create a copy and fix that instead of changing the dict that the user passed in to the function. You can then use PyDict_Next() for fast iteration over the original dictionary.
[issue1505257] winerror module
Marc-Andre Lemburg added the comment:

The winerror module should really be coded in C. Otherwise you don't benefit
from the lookup object approach. The files I uploaded only serve as a basis
for such a C module.

Would be great if you could find someone to write such a module - preferably
using a generator that creates it from the header files.

I'm not interested in this anymore, so feel free to drop the whole idea.
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment:

A couple of notes:

* platform.uname() needs to be fixed, not the individual query functions.
* The third entry of uname() should return "Vista" instead of "Microsoft" on MS Vista.
* A patch should go on trunk and into 2.5.2, since this is a real bug and not a feature change.

Any other changes to accommodate differences between used marketing names and
underlying OS names should go into system_alias().
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment: Yes, please. Thanks. -- assignee: lemburg -> jafo
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment:

Pat, we already have system_alias() for exactly the purpose you suggested.

Software relying on platform.system() reporting "Vista" will have to use
Python 2.5.2 as the minimum Python version requirement - pretty much the same
as with all other bug fixes.
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Marc-Andre Lemburg added the comment:

My name appears in that Makefile because I wrote it and used it to create the
charmap codecs.

The reason why the Mac Japanese codec was not created for 2.x was the size of
the mapping table. Ideal would be to have the C version of the CJK codecs
support the Mac Japanese encoding as well. Adding back the charmap based Mac
Japanese codec would be a compromise.

The absence of the Mac Japanese codec causes (obvious) problems for many
Japanese Python users running Mac OS X.
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Marc-Andre Lemburg added the comment: Adding Python 2.6 as version target. -- versions: +Python 2.6
[issue1399] XML codec
Marc-Andre Lemburg added the comment:

Nice codec ! The only nit I have is the name: "xml" isn't intuitive enough. I
had to read the code to figure out what the codec actually does. "xml" used
as an encoding name usually refers to having Unicode text converted to ASCII
with XML entity escapes for all non-ASCII characters.

How about "xml-auto-detect" or something along those lines ?!

--
nosy: +lemburg
[issue1399] XML codec
Marc-Andre Lemburg added the comment:

Leaving the module name as "xml" would remove that name from the namespace of
possible encodings. "xml" as encoding name is problematic, as many people
regard writing data in XML as "encoding the data in XML".

I'd simply not use it at all, not even for a codec that converts between
Unicode and ASCII+XML entities.
[issue1399] XML codec
Marc-Andre Lemburg added the comment: Thanks, Walter !
[issue1234] semaphore errors on AIX 5.2
Marc-Andre Lemburg added the comment: The problem is also present in Python 2.4 and 2.3. Confirmed on AIX 5.3. -- nosy: +lemburg versions: +Python 2.3, Python 2.4
[issue1433] marshal roundtripping for unicode
Marc-Andre Lemburg added the comment:

I think you have a wrong understanding of round-tripping. In Unicode it is
really irrelevant if you're using a UCS2 surrogate pair or a UCS4
representation to describe a code point. The length of the Unicode
representation may change, but the meaning won't, so you don't lose any
information.

--
nosy: +lemburg
[issue1620174] Improve platform.py usability on Windows
Marc-Andre Lemburg added the comment: Rejecting the patch, since it hasn't been updated. -- resolution: -> rejected status: open -> closed
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Marc-Andre Lemburg added the comment:

Tom Christiansen wrote:
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
>
> • Noncharacter code points are reserved for internal use, such as for
>   sentinel values. They should never be interchanged. They do, however,
>   have well-formed representations in Unicode encoding forms and survive
>   conversions between encoding forms. This allows sentinel values to be
>   preserved internally across Unicode encoding forms, even though they are
>   not designed to be used in open interchange.
>
> And here from the Unicode Standard's chapter on Conformance, section 3.2,
> p. 59:
>
> C2 A process shall not interpret a noncharacter code point as an
>    abstract character.
>
> • The noncharacter code points may be used internally, such as for
>   sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's up to
the applications to conform to Unicode or not and the application also
defines what "exchange" means in the above context.

Python itself needs to be able to deal with assigned non-character code
points as well as unassigned code points or code points that are part of
special ranges such as the surrogate ranges.

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because we have a
way to optionally allow these via an error handler, but -1 on making changes
that cause full range round-trip safety of the UTF encodings to be lost
without a way to turn the functionality back on.

--
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
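A quick Python 3 illustration of the error-handler escape hatch mentioned
above (a sketch; surrogatepass is the stdlib handler intended for exactly
this purpose):

```python
# Strict UTF-8 rejects lone surrogates ...
s = "\ud800"                      # a lone surrogate code point
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    pass                          # rejected by the strict codec

# ... but the optional error handler keeps full-range round-trips possible.
data = s.encode("utf-8", "surrogatepass")   # b'\xed\xa0\x80'
assert data.decode("utf-8", "surrogatepass") == s
```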
[issue12508] Codecs Anomaly
Marc-Andre Lemburg added the comment:

The final parameter is an extension to the decoder API signature, so it's not
surprising that not all codecs implement it. The ones that do should use it
for all calls, since that way the actual consumed number of bytes is
correctly reported back to the StreamReader instance.

Note: The parameter name "final" is a bit misleading. What happens is that
the number of bytes consumed by the decoder were previously always reported
as len(buffer), since the C API for decoders did not provide a way to report
back the number of bytes consumed. This was changed when stateful decoders
were added to the C API, since these do allow reporting back the consumed
bytes. A more appropriate name for the parameter would have been
"report_bytes_consumed".
[issue13136] speed-up conversion between unicode widths
Marc-Andre Lemburg added the comment:

Antoine Pitrou wrote:
> New submission from Antoine Pitrou :
>
> This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
>
> Example micro-benchmark:
>
> ./python -m timeit -s "a='x'*1;b='\u0102'*1000;c='\U0010'" "a+b+c"
>
> -> before:
> 10 loops, best of 3: 14.9 usec per loop
> -> after:
> 10 loops, best of 3: 9.19 usec per loop

Before going further with this, I'd suggest you have a look at your compiler
settings. Such optimizations are normally performed by the compiler and don't
need to be implemented in C, making maintenance harder.

The fact that Windows doesn't exhibit the same performance difference
suggests that the optimizer is not using the same level or feature set as on
Linux. MSVC is at least as good at optimizing code as gcc, often better.

I tested using memchr() when writing those "naive" loops. It turned out that
using memchr() was slower than using the direct loops. memchr() is inlined by
the compiler just like the direct loop and the generated code for the direct
version is often easier to optimize for the compiler than the memchr() one,
since it receives more knowledge about the used data types.

--
nosy: +lemburg
[issue13134] speed up finding of one-character strings
Marc-Andre Lemburg added the comment:

[Posted the reply to the right ticket; see issue13136 for the original post
to the wrong ticket]

Antoine Pitrou wrote:
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c

Which gcc version are you using ? Is it possible that you have -fno-builtin
enabled ?

>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automically.

When using gcc, the glibc functions are usually not used at all, since gcc
comes with a (rather large) set of builtins which are inlined directly, if
you have optimizations enabled and inlining is found to be more efficient
than calling the glibc function:

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

glibc includes the optimized versions since it has to implement the C library
(obviously) and for cases where inlining does not happen.

>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.
>
>> memchr()
>> is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

See http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html and the assembler
output. If it's not inlined, then something must be preventing this and it
would be good to find out why.

>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

There is: the compiler will have alignment information available and can also
benefit from using registers instead of the stack, knowledge about processor
cache lines, etc. Such information is lost when calling a function. The
function call itself will also create some overhead.

BTW: You should not only test the optimization with long strings, but also
with short ones (e.g. 2-15 chars) - which is a much more common case in
practice.

--
nosy: +lemburg
[issue13136] speed-up conversion between unicode widths
Marc-Andre Lemburg added the comment:

Antoine Pitrou wrote:
>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.

Looks like I posted the comment to the wrong ticket.
[issue12619] Automatically regenerate platform-specific modules
Marc-Andre Lemburg added the comment:

I don't see why these modules should be auto-generated. The constants in the
modules hardly ever change and are also not affected by architecture
differences (e.g. Mac OS X, Solaris, etc.) AFAICT.

If you think they need to be auto-generated, you should make a case by
example.

Note that we cannot simply drop the modules. Some of the constants are needed
for e.g. socket, ctypes or ldd programming.
[issue12619] Automatically regenerate platform-specific modules
Marc-Andre Lemburg added the comment:

STINNER Victor wrote:
>> you should make a case by example
>
> Did you read comments of this issue and my email thread on python-dev?

No.

> There are differents examples:
>
> - LONG_MAX is 9223372036854775807 even on 32 bits system
> - On Mac OS X, FAT programs contains 32 and 64 binaries, whereas constants
>   are changed for 32 or 64 bits

That's because the h2py.py script doesn't know anything about conditional
definitions, e.g.

PTRDIFF_MIN = (-9223372036854775807-1)
PTRDIFF_MAX = (9223372036854775807)
PTRDIFF_MIN = (-2147483647-1)
PTRDIFF_MAX = (2147483647)

Not all constants are really useful because of this, but some are, e.g.
INT16_MAX, INT32_MAX, etc.

>> we cannot simply drop the modules. Some of the constants
>> are needed for e.g. socket, ctypes or ldd programming.
>
> Ah? I removed all plat-* directories and ran the full test suite: no failure.

Right, the modules are not tested at all, AFAIK.

> The Python socket modules contain many constants (SOCK_*, AF_*, SO_*, ...):
> http://docs.python.org/library/socket.html#socket.AF_UNIX

True, but you probably agree that it's easier to parse a header file into a
Python module than to add each and every socket option on the planet to the C
socket module, don't you ? :-)

> Which constants are used by the ctypes modules or can be used by modules
> using ctypes? Can you give examples? I listed usages of plat-* modules in
> the first message of my thread on python-dev.

Not constants used by the ctypes, but constants which can be used with the
ctypes module.

> By "ldd", you mean "ld.so" (dlopen)?

Yes.

> Yes, I agree that we need to expose dl constants. But the other constants
> are not used.

Not in the standard lib, that's true.

Also note that the plat-* directories can contain platform specific code,
e.g. the OS2 dirs have replacements for the pwd and grp modules.

Finally, the list of standard files to include in those directories could be
extended to cover more system level constants such as the ioctl or fcntl
constants (not only the few defined in the C code, but all platform specific
ones).
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
> The error comes from the way Python computes timezone and daylight: it
> queries the tm_gmtoff of two timestamps, one close to the first of January,
> the other close to the first of July. But last January the previous
> definition of the timezone was still in force... and indeed, when I changed
> the code to use *next* January instead, I have the expected values.
>
> Is there an algorithm that gives the correct answer? Taking the 1st of
> January closest to the current date would not work either. Or is there
> another way (in portable C) to approach timezones?

A fairly "correct" way is to query the time zone database at time module
import time by using the DST and GMT offset of that time.

IMO time.timezone and time.daylight should be deprecated since they will give
wrong results around DST changes (both switch times and legal changes such as
the described one) in long running processes such as daemons.
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
>> A fairly "correct" way is to query the time zone database at time module
>> import time by using the DST and GMT offset of that time.
>
> But that does not give the *other* timezone :-(

Which other timezone ? You set time.timezone to the GMT offset of the import
time and then subtract another 3600 seconds in case tm_isdst is set.

>> IMO time.timezone and time.daylight should be deprecated since they
>> will give wrong results around DST changes (both switch times and
>> legal changes such as the described one) in long running processes
>> such as daemons.
>
> time.timezone is the non-DST timezone: this value does not change around
> the DST change date.

No, but time.daylight changes and time.timezone can change in situations like
these where a region decides to change the way DST is dealt with, e.g.
switches to the DST timezone or moves the switchover date. Since both values
are tied to a specific time I don't think it's a good idea to have them as
module globals.

> That's why the current implementation uses "absolute" dates like the 1st of
> January: DST changes are often in March and October.

Such an algorithm can be used as fallback solution in case tm_isdst is -1
(unknown), but not in case the DST information is available.

> What about this algorithm:
> - pick the first of January and the first of July surrounding the current
>   date
> - if both have tm_idst==0, the region has no DST. Use the current GMT
>   offset for both timezone and altzone; daylight=0

Those two steps are not necessary. If tm_isdst == 0, you already know that
the current time zone is not DST.

> - otherwise, use the *current* time and get its DST and GMT offset. This is
>   enough to compute both timezone and altzone (with the relation
>   altzone=timezone-3600)

That's what I suggested above.
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
>>> But that does not give the *other* timezone :-(
>> Which other timezone ?
> I meant the other timezone *name*.
>
> I think we don't understand each other:
> - time.timezone is the offset of the local (non-DST) timezone.
> - time.altzone is the offset of local DST timezone.

Yes, I know.

> They don't depend on the current date, they depend only on the timezone
> database.
> localtime() only gives the timezone for a given point in time, and the time
> module needs to present two timezones.

Right, but they should only depend on the data in the timezone database at
the time of import of the module and not determine the values by looking at
specific dates in the past.

The only problem is finding out whether the locale uses DST in case the
current import time points to a non-DST time. This can be checked by looking
at Jan 1st and June 1st after the current import time (i.e. in the future)
and then testing tm_isdst. If there is a DST change, then you set
time.altzone = time.timezone - 3600. Otherwise, you set
time.altzone = time.timezone.
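A rough sketch of that import-time approach (this assumes a platform where
struct_time exposes tm_gmtoff; the helper name is made up for illustration):

```python
import time

def probe_offsets():
    """Derive timezone/altzone style values from the current point in time."""
    now = time.localtime()
    if now.tm_isdst > 0:
        altzone = -now.tm_gmtoff            # DST offset is in effect right now
        tz = altzone + 3600
    else:
        tz = -now.tm_gmtoff                 # non-DST offset
        # Look ahead (Jan 1st / Jun 1st of next year) to see whether DST exists.
        year = now.tm_year + 1
        has_dst = False
        for month in (1, 6):
            probe = time.mktime((year, month, 1, 12, 0, 0, 0, 0, -1))
            if time.localtime(probe).tm_isdst > 0:
                has_dst = True
        altzone = tz - 3600 if has_dst else tz
    return tz, altzone
```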
[issue13707] Clarify hash() constancy period
Marc-Andre Lemburg added the comment:

Terry J. Reedy wrote:
> Martin, I do not understand. The default hash is based on id (as is default
> equality comparison), not value. Are you OK with hash values changing if the
> 'value' changes? My understanding is that changing hash values for objects
> in sets and dicts is bad, which is why mutable builtins with value-based
> equality do not have hash values.

Hash values are based on the object values, not their id(). See the various
type implementations as reference. The id() is only used as hash for objects
which don't have a "value" (and thus cannot be compared).

Given that we have the invariant "a==b => hash(a)==hash(b)" in Python, it
immediately follows that hash values for objects with comparison methods
cannot have a lifetime - at least not within the same process and, depending
how you look at it, also not in multi-process applications.

--
nosy: +lemburg
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Some comments:

1. The security implications in all this are being somewhat overemphasized.

There are many ways you can do a DoS attack on web servers. It's the
responsibility of the used web frameworks and servers to deal with the
possible cases.

It's a good idea to provide some way to protect against hash collision
attacks, but that will only solve one possible way of causing a resource
attack on a server. There are other ways you can generate lots of CPU
overhead with little data input (e.g. think of targeting the search feature
on many Zope/Plone sites).

In order to protect against such attacks in general, we'd have to provide a
way to control CPU time and e.g. raise an exception if too much time is being
spent on a simple operation such as a key insertion. This can be done using
timers, signals or even under OS control.

The easiest way to protect against the hash collision attack is by limiting
the POST/GET/HEAD request size. The second best way would be to limit the
number of parameters that a web framework accepts for POST/GET/HEAD requests.

2. Changing the semantics of hashing in a dot release is not allowed.

If randomization of the hash start vector or some other method is enabled by
default in a dot release, this will change the semantics of any application
switching to that dot release.

The hash values of Python objects are not only used by the Python dictionary
implementation, but also by other storage mechanisms such as on-disk
dictionaries, inter-process object exchange via shared memory, memcache, etc.

Hence, if changed, the hash change should be disabled per default for dot
releases and enabled for 3.3.

3. Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g. take integers
which use a trivial hash function. We'd have to adapt all hash functions of
the basic types in Python or come up with a generic solution using e.g.
double-hashing in the dictionary/set implementations.

4. By just using a random start vector you change the absolute hash values
for specific objects, but not the overall hash sequence or its period.

An attacker only needs to create many hash collisions, not specific ones.
It's the period of the hash function that's important in such attacks and
that doesn't change when moving to a different start vector.

5. Hashing needs to be fast.

It's one of the most used operations in Python. Please get experts into the
boat like Tim Peters and Christian Tismer, who both have worked on the dict
implementation and the hash functions, before experimenting with ad-hoc
fixes.

6. Counting collisions could solve the issue without having to change
hashing.

Another idea would be counting the collisions and raising an exception if the
number of collisions exceeds a certain threshold. Such a change would work
for all hashable Python objects and protect against the attack without
changing any hash function.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

--
nosy: +lemburg
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
> 3. Changing the way strings are hashed doesn't solve the problem.
>
> Hash values of other types can easily be guessed as well, e.g.
> take integers which use a trivial hash function.

Here's an example for integers on a 64-bit machine:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict(g)

This takes ages to complete and only uses very little memory. The input data
has some 32MB if written down in decimal numbers - not all that much data
either.

32397634
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

The email interface ate part of my reply:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> s = ''.join(str(x) for x in g)
>>> len(s)
32397634
>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict(g)

... lots of time for coffee, pizza, taking a walk, etc. :-)
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
> 1. The security implications in all this is being somewhat overemphasized.
>
> There are many ways you can do a DoS attack on web servers. It's the
> responsibility of the used web frameworks and servers to deal with
> the possible cases.
>
> It's a good idea to provide some way to protect against hash
> collision attacks, but that will only solve one possible way of
> causing a resource attack on a server.
>
> There are other ways you can generate lots of CPU overhead with
> little data input (e.g. think of targeting the search feature on
> many Zope/Plone sites).
>
> In order to protect against such attacks in general, we'd have to
> provide a way to control CPU time and e.g. raise an exception if too
> much time is being spent on a simple operation such as a key insertion.
> This can be done using timers, signals or even under OS control.
>
> The easiest way to protect against the hash collision attack is by
> limiting the POST/GET/HEAD request size.

For GET and HEAD, web servers normally already apply such limitations at
rather low levels:

http://stackoverflow.com/questions/686217/maximum-on-http-header-values

So only HTTP methods which carry data in the body part of the HTTP request
are affected, e.g. POST and various WebDAV methods.

> The second best way would be to limit the number of parameters that a
> web framework accepts for POST/GET/HEAD request.

Depending on how parsers are implemented, applications taking
XML/JSON/XML-RPC/etc. as data input may also be vulnerable, e.g.
non-validating XML parsers which place element attributes into a dictionary
or a JSON parser that has to read the JSON version of the dict I generated
earlier on.
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Paul McMillan wrote:
> This is not something that can be fixed by limiting the size of POST/GET.
>
> Parsing documents (even offline) can generate these problems. I can create
> books that calibre (a Python-based ebook format shifting tool) can't
> convert, but are otherwise perfectly valid for non-python devices. If I'm
> allowed to insert usernames into a database and you ever retrieve those in
> a dict, you're vulnerable. If I can post things one at a time that
> eventually get parsed into a dict (like the tag example), you're
> vulnerable. I can generate web traffic that creates log files that are
> unparsable (even offline) in Python if dicts are used anywhere. Any
> application that accepts data from users needs to be considered.
>
> Even if the web framework has a dictionary implementation that randomizes
> the hashes so it's not vulnerable, the entire python standard library uses
> dicts all over the place. If this is a problem which must be fixed by the
> framework, they must reinvent every standard library function they hope to
> use.
>
> Any non-trivial python application which parses data needs the fix. The
> entire standard library needs the fix if is to be relied upon by
> applications which accept data. It makes sense to fix Python.

Agreed: Limiting the size of POST requests only applies to *web*
applications. Other applications will need other fixes.

Trying to fix the problem in general by tweaking the hash function to
(apparently) make it hard for an attacker to guess a good set of colliding
strings/integers/etc. is not really a good solution. You'd only be making it
harder for script kiddies, but as soon as someone cryptanalyzes the hash
algorithm used, you're lost again.

You'd need to use crypto hash functions or universal hash functions if you
want to achieve good security, but that's not an option for Python objects,
since the hash functions need to be as fast as possible (which rules out
crypto hash functions) and cannot easily drop the invariant "a=b =>
hash(a)=hash(b)" (which rules out universal hash functions, AFAICT).

IMO, the strategy to simply cap the number of allowed collisions is a better
way to achieve protection against this particular resource attack. The
probability of having valid data reach such a limit is low and, if
configurable, can be made 0.

> Of course we must fix all the basic hashing functions in python, not just
> the string hash. There aren't that many.

... not in Python itself, but if you consider all the types in Python
extensions and classes implementing __hash__ in user code, the number of
hash functions to fix quickly becomes unmanageable.

> Marc-Andre:
> If you look at my proposed code, you'll notice that we do more than simply
> shift the period of the hash. It's not trivial for an attacker to create
> colliding hash functions without knowing the key.

Could you post it on the ticket ?

BTW: I wonder how long it's going to take before someone figures out that
our merge sort based list.sort() is vulnerable as well... its worst-case
performance is O(n log n), making attacks somewhat harder. The popular
quicksort which Python used for a long time has O(n²), making it much easier
to attack, but fortunately, we replaced it with merge sort in Python 2.3,
before anyone noticed ;-)
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Before continuing down the road of adding randomness to hash functions,
please have a good read of the existing dictionary implementation:

"""
Major subtleties ahead: Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness. Python doesn't: its most
important hash functions (for strings and ints) are very regular in common
cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. The
same is approximately true when keys are "consecutive" strings. So this gives
better-than-random behavior in common cases, and that's very desirable.
...
"""

There's also a file called dictnotes.txt which has more interesting details
about how the implementation is designed.

Please note that the term "collision" is used in a slightly different way: it
refers to trying to find an empty slot in the dictionary table. Having a
collision implies that the hash values of two distinct objects are the same,
but you also get collisions in case two distinct objects with different hash
values get mapped to the same table entry.

An attack can be based on trying to find many objects with the same hash
value, or trying to find many objects that, as they get inserted into a
dictionary, very often cause collisions due to the collision resolution
algorithm not finding a free slot.

In both cases, the (slow) object comparisons needed to find an empty slot is
what makes the attack practical, if the application puts too much trust into
large blobs of input data - which is the actual security issue we're trying
to work around here...

Given the dictionary implementation notes, I'm even less certain that the
randomization change is a good idea. It will likely introduce a performance
hit due to both the added complexity in calculating the hash as well as the
reduced cache locality of the data in the dict table.

I'll upload a patch that demonstrates the collision counting strategy to show
that detecting the problem is easy. Whether just raising an exception is a
good idea, is another issue.

It may be better to change the tp_hash slot in Python 3.3 to take an
argument, so that the dict implementation can use the hash function as a
universal hash family function (see
http://en.wikipedia.org/wiki/Universal_hash). The dict implementation could
then alter the hash parameter and recreate the dict table in case the number
of collisions exceeds a certain limit, thereby actively taking action instead
of just relying on randomness solving the issue in most cases.
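As a rough, pure-Python illustration of the counting idea (this is a toy
open-addressing probe, not CPython's actual lookdict and not the attached
patch, and the limit of 1000 is arbitrary):

```python
MAX_COLLISIONS = 1000

def probe(slots, key):
    """Find the slot index for key in a power-of-two sized table of (key, value) pairs."""
    mask = len(slots) - 1
    h = hash(key)
    i = h & mask
    perturb = h
    collisions = 0
    while slots[i] is not None and slots[i][0] != key:
        collisions += 1
        if collisions > MAX_COLLISIONS:
            # Same idea as the demo patch: refuse to keep probing forever.
            raise KeyError("too many hash collisions")
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask
    return i
```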
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Demo patch implementing the collision limit idea for Python 2.7. -- Added file: http://bugs.python.org/file24151/hash-attack.patch ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
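The patch itself is attached to the tracker issue; purely as an illustration of the counting idea (a toy pure-Python model, not the actual C change in hash-attack.patch):

    # Toy open-addressing insert with a cap on collisions per lookup.  Linear
    # probing is used for brevity; the real dict lookup perturbs the index.
    MAX_COLLISIONS = 1000

    def toy_insert(table, key, value):
        mask = len(table) - 1
        i = hash(key) & mask
        collisions = 0
        while table[i] is not None and table[i][0] != key:
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise KeyError('too many hash collisions')
            i = (i + 1) & mask
        table[i] = (key, value)

    table = [None] * 8
    toy_insert(table, 'a', 1)
    toy_insert(table, 'b', 2)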
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: The hash-attack.patch solves the problem for the integer case I posted earlier on and doesn't cause any problems with the test suite. Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'too many hash collisions' It also doesn't change the hashing or dict repr in existing applications. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Stupid email interface again... here's the full text: The hash-attack.patch solves the problem for the integer case I posted earlier on and doesn't cause any problems with the test suite. >>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100)) >>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000)) Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'too many hash collisions' It also doesn't change the hashing or dict repr in existing applications. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > hash-attack.patch does never decrement the collision counter. Why should it ? It's only used as local variable in the lookup function. Note that the limit only triggers on a per-key basis. It's not a limit on the total number of collisions in the table, so you don't need to keep the number of collisions stored on the object. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Here's an example of hash-attack.patch finding an on-purpose programming error (hashing all objects to the same value): http://stackoverflow.com/questions/4865325/counting-collisions-in-a-python-dictionary (see the second example on the page for @Winston Ewert's solution) With the patch you get: Traceback (most recent call last): File "testcollisons.py", line 20, in <module> d[o] = 1 KeyError: 'too many hash collisions' -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
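The pattern behind that example is easy to reproduce; something along these lines (not the exact code from the linked page) is what the limit is meant to catch:

    # Every instance hashes to the same value, so each insert has to probe
    # past all previously inserted instances -- the degenerate quadratic case.
    class BadKey(object):
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42

    d = {}
    for i in range(2000):
        d[BadKey(i)] = 1   # with the collision limit patch, this raises KeyError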
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Paul McMillan wrote: > >> I'll upload a patch that demonstrates the collision counting >> strategy to show that detecting the problem is easy. Whether >> just raising an exception is a good idea is another issue. > > I'm in cautious agreement that collision counting is a better > strategy. The dict implementation performance would suffer from > randomization. > >> The dict implementation could then alter the hash parameter >> and recreate the dict table in case the number of collisions >> exceeds a certain limit, thereby actively taking action >> instead of just relying on randomness solving the issue in >> most cases. > > This is clever. You basically neuter the attack as you notice it but > everything else is business as usual. I'm concerned that this may end > up being costly in some edge cases (e.g. look up how many collisions > it takes to force the recreation, and then aim for just that many > collisions many times). Unfortunately, each dict object has to > discover for itself that it's full of offending hashes. Another > approach would be to neuter the offending object by changing its hash, > but this would require either returning multiple values, or fixing up > existing dictionaries, neither of which seems feasible. I ran some experiments with the collision counting patch and could not trigger it in normal applications, not even in cases that are documented in the dict implementation to have a poor collision resolution behavior (integers with zeros in the low bits). The probability of having to deal with dictionaries that create over a thousand collisions for one of the key objects in a real life application appears to be very very low. Still, it may cause problems with existing applications for the Python dot releases, so it's probably safer to add it in a disabled-per-default form there (using an environment variable to adjust the setting). For 3.3 it could be enabled per default and it would also make sense to allow customizing the limit using a sys module setting. The idea with adding a parameter to the hash method/slot in order to have objects provide a hash family function instead of a fixed unparametrized hash function would probably have to be implemented as an additional hash method, e.g. .__uhash__() and tp_uhash ("u" for universal). The builtin types should then grow such methods in order to make hashing safe against such attacks. For objects defined in 3rd party extensions, we would need to encourage implementing the slot/method as well. If it's not implemented, the dict implementation would have to fall back to raising an exception. Please note that I'm just sketching things here. I don't have time to work on a full-blown patch, just wanted to show what I meant with the collision counting idea and demonstrate that it actually works as intended. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
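There is no .__uhash__() in CPython; just to sketch what a hash family means here, a parametrized string hash could look like this (the 1000003 multiplier is the one from the classic string hash, the rest is purely illustrative):

    # Each value of `param` selects a different member of the hash family, so
    # a dict under attack could pick a new parameter and rebuild its table.
    def uhash(s, param):
        h = param
        for ch in s:
            h = ((h * 1000003) ^ ord(ch)) & (2**64 - 1)
        return h ^ len(s)

    print(uhash("namea", 0), uhash("namea", 12345))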
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Tim Peters wrote: > > Tim Peters added the comment: > > [Marc-Andre] >> BTW: I wonder how long it's going to take before >> someone figures out that our merge sort based >> list.sort() is vulnerable as well... its worst- >> case performance is O(n log n), making attacks >> somewhat harder. > > I wouldn't worry about that, because nobody could stir up anguish > about it by writing a paper ;-) > > 1. O(n log n) is enormously more forgiving than O(n**2). > > 2. An attacker need not be clever at all: O(n log n) is not only > sort()'s worst case, it's also its _expected_ case when fed randomly > ordered data. > > 3. It's provable that no comparison-based sorting algorithm can have > better worst-case asymptotic behavior when fed randomly ordered data. > > So if anyone whines about this, tell 'em to go do something useful instead :-) Right on all accounts :-) -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Christian Heimes wrote: > Marc-Andre: > Have you profiled your suggestion? I'm interested in the speed implications. > My gut feeling is that your idea could be slower, since you have added more > instructions to a tight loop, that is execute on every lookup, insert, update > and deletion of a dict key. The hash modification could have a smaller > impact, since the hash is cached. I'm merely speculating here until we have > some numbers to compare. I haven't done any profiling on this yet, but will run some tests. The lookup functions in the dict implementation are optimized to make the first non-collision case fast. The patch doesn't touch this loop. The only change is in the collision case, where an increment and comparison is added (and then only after the comparison which is the real cost factor in the loop). I did add a printf() to see how often this case occurs - it's a surprisingly rare case, which suggests that Tim, Christian and all the others that have invested considerable time into the implementation have done a really good job here. BTW: I noticed that a rather obvious optimization appears to be missing from the Python dict initialization code: when passing in a list of (key, value) pairs, the implementation doesn't make use of the available length information and still starts with an empty (small) dict table and then iterates over the pairs, increasing the table size as necessary. It would be better to start with a table that is presized to O(len(data)). The dict implementation already provides such a function, but it's not being used in the case dict(pair_list). Anyway, just an aside. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
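The resizing behaviour described in the aside is easy to observe from Python (the exact sizes and thresholds differ between versions and builds, so the numbers printed are not meant to be exact):

    import sys

    # Building a dict item by item -- which is effectively what dict(pair_list)
    # does without the presizing optimization -- grows the table several times.
    d = {}
    last = sys.getsizeof(d)
    for i in range(10000):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            print("resized at %d items: %d -> %d bytes" % (len(d), last, size))
            last = size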
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Marc-Andre Lemburg wrote: > > Marc-Andre Lemburg added the comment: > > Christian Heimes wrote: >> Marc-Andre: >> Have you profiled your suggestion? I'm interested in the speed implications. >> My gut feeling is that your idea could be slower, since you have added more >> instructions to a tight loop, that is execute on every lookup, insert, >> update and deletion of a dict key. The hash modification could have a >> smaller impact, since the hash is cached. I'm merely speculating here until >> we have some numbers to compare. > > I haven't done any profiling on this yet, but will run some > tests. I ran pybench and pystone: neither shows a significant change. I wish we had a simple to run benchmark based on Django to allow checking such changes against real world applications. Not that I expect different results from such a benchmark... To check the real world impact, I guess it would be best to run a few websites with the patch for a week and see whether the collision exception gets raised. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > Patch version 5 fixes test_unicode for 64-bit system. Victor, I don't think the randomization idea is going anywhere. The code has many issues: * it is exceedingly complex * the method would need to be implemented for all hashable Python types * it causes startup time to increase (you need urandom data for every single hashable Python data type) * it causes run-time to increase due to changes in the hash algorithm (more operations in the tight loop) * causes different processes in a multi-process setup to use different hashes for the same object * doesn't appear to work well in embedded interpreters that regularly restarted interpreters (AFAIK, some objects persist across restarts and those will have wrong hash values in the newly started instances) The most important issue, though, is that it doesn't really protect Python against the attack - it only makes it less likely that an adversary will find the init vector (or a way around having to find it via crypt analysis). OTOH, the collision counting patch is very simple, doesn't have the performance issues and provides real protection against the attack. Even better still, it can detect programming errors in hash method implementations. IMO, it would be better to put efforts into refining the collision detection patch (perhaps adding support for the universal hash method slot I mentioned) and run some real life tests with it. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> * it is exceedingly complex > > Which part exactly? For hash(str), it just add two extra XOR. I'm not talking specifically about your patch, but the whole idea and the needed changes in general. >> * the method would need to be implemented for all hashable Python types > > It was already discussed, and it was said that only hash(str) need to > be modified. Really ? What about the much simpler attack on integer hash values ? You only have to send a specially crafted JSON dictionary with integer keys to a Python web server providing JSON interfaces in order to trigger the integer hash attack. The same goes for the other Python data types. >> * it causes startup time to increase (you need urandom data for >> every single hashable Python data type) > > My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do > you have a benchmark showing a difference? > > I didn't try my patch on Windows yet. Your patch only implements the simple idea of adding an init vector and a fixed suffix vector (which you don't need since it doesn't prevent hash collisions). I don't think that's good enough, since it doesn't change how the hash algorithm works on the actual data, but instead just shifts the algorithm to a different sequence. If you apply the same logic to the integer hash function, you'll see that more clearly. Paul's algorithm is much more secure in this respect, but it requires more random startup data. >> * it causes run-time to increase due to changes in the hash >> algorithm (more operations in the tight loop) > > I posted a micro-benchmark on hash(str) on python-dev: the overhead is > nul. Did you have numbers showing that the overhead is not nul? For the simple solution, that's an expected result, but if you want more safety, then you'll see a hit due to the random data getting XOR'ed in every single loop. >> * causes different processes in a multi-process setup to use different >> hashes for the same object > > Correct. If you need to get the same hash, you can disable the > randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g. > PYTHONHASHSEED=42). So you have the choice of being able to work in a multi-process environment and be vulnerable to the attack or not. I think we can do better :-) Note that web servers written in Python tend to be long running processes, so an attacker has lots of time to test various seeds. >> * doesn't appear to work well in embedded interpreters that >> regularly restarted interpreters (AFAIK, some objects persist across >> restarts and those will have wrong hash values in the newly started >> instances) > > test_capi runs _testembed which restarts a embedded interpreters 3 > times, and the test pass (with my patch version 5). Can you write a > script showing the problem if there is a real problem? > > In an older version of my patch, the hash secret was recreated at each > initiliazation. I changed my patch to only generate the secret once. Ok, that should fix the case. 
Two more issues that I forgot: * enabling randomized hashing can make debugging a lot harder, since it's rather difficult to reproduce the same state in a controlled way (unless you record the hash seed somewhere in the logs) and even though applications should not rely on the order of dict repr()s or str()s, they do often enough: * randomized hashing will result in repr() and str() of dictionaries being random as well >> The most important issue, though, is that it doesn't really >> protect Python against the attack - it only makes it less >> likely that an adversary will find the init vector (or a way >> around having to find it via crypt analysis). > > I agree that the patch is not perfect. As written in the patch, it > just makes the attack more complex. I consider that it is enough. Wouldn't you rather see a fix that works for all hash functions and Python objects ? One that doesn't cause performance issues ? The collision counting idea has this potential. > Perl has a simpler protection than the one proposed in my patch. Is > Perl vulnerable to the hash collision vulnerability? I don't know what Perl did or how hashing works in Perl, so cannot comment on the effect of their fix. FWIW, I don't think that we should use Perl or Java as reference here. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
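With the proposed PYTHONHASHSEED variable, the repr/order instability is easy to demonstrate by spawning interpreters with different seeds (assumes an interpreter that honours PYTHONHASHSEED):

    import os
    import subprocess
    import sys

    # The same set of strings typically prints in a different order for
    # different hash seeds, which is what makes debug output non-repeatable.
    code = "print(list({'namea', 'nameb', 'namec', 'named'}))"
    for seed in ("1", "2"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        out = subprocess.check_output([sys.executable, "-c", code], env=env)
        print(seed, out.decode().strip())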
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Mark Shannon wrote: > > Mark Shannon added the comment: > >>>> * the method would need to be implemented for all hashable Python types >>> It was already discussed, and it was said that only hash(str) need to >>> be modified. >> >> Really ? What about the much simpler attack on integer hash values ? >> >> You only have to send a specially crafted JSON dictionary with integer >> keys to a Python web server providing JSON interfaces in order to >> trigger the integer hash attack. > > JSON objects are decoded as dicts with string keys, integers keys are > not possible. > > >>> json.loads(json.dumps({1:2})) > {'1': 2} Thanks for the correction. Looks like XML-RPC also doesn't accept integers as dict keys. That's good :-) However, as Paul already noted, such attacks can also occur in other places or parsers in an application, e.g. when decoding FORM parameters that use integers to signal a line or parameter position (example: value_1=2&value_2=3...) which are then converted into a dictionary mapping the position integer to the data. marshal and pickle are vulnerable, but then you normally don't expose those to untrusted data. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
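For instance, a parser along these lines (a hypothetical helper, only to illustrate the FORM-parameter case) ends up with an attacker-controlled dict of integer keys:

    # Turn "value_1=2&value_2=3&..." into {1: '2', 2: '3', ...}: the integer
    # positions are chosen by the client, so they can be chosen to collide.
    try:
        from urllib.parse import parse_qsl   # Python 3
    except ImportError:
        from urlparse import parse_qsl       # Python 2

    def positional_values(query):
        result = {}
        for key, value in parse_qsl(query):
            if key.startswith("value_"):
                result[int(key.split("_", 1)[1])] = value
        return result

    print(positional_values("value_1=2&value_2=3"))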
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> OTOH, the collision counting patch is very simple, doesn't have >> the performance issues and provides real protection against the >> attack. > > I don't know about real protection: you can still slow down dict > construction by 1000x (the number of allowed collisions per lookup), > which can be enough combined with a brute-force DOS. On my slow dev machine 1000 collisions run in around 22ms: python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))" 100 loops, best of 3: 22.4 msec per loop Using this for a DOS attack would be rather noisy, much unlike sending a single POST. Note that the choice of 1000 as limit is rather arbitrary. I just chose it because it's high enough because it's very unlikely to be hit by an application that is not written to trigger it and it's low enough to still provide a good run-time behavior. Perhaps an even lower figure would be better. > Also, how about false positives? Having legitimate programs break > because of legitimate data would be a disaster. Yes, which is why the patch should be disabled by default (using an env var) in dot-releases. It's probably also a good idea to make the limit configurable to adjust to ones needs. Still, it is *very* unlikely that you run into real data causing more than 1000 collisions for a single insert. For full protection the universal hash method idea would have to be implemented (adding a parameter to the hash methods, so that they can be parametrized). This would then allow switching the dict to an alternative hash implementation resolving the collision problem, in case the implementation detects high number of collisions. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Mark Dickinson wrote: > > Mark Dickinson added the comment: > > [Antoine] >> Also, how about false positives? Having legitimate programs break >> because of legitimate data would be a disaster. > > This worries me, too. > > [MAL] >> Yes, which is why the patch should be disabled by default (using >> an env var) in dot-releases. > > Are you proposing having it enabled by default in Python 3.3? Possibly, yes. Depends on whether anyone comes up with a problem in the alpha, beta, RC release cycle. It would be great to have the universal hash method approach for Python 3.3. That way Python could self-heal itself in case it finds too many collisions. My guess is that it's still better to raise an exception, though, since it would uncover either attacks or programming errors. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> On my slow dev machine 1000 collisions run in around 22ms: >> >> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, >> 1000))" >> 100 loops, best of 3: 22.4 msec per loop >> >> Using this for a DOS attack would be rather noisy, much unlike >> sending a single POST. > > Note that sending one POST is not enough, unless the attacker is content > with blocking *one* worker process for a couple of seconds or minutes > (which is a rather tiny attack if you ask me :-)). Also, you can combine > many dicts in a single JSON list, so that the 1000 limit isn't > overreached for any of the dicts. Right, but such an approach only scales linearly and doesn't exhibit the quadratic nature of the collision resolution. The above with 10,000 items takes 5 seconds on my machine. The same with 100,000 items is still running after 16 minutes. > So in all cases the attacker would have to send many of these POST > requests in order to overwhelm the target machine. That's how DOS > attacks work AFAIK. Depends :-) Hiding a few tens of such requests in the input stream of a busy server is easy. Doing the same with thousands of requests is a lot harder. FWIW: The above dict string version just has some 263kB for the 10,000 case, 114kB if gzip compressed. >> Yes, which is why the patch should be disabled by default (using >> an env var) in dot-releases. It's probably also a good idea to >> make the limit configurable to adjust to ones needs. > > Agreed if it's disabled by default then it's not a problem, but then > Python is vulnerable by default... Yes, but at least the user has an option to switch on the added protection. We'd need some field data to come to a decision. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
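The quadratic growth is easy to check by timing a few sizes (Python 2 shown to match the timeit call quoted above; it relies on multiples of 2**64 - 1 colliding on a 64-bit build):

    import timeit

    # Doubling the number of colliding keys should roughly quadruple the time.
    for n in (1000, 2000, 4000):
        t = timeit.timeit(
            "dict((x*(2**64 - 1), 1) for x in xrange(1, %d))" % n, number=3)
        print(n, t / 3)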
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Frank Sievertsen wrote: > > I don't want my software to stop working because someone managed to enter > 1000 bad strings into it. Think of a software that handles names of customers > or filenames. We don't want it to break completely just because someone > entered a few clever names. Collision counting is just a simple way to trigger an action. As I mentioned in my proposal on this ticket, raising an exception is just one way to deal with the problem in case excessive collisions are found. A better way is to add a universal hash method, so that the dict can adapt to the data and modify the hash functions for just that dict (without breaking other dicts or changing the standard hash functions). Note that raising an exception doesn't completely break your software. It just signals a severe problem with the input data and a likely attack on your software. As such, it's no different than turning on DOS attack prevention in your router. In case you do get an exception, a web server will simply return a 500 error and continue working normally. For other applications, you may see a failure notice in your logs. If you're sure that there are no possible ways to attack the application using such data, then you can simply disable the feature to prevent such exceptions. > Randomization fixes most of these problems. See my list of issues with this approach (further up on this ticket). > However, it breaks the steadiness of hash(X) between two runs of the same > software. There's probably code out there that assumes that hash(X) always > returns the same value: database- or serialization-modules, for example. > > There might be good reasons to also have a steady hash-function available. > The broken code is hard to fix if no such a function is available at all. > Maybe it's possible to add a second steady hash-functions later again? This is one of the issues I mentioned. > For the moment I think the best way is to turn on randomization of hash() by > default, but having a way to turn it off. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: Ezio Melotti wrote: > > Ezio Melotti added the comment: > > I suggest to: > 1) remove the alias for tactis; > 2) add the aliases for latin_* and the tests for the aliases; > 3) fix the email.charset to use the new aliases instead of its own dict. > > 2) and 3) should go on 3.3 only, 1) could be considered a bug and fixed on > 2.7/3.2 too, but since the codec is already missing, removing the alias won't > change anything (i.e. it will raise a LookupError with or without alias). +1 -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12100] Incremental encoders of CJK codecs reset the codec at each call to encode()
Marc-Andre Lemburg added the comment: I think it's better to use a StringIO instance for the tests. Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this. -- ___ Python tracker <http://bugs.python.org/issue12100> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > euc_jp and euc_kr seem to be backward (that is, codecs translates them to the > _ version, instead of translating the _ version to the - version). I worry > that there might be other deviations from the standard email names. I would > suggest we pull the list of preferred MIME names from the IANA charset > registry and make a test out of them in the email package. If changing the > name returned by codecs is determined to not be acceptable, then those > entries will need to remain in the charset module ALIASES table and the > codecs-check logic adjusted accordingly. > > Unfortunately the IANA registry does not list MIME names for all of the > charsets in common use, and the canonical names are not always the ones > commonly used in email. Hopefully the codecs registry is using the most > common name for those, and hopefully if there are differences it won't break > any user code, since any reasonable email code should be coping with the > aliases in any case. The way I understand the patch was that the email package will start to use the encoding aliases for determining the codec name instead of its own list. That is: only for decoding the input data, not for creating a correct MIME encoding name in output data. -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > New submission from STINNER Victor : > > Sometimes, we need to know the version of the Linux kernel. Recent examples: > test if SOCK_CLOEXEC or O_CLOEXEC are supported by the kernel or not. Linux < > 2.6.23 *silently* ignores O_CLOEXEC flag of open(). > > linux_version() is already implemented in test_socket, but it looks like > test_posix does also need it. > > Attached patch adds platform.linux_version(). It returns (a, b, c) (integers) > or None (if not Linux). > > It raises an error if the version string cannot be parsed. The APIs in platform generally try not to raise errors, but instead return a default value you pass in as parameter in case the data cannot be fetched from the system. The returned value should be a version string in a fixed format, not a tuple. I'd suggest to use _norm_version() for this. Please also check whether this works on a few Linux systems. I've checked it on openSUSE, Ubuntu. Thanks, -- Marc-Andre Lemburg eGenix.com 2011-06-20: EuroPython 2011, Florence, Italy 28 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> The returned value should be a version string in a fixed format, >> not a tuple. I'd suggest to use _norm_version() for this. > > How do you compare version strings? I prefer tuples, as sys.version_info, > because the comparaison is more natural: > >>>> '2.6.9' > '2.6.20' > True >>>> (2, 6, 9) > (2, 6, 20) > False The APIs are mostly used for creating textual representations of system information, hence the use of strings. You can add an additional linux_version_info() API if you want to have tuples. -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > Use "%s.%s.%s" % linux_version() if you would like to format the version. The > format is well defined. (You should only do that under Linux.) No, please follow the API conventions in that module and use a string. You can then use linux_version().split('.') in code that want to do version comparisons. Thanks, -- Marc-Andre Lemburg eGenix.com 2011-06-20: EuroPython 2011, Florence, Italy 28 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Closing the ticket again. We still need codecs.open() to support applications that target Python 2.x and 3.x. You can reopen it after Python 2.x has been end-of-life'd. -- resolution: -> postponed status: open -> closed ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Correcting the title: this ticket is about codecs.open(), not StreamRead and StreamWriter, both of which are essential parts of the Python codec machinery and are needed to be able to implement per-codec implementations of codecs which read from and write to streams. TextIOWrapper() is conceptually something completely different. It's more something like StreamReaderWriter(). The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncode/Decoder per default. Please open a new ticket for this. Thanks. -- ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12160] codecs doc: what is StreamCodec?
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > New submission from STINNER Victor : > > Codec.encode() and Codec.decode() refer to StreamCode, but I cannot find this > class in the doc nor in the code. > > I suppose that it should be replaced by IncrementalEncoder and > IncrementalDecoder. If I'm correct, see attached patch. Thanks for spotting this. It should read StreamReader/StreamWriter, since these were designed to keep state. -- ___ Python tracker <http://bugs.python.org/issue12160> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> You can then use linux_version().split('.') in code that want >> to do version comparisons. > > It doesn't give the expected result: > >>>> ('2', '6', '9') < ('2', '6', '20') > False >>>> ('2', '6', '9') > ('2', '6', '20') > True Sorry, I forgot the tuple(int(x) for ...) part. > By the way, if you would like to *display* the Linux version, it's better to > use release() which gives more information: No, release() doesn't have any defined format. >>>> platform.linux_version() > (2, 6, 38) >>>> platform.release() > '2.6.38-2-amd64' > > About the name convention: mac_ver() and win32_ver() do return tuples. If you > prefer linux_version_tuple(), it's as you want. But return a tuple of strings > is useless: if you would like a string, use release() and parse the string > yourself. Please look again: they both return the version and other infos as strings. > Note: "info" suffix is not currently used, whereas there are python_version() > and python_version_tuple(). Good point. I was thinking of the sys module function to return the Python version as tuple. >> Do we really need to expose a such Linux-centric and sparingly >> used function to the platform module? > > The platform module has already 2 functions specific to Linux: > linux_distribution() and libc_ver(). But if my proposed API doesn't fit > platform conventions, yeah, we can move the function to test.support. Indeed and in retrospect, adding linux_distribution() was a mistake, since it causes too much maintenance. The linux_version() is likely going to cause similar issues, since on the systems I checked, some return three part versions, others four parts and then again other add a distribution specific revision counter to it. Then you have pre-releases, release candidates and development versions: http://en.wikipedia.org/wiki/Linux_kernel#Version_numbering Reconsidering, I think it's better not to add the API to prevent opening up another can of worms. -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
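Code that needs such a comparison can do the parsing itself; a sketch along these lines (not part of the platform module, and only meaningful on Linux) looks at the leading digits only:

    import platform
    import re

    def linux_version_tuple(default=(0, 0, 0)):
        # '2.6.38-2-amd64' -> (2, 6, 38); anything unparsable -> default
        m = re.match(r"(\d+)\.(\d+)\.(\d+)", platform.release())
        return tuple(int(x) for x in m.groups()) if m else default

    print(linux_version_tuple() >= (2, 6, 23))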
[issue12158] platform: add linux_version()
Changes by Marc-Andre Lemburg : -- resolution: -> rejected status: open -> closed ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> TextIOWrapper() is conceptually something completely different. It's >> more something like StreamReaderWriter(). > > That's a rather strange assertion. Can you expand? > TextIOWrapper supports read-only, write-only, read-write, unseekable and > seekable streams. StreamReader and StreamWriter classes provide the base codec implementations for stateful interaction with streams. They define the interface and provide a working implementation for those codecs that choose not to implement their own variants. Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance. Both are essential parts of the codec interface. TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference. -- title: Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter -> Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open()
Changes by Marc-Andre Lemburg : -- title: Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter -> Deprecate codecs.open() ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12100] Incremental encoders of CJK codecs reset the codec at each call to encode()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> I think it's better to use a StringIO instance for the tests. > > For which test excatly? An encoder produces bytes, I don't the relation with > StringIO. Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO. -- title: Incremental encoders of CJK codecs reset the codec at each call to encode() -> Incremental encoders of CJK codecs reset the codec at each call to encode() ___ Python tracker <http://bugs.python.org/issue12100> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
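In Python 3 terms, a test can drive a codec's StreamWriter against an in-memory buffer; for example (euc_jp chosen arbitrarily):

    import codecs
    import io

    buf = io.BytesIO()
    writer = codecs.getwriter("euc_jp")(buf)
    writer.write("\u3042")    # HIRAGANA LETTER A
    print(buf.getvalue())     # b'\xa4\xa2'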
[issue12171] Reset method of the incremental encoders of CJK codecs calls the decoder reset function
Marc-Andre Lemburg added the comment: Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc added the comment: > > Do we need an additional method? It seems that this reset() could also be > written encoder.encode('', final=True) +1 I think that's a much more natural way to implement "finalize the encoding output without adding any data to it". -- title: Reset method of the incremental encoders of CJK codecs calls the decoder reset function -> Reset method of the incremental encoders of CJK codecs calls the decoder reset function ___ Python tracker <http://bugs.python.org/issue12171> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
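For a stateful codec the effect of final=True is visible; for example with iso2022_jp (the byte values in the comments are what the codec is expected to produce, shown only for illustration):

    import codecs

    enc = codecs.getincrementalencoder("iso2022_jp")()
    part = enc.encode("\u3042")         # switches into a JIS X 0208 segment
    tail = enc.encode("", final=True)   # flushes the escape back to ASCII
    print(part, tail)                   # e.g. b'\x1b$B$"' b'\x1b(B'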
[issue12171] Reset method of the incremental encoders of CJK codecs calls the decoder reset function
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > On Wednesday 25 May 2011 at 08:23 +, Marc-Andre Lemburg wrote: >>> Do we need an additional method? It seems that this reset() could >>> also be written encoder.encode('', final=True) >> >> +1 >> >> I think that's a much more natural way to implement "finalize the >> encoding output without adding any data to it". > > And so, reset() should discard the output? I can easily adapt my patch > to discard the output (but still call encreset() instead of decreset()). I'm not sure what you mean by "discard the output". Calling .reset() should still add the closing sequence to the output buffer, if needed. The purpose of .reset() is to flush all data and put the codec into a clean state (comparable to the state you get when you start using it). -- ___ Python tracker <http://bugs.python.org/issue12171> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9561] distutils: set encoding to utf-8 for input and output files
Marc-Andre Lemburg added the comment: Éric Araujo wrote: > > Éric Araujo added the comment: > > Definitely. We can fix real bugs in distutils, but sometimes it’s best to > avoid disruptive changes and let distutils with its buggy behavior and let > the packaging module have the best behavior. This is a real bug, since we agreed long ago that distutils should read and write files using the UTF-8 encoding. -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue9561> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > What is not-a-charset? > > I apparently misunderstood what normalize_encodings does. It isn't doing a > lookup in the codecs registry and returning the canonical name for the codec. > Does that mean we actually have to fetch the codec in order to get the > canonical name? I suspect so, and that is probably OK, since in most cases > the codec is eventually going to get called while processing the email that > triggered the ALIASES lookup. > > I also notice that there is a table of aliases in the codec module > documentation, so that will need to be updated as well. As far as the aliases.py part of the patch goes, I'm fine with that since it corrects a few real bugs and adds the missing Latin-N codec names. Regarding using this table in the email package, I'm not really clear on what you want to achieve. If you are looking for a way to determine whether Python has a codec installed for a certain charset name, then codecs.lookup() will tell you this (and it also applies all the aliasing and normalization needed). If you want to avoid the actual codec module import (codecs.lookup() imports the module), you can mimic the logic used by the lookup function of the encodings package. Not sure, whether that's worth it, though, since it is rather likely that you're going to use the codec you've just looked up soon after the test and codecs.lookup() caches the found codecs. If you want to convert an arbitrary encoding name to a registered standard IANA MIME charset name, then the aliases.py module is not going to be of much help, since we are using our own canonical names which do not necessarily map to the MIME charset names. You'd have to add a new mime_alias map to the email package for that. I'd suggest to use the same approach as for the aliases.py module, which is to first normalize the encoding name using normalize_encoding() and then running that through the mime_alias map. Hope that helps. -- ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
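For example, the lookup-based test is a one-liner (the canonical name shown in the comment is just what current CPython happens to return):

    import codecs

    print(codecs.lookup("latin-1").name)   # canonical codec name, e.g. 'iso8859-1'
    try:
        codecs.lookup("no-such-charset")
    except LookupError:
        print("unknown charset")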
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > Well, my thought was to avoid having multiple charset alias lists in the > stdlib, and reusing the one in codecs, which is larger than the one in email, > seemed to make sense. This came up because a bug was reported where email > (silently) failed to encode a string because the charset alias, while present > in codecs, wasn't present in the email ALIASES table. > > I suppose that as an alternative I could add full support for the IANA > aliases list to email. Email is the most likely place to run in to variant > charset aliases anyway. > > If that's the way we go, then this issue should be changed over to covering > just updating codecs with the missing aliases, and a new issue opened for > adding full IANA alias support to email. I think it would be useful to have a mapping from the Python canonical name (the one the encodings package uses) to the "preferred MIME name" as referenced in the IANA list: http://www.iana.org/assignments/character-sets This mapping could also be added to the encodings package together with a function that translates a given encoding name to its canonical Python name (codec_module_name()) and another one to translate it to the "preferred MIME name" according to the above list (encoding_mime_name()). Note that we don't support all the aliases mentioned in the IANA list because many of them are outdated and some have proved to be wrong (the aliased encodings are actually different in a few places). There are also a few encodings in the list which we don't support at all. Since we only rarely get requests for supporting new aliases or encodings, I think it's safe to say that the existing set is fairly complete from a practical point of view. -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
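A sketch of that mapping approach (the table below is a made-up excerpt, not an existing stdlib table; encoding_mime_name() is the proposed function name):

    import codecs

    PREFERRED_MIME_NAME = {
        # canonical codec name -> preferred MIME name (illustrative entries)
        "iso8859-1": "ISO-8859-1",
        "utf-8": "UTF-8",
        "euc_jp": "EUC-JP",
    }

    def encoding_mime_name(name, default=None):
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            return default
        return PREFERRED_MIME_NAME.get(canonical, default)

    print(encoding_mime_name("latin1"))    # ISO-8859-1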
[issue8796] Deprecate codecs.open()
Marc-Andre Lemburg added the comment: Roundup Robot wrote: > > Roundup Robot added the comment: > > New changeset 3555cf6f9c98 by Victor Stinner in branch 'default': > Issue #8796: codecs.open() calls the builtin open() function instead of using > http://hg.python.org/cpython/rev/3555cf6f9c98 Viktor, could you please back out this change again. I am -1 on deprecating the StreamReader/Writer parts of the codec API as I've mentioned numerous times and *don't* want to see these deprecated in the code or the documentation. I'm -0 on the change to codecs.open(). Have you checked whether the returned objects are compatible ? Thanks, -- Marc-Andre Lemburg eGenix.com 2011-05-23: Released eGenix mx Base 3.2.0 http://python.egenix.com/ 2011-05-25: Released mxODBC 3.1.1 http://python.egenix.com/ 2011-06-20: EuroPython 2011, Florence, Italy 24 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: Michele Orrù wrote: > > Michele Orrù added the comment: > > Any idea about how to unittest mime.aliases? Test the APIs you probably created for accessing it. > Also, since I've just created a new file, are there some buracratic issues? I > mean, do I have to add something at the top of the file? > (I'm just signing the Contributor Agreement) You just need to put the usual copyright line at the top of the file, together with the sentence from the agreement. Apart from that, you also need to make sure that the other build setups include the new file (PCbuild, Makefile.pre.in, etc.). If you don't know how to do this, you can ask someone else to take care of this, since it usually requires domain knowledge (e.g. to add the file to the Windows builds). -- ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12204] str.upper converts to title
Marc-Andre Lemburg added the comment: Ezio Melotti wrote: > > Ezio Melotti added the comment: > > '\u1ff3'.upper() returns '\u1ffc', so we have: > U+1FF3 (ῳ - GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI) > U+1FFC (ῼ - GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI) > The first belongs to the Ll (Letter, lowercase) category, whereas the second > belongs to the Lt (Letter, titlecase) category. > > The entries for these two chars in the UnicodeData.txt[0] files are: > 1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 > 0345;;;;N;;;1FFC;;1FFC > 1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 > 0345;;;;N;;;;1FF3; > > U+1FF3 has U+1FFC in both the third last and last field > (Simple_Uppercase_Mapping and Simple_Titlecase_Mapping respectively -- see > [1]), so .upper() is doing the right thing here. > U+1FFC has U+1FF3 in the second last field (Simple_Lowercase_Mapping), but > since it's category is not Lu, but Lt, .isupper() returns False. > > The Unicode Standard Annex #44[2] defines the Lt category as: > Lt Titlecase_Letter a digraphic character, with first part uppercase > > I'm not sure there's anything to fix here, both function behave as > documented, and it might indeed be the case that .upper() returns chars with > category Lt, that then return False with .isupper() > > [0]: http://unicode.org/Public/UNIDATA/UnicodeData.txt > [1]: http://www.unicode.org/reports/tr44/#UnicodeData.txt > [2]: http://www.unicode.org/reports/tr44/#GC_Values_Table I think there's a misunderstanding here: title cased characters are ones typically used in titles of a document. They don't necessarily have to be upper case, though, since some characters are never used as first letters of a word. Note that .upper() also does not guarantee to return an upper case character. It just applies the mapping defined in the Unicode standard and if there is no such mapping, or Python does not support the mapping, the method returns the original character. The German ß is such a character (U+00DF). It doesn't have an uppercase mapping in actual use and only received such a mapping in Unicode 5.1 based on rather controversial grounds (see http://en.wikipedia.org/wiki/ẞ). The character is normally mapped to 'SS' when converting it to upper case or title case. This multi-character mapping is not supported by Python, so .upper() just returns U+00DF. I suggest to close this ticket as invalid or to add a note to the documentation explaining how the mapping is applied (and when not). -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue12204> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
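The properties quoted above can be checked directly with the unicodedata module (the output depends on the Unicode database version bundled with the interpreter):

    import unicodedata

    for ch in ("\u1ff3", "\u1ffc", "\u00df"):
        print("U+%04X %-45s %s" % (ord(ch), unicodedata.name(ch),
                                   unicodedata.category(ch)))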
[issue6490] os.popen documentation in 2.6 is probably wrong
Marc-Andre Lemburg added the comment: Chris Rebert wrote: > > Chris Rebert added the comment: > > Per msg129958, attached is my stab at a patch to replace most uses of > os.popen() with the subprocess module. The test suite passes on my Mac, but > the patch does touch some specific-to-other-platform code, so further testing > is obviously needed. > This is my first non-docs patch, please be gentle. :) [Those patches were to > subprocess' docs though!] > > Stuff still using os.popen() that the patch doesn't fix: > - multiprocessing > - platform.popen() [which is itself deprecated] > - subprocess.check_output() > - Lib/test/test_poll.py > - Lib/test/test_select.py > - Lib/distutils/tests/test_cygwinccompiler.py > > Also, I suppose Issue 9382 should be marked as a dupe of this one? Thanks, but I still don't understand why os.popen() wasn't removed from the list of deprecated APIs as per Guido's message further up on the ticket. If you look at the amount of code you need to add in order to support the os.popen() functionality directly using subprocess instead of going the indirect way via the existing os.popen() wrapper around the subprocess functionality, I think this shows that the wrapper is indeed a good thing to have and something you'd otherwise implement anyway as part of standard code refactoring. So instead of applying such a patch, I think we should add back the documentation for os.popen() and remove the deprecation notices. The deprecations for os.popenN() are still fine, since those APIs are not used all that much, and I'm sure that no one can really remember what all the different versions do anyway :-) os.popen() OTOH is often used and implements a very common use: running an external command and getting the stdout results back for further processing. -- ___ Python tracker <http://bugs.python.org/issue6490> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
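The common use case the comment refers to, in both spellings ('echo hello' is just a stand-in command):

    import os
    import subprocess

    # os.popen(): run a command and read its stdout.
    out1 = os.popen("echo hello").read()

    # The same thing spelled out via subprocess.
    out2 = subprocess.check_output("echo hello", shell=True,
                                   universal_newlines=True)

    print(out1 == out2)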
[issue7511] msvc9compiler.py: ValueError: [u'path']
Marc-Andre Lemburg added the comment: mike bayer wrote: > > > > mike bayer added the comment: > > > > regarding "hey this is an MS bug not Python", projects which feature > > optional C extensions are starting to apply workarounds for the issue on > > their end (I will need to commit a specific catch for this to SQLAlchemy) - > > users need to install our software and we need to detect compilation > > failures as a sign to move on without it. I think it's preferable for > > Python distutils to work around an MS issue rather than N projects having > > to work around an MS issue exposed through distutils. Seems like this bug > > has been out there a real long time...bump ? This is not really an MS issue. Setting up the environment to be able to compile extensions is a prerequisite on most platforms and with most compilers. MS VC++ supports having multiple compiler versions on the same machine and allow compiling to x86, x64 and ia64 (at least in more recent VC++ versions). I think it's fair to ask the user to setup the environment correctly before running "python setup.py install", since distutils doesn't really know which compiler to use - e.g. you could be cross-compiling for x64 on an x86 machine, or you may want to use VC 2008 instead of a more recently installed VC 2010. Wouldn't it be better to have distutils tell the user about the possible options, instead of guessing and then possibly compiling extensions which later on don't import or import, but don't work as expected ? Regarding the latest patch: This is not the right approach, since find_vcvarsall() is supposed to return the path to the vcvarsall.bat file and not an architecture specific setup file. It is later called with the arch identifier, which the arch specific setup files don't check or use. Also note that vcvarsall.bat can take these options: x86 (default), x64, amd64, x86_amd64, ia64, x86_ia64 The x86_* options setup the cross compilers. -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue7511> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue7511] msvc9compiler.py: ValueError when trying to compile with VC Express
Marc-Andre Lemburg added the comment: Stefan Krah wrote: > > > > Stefan Krah added the comment: > > > > Marc-Andre Lemburg wrote: >> >> Wouldn't it be better to have distutils tell the user about the >> >> possible options, instead of guessing and then possibly compiling >> >> extensions which later on don't import or import, but don't work >> >> as expected ? > > > > That would be an option, yes. > > > > >> >> Regarding the latest patch: This is not the right approach, since >> >> find_vcvarsall() is supposed to return the path to the vcvarsall.bat >> >> file and not an architecture specific setup file. It is later >> >> called with the arch identifier, which the arch specific setup files >> >> don't check or use. > > > > The patch does not change anything for Visual Studio Pro. In Visual Studio > > Express (+SDK) vcvarsall.bat is broken, so the architecture specific setup > > files have to be used (they also work with a superfluous parameter). I guess what I wanted to say is that find_vcvarsall() should return None for VC Express and code using it should then revert to using a new find_vcvars() function, which takes the architecture as parameter and returns the path to the correct architecture setup file. Hacking the support into find_vcvarsall() is not the right approach. You have to add this support one level further up. >> >> Also note that vcvarsall.bat can take these options: >> >> >> >>x86 (default), x64, amd64, x86_amd64, ia64, x86_ia64 >> >> >> >> The x86_* options setup the cross compilers. > > > > I think the patch covers all architecture specific files that are > > present in the Visual Studio Express + SDK setup. Right, but it doesn't cover the ones available in VS Pro (see above), which it should for completeness. > > Visual Studio Pro is protected from all changes by checking for > > the presence of the file bin\amd64\vcvarsamd64.bat. This > > could probably be done more elegantly by using some obscure > > registry value. > > > > > > > > As Thorsten mentioned, another option would be to copy bin\vcvars64.bat > > to bin\amd64\vcvarsamd64.bat if the latter is not present. > > > > This is harmless, but it is perhaps not really the business of Python > > to mess with existing installs. Not a good idea :-) PS: Changing the title, since I keep getting the following error messages from the email interface: There were problems handling your subject line argument list: - not of form [arg=value,value,...;arg=value,value,...] Subject was: "Re: [issue7511] msvc9compiler.py: ValueError: [u'path']" -- title: msvc9compiler.py: ValueError: [u'path'] -> msvc9compiler.py: ValueError when trying to compile with VC Express ___ Python tracker <http://bugs.python.org/issue7511> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12326] Linux 3: tests should avoid using sys.platform == 'linux2'
Marc-Andre Lemburg added the comment:

Martin v. Löwis wrote:
>
> Martin v. Löwis added the comment:
>
>> The change to sys.platform=='linux' would break code even on current
>> platforms.
>
> Correct. Compared to introducing 'linux3', I consider this the better
> change - it likely breaks earlier (i.e. when porting to Python 3.3).
>
>> OTOH, we have sys.platform=='win32' even on Windows 64bit; would this
>> favor keeping 'linux2' on all versions of Linux as well?
>
> While this has better compatibility, it's also a constant source of
> irritation. Introducing 'win64' would have been a worse choice (just
> as introducing 'linux3' would: incompatibility for no gain, since
> the distinction between win32 and win64, from a Python POV, is
> irrelevant). Plus, Microsoft dislikes the term Win64 somewhat, and
> rather wants people to refer to the "Windows API".
>
> I personally disliked 'linux2' when it was introduced, for its
> incompatibilities. Anticipating that, some day, we may have 'Linux 4',
> and so on, I still claim it is better to fix this now. We could even
> come up with a 2to3 fixer for people who dual-source their code.

I think we should keep the old mechanism for determining sys.platform in place (letting it break on Linux to raise awareness) and add a new, better attribute along the lines of what Martin suggested:

    sys.system == 'linux', 'freebsd', 'openbsd', 'windows', etc. (without version)

and a new

    sys.system_info == system release info (named) tuple, similar to sys.version_info, to query a specific system version.

As already noted, direct sys.platform testing already breaks for OpenBSD, FreeBSD and probably a few other OSes as well with every major OS release, so the Linux breakage is not really new in any way.

--
___ Python tracker <http://bugs.python.org/issue12326> ___
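Until something like this exists, application code can approximate the proposed attributes with the platform module; sys.system and sys.system_info do not exist, so the helpers below are only a sketch of the idea, with SystemInfo mirroring the shape of sys.version_info:

    import platform
    from collections import namedtuple

    SystemInfo = namedtuple("SystemInfo", "major minor micro")

    def get_system():
        # Version-less system name, e.g. 'linux', 'freebsd', 'windows', 'darwin'
        return platform.system().lower()

    def get_system_info():
        # Parse the OS release string, e.g. '3.0.0-13-generic' -> SystemInfo(3, 0, 0)
        parts = []
        for field in platform.release().split(".")[:3]:
            digits = "".join(ch for ch in field if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        while len(parts) < 3:
            parts.append(0)
        return SystemInfo(*parts)

    # Code can then test the OS without tripping over kernel version bumps:
    #     if get_system() == "linux":
    #         ...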
[issue12266] str.capitalize contradicts oneself
Marc-Andre Lemburg added the comment:

I think it would be better to use this code:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

since this actually implements what the doc-string says.

Note that title case is not the same as upper case. Title case is a special case that gets applied when using a string as the title of a text and may well include characters that are lower case but which are only used in titles.

--
___ Python tracker <http://bugs.python.org/issue12266> ___
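A quick way to see that the two mappings really differ is the Latin dz digraph, which has separate upper-case and title-case forms:

    # U+01C6 has distinct upper-case (U+01C4) and title-case (U+01C5) mappings.
    import unicodedata

    ch = u"\u01c6"                       # LATIN SMALL LETTER DZ WITH CARON
    print(repr(ch.upper()))              # the upper-case digraph, U+01C4
    print(repr(ch.title()))              # the title-case digraph, U+01C5
    print(unicodedata.name(ch.title()))  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON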
[issue12266] str.capitalize contradicts oneself
Marc-Andre Lemburg added the comment:

Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

Sorry, here's the correct version:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (!Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

> This sounds fine to me, but with this approach all the uncased characters
> will go through a Py_UNICODE_TO* macro, whereas with the current code only
> the cased ones are converted. I'm not sure this matters too much though.
>
> OTOH if the non-lowercase cased chars are always either upper or titlecased,
> checking for both should be equivalent.

AFAIK, there are characters that don't have a case mapping at all. It may also be the case that a non-cased character still has a lower/upper case mapping, e.g. for typographical reasons. Someone would have to check this against the current Unicode database.

--
___ Python tracker <http://bugs.python.org/issue12266> ___
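A rough scan of the kind suggested in the last paragraph can be done with the unicodedata module; this only walks the BMP and is meant as a quick check, not an exhaustive analysis:

    # Find characters that are not cased letters according to the database but
    # still map to something different under upper()/lower().
    import unicodedata

    hits = []
    for cp in range(0x10000):
        ch = chr(cp)  # use unichr() on Python 2
        cat = unicodedata.category(ch)
        if cat in ("Lu", "Ll", "Lt"):
            continue  # cased letters are not interesting here
        if ch.upper() != ch or ch.lower() != ch:
            hits.append((hex(cp), cat, unicodedata.name(ch, "<unnamed>")))

    for hit in hits[:20]:
        print(hit)   # e.g. combining marks and enclosed letters show up here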
[issue9528] Add pure Python implementation of time module to CPython
Marc-Andre Lemburg added the comment:

Alan Justino wrote:
>
> I am having a hard time trying to do some BDD with the C-based datetime because
> I cannot mock it easily to force datetime.datetime.now() to return a desired
> value, making it almost impossible to test time-based code, like the accounting
> system that I am refactoring right now.

It's usually better to use a central helper get_current_time() in the application than to use datetime.now() and others directly.

--
___ Python tracker <http://bugs.python.org/issue9528> ___
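A minimal sketch of that indirection; the module and helper names (clock.py, get_current_time) are illustrative, not something the stdlib provides:

    # clock.py
    import datetime

    def get_current_time():
        """The one place where the application asks for 'now'."""
        return datetime.datetime.now()

    # In a test, the helper can simply be swapped out, without touching the
    # C-implemented datetime type:
    #
    #     import clock, datetime
    #     fixed = datetime.datetime(2011, 8, 31, 23, 59, 0)
    #     original, clock.get_current_time = clock.get_current_time, lambda: fixed
    #     try:
    #         ...  # exercise the time-based code
    #     finally:
    #         clock.get_current_time = original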
[issue2857] add codec for java modified utf-8
Marc-Andre Lemburg added the comment:

Tom Christiansen wrote:
>
> Tom Christiansen added the comment:
>
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
>
> http://unicode.org/reports/tr26/
>
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called
> UTF-8. It is a real pain in the butt, caused by people who misunderstand
> Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need
> to be able to read it, but call it what it is, please.
>
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real
> UTF-8, not CESU-8.

CESU-8 is a different encoding than the one we are talking about. The only difference between UTF-8 and the modified one is the different encoding for the U+0000 code point, so that the output does not contain any NUL bytes.

--
___ Python tracker <http://bugs.python.org/issue2857> ___
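The difference is easy to show: standard UTF-8 encodes U+0000 as a single NUL byte, while the Java-modified form uses the overlong two-byte sequence 0xC0 0x80. The encode_modified() helper below is a throwaway illustration, not a codec that ships with Python:

    text = u"a\x00b"

    print(text.encode("utf-8"))          # b'a\x00b' -- contains a NUL byte

    def encode_modified(s):
        # Encode each character as UTF-8, but map U+0000 to the 0xC0 0x80 form.
        return b"".join(
            b"\xc0\x80" if ch == u"\x00" else ch.encode("utf-8")
            for ch in s
        )

    print(encode_modified(text))         # b'a\xc0\x80b' -- no NUL bytes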
[issue2857] Add "java modified utf-8" codec
Marc-Andre Lemburg added the comment:

Corrected the title again. See my comment.

--
title: Add CESU-8 codec ("java modified utf-8") -> Add "java modified utf-8" codec
versions: +Python 3.3 -Python 2.7, Python 3.2

___ Python tracker <http://bugs.python.org/issue2857> ___
[issue2857] Add "java modified utf-8" codec
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
>
> Corrected the title again. See my comment.

Please open a new ticket if you want to add a CESU-8 codec.

Looking at the relevant use cases, I'm at most +0 on adding the modified UTF-8 codec. I think such codecs can well live outside the stdlib on PyPI.

--
___ Python tracker <http://bugs.python.org/issue2857> ___
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Marc-Andre Lemburg added the comment:

> Keep in mind that we should be able to access and use lone surrogates too,
> therefore:
>
> s = '\ud800'  # should be valid
> len(s)        # should this raise an error? (or return 0.5 ;)?
> s[0]          # error here too?
> list(s)       # here too?
>
> p = s + '\udc00'
> len(p)        # 1?
> p[0]          # '\U00010000' ?
> p[1]          # IndexError?
> list(p + 'a') # ['\ud800\udc00', 'a']?
>
> We can still decide that strings with lone surrogates work only with a
> limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
>
> Another thing I noticed is that (at least on wide builds) surrogate pairs are
> not joined "on the fly":
>
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1

Hi Tom, welcome to Python land :-) Here's some more background information on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in Unicode objects and ones encoded in transfer formats by codecs. We generally do allow lone surrogates, unassigned code points, lone combining code points, etc. in Unicode objects, since Python needs to be able to work on all Unicode code points and build strings with them.

The transfer format codecs do try to combine surrogates on decoding data on UCS4 builds. On UCS2 builds they create surrogate pairs as necessary. On output, those pairs will again be joined to get round-trip safety.

It helps if you think of Python's Unicode objects as using UCS2 and UCS4 instead of UTF-16/32. Python does try to make working with UCS2 easy and in many cases behaves as if it were using UTF-16 internally, but there are, of course, limits to this. In practice, you only rarely get to see any of these special cases, since non-BMP code points are usually not found in everyday use. If they do become a problem for you, you have the option of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the terminology of those versions, some of which has changed in more recent versions of Unicode.

For more background information, you might want to take a look at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened: you'll also find that collation and compression were already on the plate back then, but since no one stepped forward, they weren't implemented.

Cheers,
--
Marc-Andre Lemburg
eGenix.com

2011-10-04: PyCon DE 2011, Leipzig, Germany ... 50 days to go

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

--
nosy: +lemburg
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug

___ Python tracker <http://bugs.python.org/issue12729> ___
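For reference, this is how the build difference described above shows up on the narrow and wide interpreters of that era; sys.maxunicode tells the two builds apart:

    import sys

    print(sys.maxunicode)   # 65535 on a narrow (UCS2) build, 1114111 on a wide (UCS4) build

    s = u"\U00010000"       # a single non-BMP code point
    print(len(s))           # 2 on a narrow build (stored as a surrogate pair), 1 on a wide build
    print(repr(s[0]))       # u'\ud800' on a narrow build, the full character on a wide build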
[issue12752] locale.normalize does not take unicode strings
Marc-Andre Lemburg added the comment:

Julian Taylor wrote:
>
> New submission from Julian Taylor:
>
> using unicode strings for locale.normalize gives the following traceback with
> python2.7:
>
> ~$ python2.7 -c 'import locale; locale.normalize(u"en_US")'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/locale.py", line 358, in normalize
>     fullname = localename.translate(_ascii_lower_map)
> TypeError: character mapping must return integer, None or unicode
>
> with python2.6 it works and it also works with non-unicode strings in 2.7

This looks like a side-effect of the change Antoine made to the locale module when trying to make the case mapping work in a locale-independent way.

--
nosy: +lemburg

___ Python tracker <http://bugs.python.org/issue12752> ___
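Until the regression is fixed, a caller on 2.7 can sidestep it by handing locale.normalize() a byte string; this is a sketch of a workaround at the call site, not a fix for the module itself:

    import locale

    name = u"en_US"
    normalized = locale.normalize(name.encode("ascii"))  # byte strings still work on 2.7
    print(normalized)                                     # e.g. 'en_US.ISO8859-1'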