[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 00:49, Erlend E. Aasland wrote:
> I see that PEP 249 does not define the semantics of lastrowid for
> executemany(). What's the precedence here, MAL? IMO, it would be nice to be
> able to fetch the last row id after you've done a 1000 inserts using
> executemany().

The DB-API deliberately leaves this undefined, since there are many ways you
could implement this, e.g.

- return the last row id for the last entry in the array passed to .executemany()
- return the last row id of the last actually modified/inserted row after running .executemany()
- return an array of row ids, one for each row modified/inserted
- return a row id of one of the modified/inserted rows, without defining which
- always return None for .executemany()

Note that in some cases, the order of actions taken by the database is not
predefined (e.g. some databases run such inserts in chunks across a cluster),
so even the "last" semantics are not clear.

> So, another option would be to keep "set-lastrowid" in the query loop, and
> just remove the condition; we set it every time (but of course only build a
> PyObject of it when the loop is done).

Since the DB-API leaves this undefined, you are free to add your own meaning
which makes most sense for SQLite, to always return None, or to not implement
it at all. DB-API extensions such as Cursor.lastrowid are optional and don't
need to be implemented if they don't make sense for a particular use case:

https://www.python.org/dev/peps/pep-0249/#optional-db-api-extensions
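To make the practical consequence concrete, here is a minimal sketch using the
stdlib sqlite3 module purely as an example DB-API module: whatever value
lastrowid holds after executemany() is module-specific and should not be
relied upon by portable code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t (x) VALUES (?)", [(1,), (2,), (3,)])
# PEP 249 leaves lastrowid undefined for executemany(); the value seen here
# is specific to the sqlite3 module and may differ in other DB-API modules.
print(cur.lastrowid)
```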
[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 11:02, Erlend E. Aasland wrote:
> Thank you for your input Marc-André.
>
> For SQLite, it's pretty simple: we use an API called
> sqlite3_last_insert_rowid() which takes the database connection as it's
> argument, not a statement pointer. This function returns "the rowid of the
> most recent successful INSERT into a rowid table or virtual table on
> database connection" (quote from SQLite docs). IMO, it would make sense to
> also use this post executemany().

Sounds like a plan.

If possible, it's usually better to have the .executemany() create a cursor
with an output array providing the row ids, e.g. using "INSERT ...
RETURNING ..." (PostgreSQL). That way you can access all row ids and can also
provide the needed detail in case the INSERTs happen out of order to map them
to the input data.

For cases where you don't need sequence IDs, it's often better to not rely on
auto-increment columns for IDs, but instead use random pre-generated IDs.
Saves roundtrips to the database and works nicely with cluster databases as
well.
[issue46249] [sqlite3] move set lastrowid out of the query loop
Marc-Andre Lemburg added the comment:

On 04.01.2022 21:02, Erlend E. Aasland wrote:
>> If possible, it's usually better to have the .executemany() create a
>> cursor with an output array providing the row ids, e.g. using "INSERT ...
>> RETURNING ..." (PostgreSQL). That way you can access all row ids and
>> can also provide the needed detail in case the INSERTs happen out of
>> order to map them to the input data.
>
> Hm, maybe so. But it will add a lot of overhead and complexity to
> executemany(), and there haven't been requests for this feature for sqlite3.
> AFAIK, there hasn't been request for lastrowid for executemany() at all.
> OTOH, my proposal of modifying lastrowid to always show the rowid of the
> actual last inserted row is a very cheap operation, _and_ it simplifies the
> code (=> increased maintainability), so I think I'll go for that.

Sorry, I wasn't suggesting this for SQLite; it's just a better and more
flexible option than using cursor.lastrowid where available. Sadly, the PG
extension is not standards-conforming SQL.

>> For cases where you don't need sequence IDs, it's often better to
>> not rely on auto-increment columns for IDs, but instead use random
>> pre-generated IDs. Saves roundtrips to the database and works nicely
>> with cluster databases as well.
>
> Yes, but in those cases you keep track of the row id yourself, so you
> probably won't need lastrowid ;)

Indeed, and that's the point :-) Those auto-increment ID fields are not
really such a good idea to begin with. Either you know that you will need to
manipulate the rows after inserting them (in which case you can set an ID) or
you don't care about the individual rows and only want to aggregate or search
in them based on other fields.
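As an illustration of the pre-generated ID approach (a sketch, not tied to
any particular schema; the table and column names are made up):

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")

# Generate the IDs client-side; no round-trip or lastrowid needed afterwards.
rows = [(uuid.uuid4().hex, name) for name in ("alpha", "beta", "gamma")]
con.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
con.commit()
```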
[issue12756] datetime.datetime.utcnow should return a UTC timestamp
Marc-Andre Lemburg added the comment:

Hi Tony,

from practical experience, it is a whole lot better to not deal with
timezones in data processing code at all, but instead only use naive UTC
datetime values everywhere, except when you have to prepare reports or output
which has a requirement to show datetime values in local time or some
specific timezone.

You convert all datetime values into UTC upon input, possibly store the
timezone somewhere, if that is relevant for later reporting, and then forget
about timezones. Your code will run faster, become a lot easier to understand
and you avoid many pitfalls that TZs have, esp. when TZs are silently dropped
when interfacing to e.g. numeric code, databases or other external code.

There's a reason why cloud code (and a lot of other code, such as data
science code) has standardized on UTC :-)

Cheers,
--
Marc-Andre Lemburg
eGenix.com

--
nosy: +lemburg
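A minimal sketch of that convention (normalize at the input boundary, keep
naive UTC internally; the helper name is only illustrative):

```python
from datetime import datetime, timedelta, timezone

def to_naive_utc(dt):
    """Normalize an incoming datetime to naive UTC for internal processing."""
    if dt.tzinfo is None:
        return dt  # assumed to already be naive UTC
    return dt.astimezone(timezone.utc).replace(tzinfo=None)

# Input boundary: the source timezone is known here and can be stored
# separately if it is needed for reporting later on.
incoming = datetime(2022, 1, 10, 9, 30, tzinfo=timezone(timedelta(hours=1)))
internal = to_naive_utc(incoming)   # datetime(2022, 1, 10, 8, 30)
```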
[issue46338] libc_ver() runtime error when sys.executable is empty
Marc-Andre Lemburg added the comment:

On 10.01.2022 23:01, Allie Hammond wrote:
> libc_ver() in platform.py (called from platform()) causes a runtime error if
> sys.executable returns null. In my case, FreeRADIUS offers a module
> rlm_python3 which allows you to run python code from the C based FreeRADIUS
> server - since this module doesn't use a python binary to execute
> sys.executable returns null trigering this error.

Interesting. I guess rlm_python3 embeds Python.

Is sys.executable an empty string or None ?
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 08.01.2022 21:56, Erlend E. Aasland wrote:
> Marc-André: since Python 3.6, the sqlite3.Cursor.lastrowid attribute does no
> longer comply with the recommendations of PEP 249:
>
> Previously, lastrowid was set to None for operations other than INSERT or
> REPLACE. This changed with ab994ed8b97e1b0dac151ec827c857f5e7277565 (in
> Python 3.6), so that lastrowid is _unchanged_ for operations other than
> INSERT or REPLACE, and it is set to 0 after the first valid SQL (that is not
> INSERT/REPLACE) is executed on the cursor.
>
> Now, PEP 249 only _recommends_ that lastrowid is set to None for operations
> that do not modify a row, so it's probably not a big deal. No-one has ever
> mentioned this change in behaviour; there have been no bug reports.
>
> FTR, here is the relevant quote from PEP 249:
>
> If the operation does not set a rowid or if the database does not support
> rowids, this attribute should be set to None.
>
> (I interpret "should" as understood by RFC 2119.)

Well, it may be a little stronger than the SHOULD in the RFC, but then again
the whole DB-API is about conventions and if they don't make sense for a
database backend, it is possible to deviate from the spec, esp. for optional
extensions such as .lastrowid.

> So, my follow-up question becomes:
> I see no point in reverting to pre Python 3.6 behaviour. I would rather
> change the default value to be 0 (to get rid of the dirty flag in GH-30380),
> and to make the behaviour more consistent with how the actual SQLite API
> behaves.
>
> Do you have an opinion about such a change (in behaviour)?

Is 0 a valid row ID in SQLite ? If not, then I guess this would be an
alternative to None as suggested by the DB-API.

If it is a valid row ID, I'd suggest to go back to resetting to None, since
otherwise code might get confused: if an UPDATE does not get applied (e.g. a
condition is false), code could then still take .lastrowid as referring to
the UPDATE and not a previous operation, since code will not know whether the
condition was met or not.

--
Marc-Andre Lemburg
eGenix.com

--
title: [sqlite3] lastrowid improvements -> [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 11.01.2022 20:46, Erlend E. Aasland wrote:
> If we are to revert to this behaviour, we'll have to start examining the SQL
> we are given (search for INSERT and REPLACE keywords, determine if they are
> valid (i.e. not a comment, not part of a column or table name, etc.), which
> will lead to a noticeable performance hit for every new statement (not for
> statements reused via the LRU cache though). I'm not sure this is a good
> idea. However I will give it a good thought.
>
> My first thought now, is that it would be better for the sqlite3 module to
> align lastrowid with the behaviour of the C API sqlite3_last_insert_rowid()
> (also available as an SQL function: last_insert_rowid). OTOH, the SQLite API
> is tied to the _connection_ object, so it may not make sense to align it
> with lastrowid which is a _cursor_ attribute.

I've had a look at the API description and find it less than useful, to be
honest:

https://sqlite.org/c3ref/last_insert_rowid.html

You don't know on which cursor the last row was inserted, it's possible that
this was or is done by a trigger, and the last row id is not updated in case
the INSERT does not succeed for some reason, leaving it unchanged - without
the user getting a notification of this failure, since the .execute() call
itself will succeed for e.g. "INSERT INTO table SELECT ...;". It also seems
that the function really only works for INSERTs and not for UPDATEs.

> Perhaps the Right Thing To Do™ is to be conservative and just leave it as it
> is. I still want to apply the optimisation, though. It does not alter the
> behaviour in any kind of way, and it speeds up executemany().

I'd suggest to deprecate the cursor.lastrowid attribute and instead point
people to the much more useful

"INSERT INTO t (name) VALUES ('two'), ('three') RETURNING ROWID;"

https://sqlite.org/lang_insert.html
https://sqlite.org/forum/forumpost/058ac49cc3

(good to know that SQLite has adopted this PostgreSQL variant as well)

RETURNING is also available for UPDATEs:

https://sqlite.org/lang_update.html

If people really want to use the sqlite3_last_insert_rowid() functionality,
they can use the SQL function of the same name:

https://www.sqlite.org/lang_corefunc.html#last_insert_rowid

which then has known semantics and doesn't conflict with the DB-API specs.

But this is your call :-)
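A small sketch of the RETURNING variant through the sqlite3 module; this
assumes the underlying SQLite library is 3.35 or newer, where RETURNING was
introduced:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")

# Fetch the generated rowids directly from the INSERT statement.
cur = con.execute(
    "INSERT INTO t (name) VALUES ('two'), ('three') RETURNING rowid"
)
print(cur.fetchall())        # e.g. [(1,), (2,)]

# The SQL-level last_insert_rowid() function, with its connection-wide semantics.
print(con.execute("SELECT last_insert_rowid()").fetchone())
```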
[issue46249] [sqlite3] move set lastrowid out of the query loop and enable it for executemany()
Marc-Andre Lemburg added the comment:

On 11.01.2022 21:30, Erlend E. Aasland wrote:
>> I'd suggest to deprecate the cursor.lastrowid attribute and
>> instead point people to the much more useful [...]
>
> Yes, I think mentioning the RETURNING ROWID trick in the sqlite3 docs is a
> very nice improvement. Mentioning the last_insert_rowid SQL function is
> probably also worth consideration.
>
> I'm reluctant to deprecate cursor.lastrowid, though. ATM, I'm leaning
> towards just keeping the current behaviour.

Fair enough :-)

Perhaps just documenting that the value is not necessarily what people may
expect, when coming from other databases, due to the different semantics with
SQLite, is enough.

--
Marc-Andre Lemburg
eGenix.com
[issue45382] platform() is not able to detect windows 11
Marc-Andre Lemburg added the comment:

On 26.01.2022 01:29, Eryk Sun wrote:
>> Bit wmic seems nice solution.
>> Is still working for windows lower than 11?
>
> wmic.exe is still included in Windows 10 and 11, but it's officially
> deprecated [1], which means it's no longer being actively developed, and it
> might be removed in a future update. PowerShell is the preferred way to use
> WMI.

All of these require shelling out to the OS, so why not stick to `ver` as
we've done in the past. This has existed for ages and will most likely not
disappear anytime soon.

Is there a good reason to prefer wmic or PowerShell (which are less likely to
be available or reliable) ?

> ---
> [1] https://docs.microsoft.com/en-us/windows/deployment/planning/windows-10-deprecated-features

--
Marc-Andre Lemburg
eGenix.com
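For reference, a rough sketch of what shelling out to ver amounts to
(Windows-only; ver is a cmd.exe builtin, hence shell=True, and the sample
output format is only indicative):

```python
import subprocess

# Windows only: "ver" is built into cmd.exe, so it has to run through a shell.
banner = subprocess.check_output("ver", shell=True, text=True).strip()
print(banner)   # e.g. "Microsoft Windows [Version 10.0.xxxxx.yyy]"
```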
[issue46659] Deprecate locale.getdefaultlocale() function
Marc-Andre Lemburg added the comment:

> For these reasons, I propose to deprecate locale.getdefaultlocale():
> setlocale(), getpreferredencoding() and getlocale() should be used instead.

Please see the discussion on https://bugs.python.org/issue43552:
locale.getpreferredencoding() needs to be deprecated as well. Instead we
should have a single locale.getencoding() as outlined there... perhaps in a
separate ticket ?!

Thanks.

--
nosy: +lemburg
[issue46662] Lib/sqlite3/dbapi2.py: convert_timestamp function failed to correctly parse timestamp
Marc-Andre Lemburg added the comment:

On 08.02.2022 11:54, Erlend E. Aasland wrote:
> The sqlite3 timestamp converter is buggy, as already noted in the docs[^1].
> Adding timezone support is out of the question[^2][^3][^4][^5], but fixing
> it to be able to discard any attached timezone info _may_ be ok; at first
> sight, I don't see how this could break existing applications (like, for
> example adding time zone support could do). I need to think it through.

I think it's better to deprecate these converters and let users implement
their own.
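A sketch of what such a user-defined converter could look like (the converter
name and behaviour here are only illustrative; this one parses ISO 8601 and
normalizes any attached offset to naive UTC):

```python
import sqlite3
from datetime import datetime, timezone

def my_timestamp_converter(raw: bytes) -> datetime:
    # Parse an ISO 8601 value; normalize any attached offset to naive UTC.
    dt = datetime.fromisoformat(raw.decode())
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt

sqlite3.register_converter("timestamp", my_timestamp_converter)
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE log (ts timestamp)")
con.execute("INSERT INTO log VALUES ('2022-02-08 11:54:00+01:00')")
print(con.execute("SELECT ts FROM log").fetchone()[0])  # 2022-02-08 10:54:00
```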
[issue46659] Deprecate locale.getdefaultlocale() function
Marc-Andre Lemburg added the comment: Thanks, Victor.
[issue1771381] bsddb can't use unicode keys
Marc-Andre Lemburg added the comment: Unassigning since I don't know the details of bsddb. -- assignee: lemburg ->
[issue225476] Codec naming scheme and aliasing support
Marc-Andre Lemburg added the comment: Closing this request as the encodings package search function should not be used to import external codecs (this poses a security risk). -- status: open -> closed
[issue880951] "ez" format code for ParseTuple()
Marc-Andre Lemburg added the comment: Closing. There doesn't seem to be much interest in this. -- status: open -> closed
[issue1001895] Adding missing ISO 8859 codecs, especially Thai
Marc-Andre Lemburg added the comment: Not sure why this is still open. The patches were checked in a long time ago. -- status: open -> closed
[issue547537] cStringIO should provide a binary option
Marc-Andre Lemburg added the comment: Unassigning: I've never had a need for this in the past years. -- assignee: lemburg ->
[issue883466] quopri encoding & Unicode
Marc-Andre Lemburg added the comment: Georg: Yes, sure.
[issue1528802] Turkish Character
Marc-Andre Lemburg added the comment:

Unassigning this. Unless someone provides a patch to add context sensitivity
to the Unicode upper/lower conversions, I don't think anything will change.

The mapping you see in Python (for Unicode) is taken straight from the
Unicode database and there's nothing we can or want to do to change those
predefined mappings. The 8-bit string mappings OTOH are taken from the
underlying C library - again nothing we can change.

--
assignee: lemburg ->
[issue1071] unicode.translate() doesn't error out on invalid translation table
Marc-Andre Lemburg added the comment: Nice idea, but why don't you use a dictionary iterator (PyDict_Next()) for the fixup ?
[issue1071] unicode.translate() doesn't error out on invalid translation table
Marc-Andre Lemburg added the comment: Ah, I hadn't noticed that you're actually manipulating the input dictionary. You should create a copy and fix that instead of changing the dict that the user passed in to the function. You can then use PyDict_Next() for fast iteration over the original dictionary.
[issue1505257] winerror module
Marc-Andre Lemburg added the comment:

The winerror module should really be coded in C. Otherwise you don't benefit
from the lookup object approach. The files I uploaded only serve as a basis
for such a C module.

Would be great if you could find someone to write such a module - preferably
using a generator that creates it from the header files.

I'm not interested in this anymore, so feel free to drop the whole idea.
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment:

A couple of notes:

* platform.uname() needs to be fixed, not the individual query functions.
* The third entry of uname() should return "Vista" instead of "Microsoft" on MS Vista.
* A patch should go on trunk and into 2.5.2, since this is a real bug and not a feature change.

Any other changes to accommodate differences between used marketing names and
underlying OS names should go into system_alias().
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment: Yes, please. Thanks. -- assignee: lemburg -> jafo
[issue1082] platform system may be Windows or Microsoft since Vista
Marc-Andre Lemburg added the comment:

Pat, we already have system_alias() for exactly the purpose you suggested.

Software relying on platform.system() reporting "Vista" will have to use
Python 2.5.2 as the minimum Python version requirement - pretty much the same
as with all other bug fixes.
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Marc-Andre Lemburg added the comment:

My name appears in that Makefile because I wrote it and used it to create the
charmap codecs.

The reason why the Mac Japanese codec was not created for 2.x was the size of
the mapping table. Ideal would be to have the C version of the CJK codecs
support the Mac Japanese encoding as well. Adding back the charmap based Mac
Japanese codec would be a compromise.

The absence of the Mac Japanese codec causes (obvious) problems for many
Japanese Python users running Mac OS X.
[issue1276] LookupError: unknown encoding: X-MAC-JAPANESE
Marc-Andre Lemburg added the comment: Adding Python 2.6 as version target. -- versions: +Python 2.6
[issue1399] XML codec
Marc-Andre Lemburg added the comment:

Nice codec ! The only nit I have is the name: "xml" isn't intuitive enough. I
had to read the code to figure out what the codec actually does. "xml" used
as an encoding name usually refers to having Unicode text converted to ASCII
with XML entity escapes for all non-ASCII characters.

How about "xml-auto-detect" or something along those lines ?!

--
nosy: +lemburg
[issue1399] XML codec
Marc-Andre Lemburg added the comment:

Leaving the module name as "xml" would remove that name from the namespace of
possible encodings. "xml" as encoding name is problematic, as many people
regard writing data in XML as "encoding the data in XML".

I'd simply not use it at all, not even for a codec that converts between
Unicode and ASCII+XML entities.
[issue1399] XML codec
Marc-Andre Lemburg added the comment: Thanks, Walter !
[issue1234] semaphore errors on AIX 5.2
Marc-Andre Lemburg added the comment: The problem is also present in Python 2.4 and 2.3. Confirmed on AIX 5.3. -- nosy: +lemburg versions: +Python 2.3, Python 2.4
[issue1433] marshal roundtripping for unicode
Marc-Andre Lemburg added the comment:

I think you have a wrong understanding of round-tripping. In Unicode it is
really irrelevant if you're using a UCS2 surrogate pair or a UCS4
representation to describe a code point. The length of the Unicode
representation may change, but the meaning won't, so you don't lose any
information.

--
nosy: +lemburg
[issue1620174] Improve platform.py usability on Windows
Marc-Andre Lemburg added the comment: Rejecting the patch, since it hasn't been updated. -- resolution: -> rejected status: open -> closed
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Marc-Andre Lemburg added the comment:

Tom Christiansen wrote:
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
>
> • Noncharacter code points are reserved for internal use, such as for
>   sentinel values. They should never be interchanged. They do, however,
>   have well-formed representations in Unicode encoding forms and survive
>   conversions between encoding forms. This allows sentinel values to be
>   preserved internally across Unicode encoding forms, even though they are
>   not designed to be used in open interchange.
>
> And here from the Unicode Standard's chapter on Conformance, section 3.2,
> p. 59:
>
> C2 A process shall not interpret a noncharacter code point as an
>    abstract character.
>
> • The noncharacter code points may be used internally, such as for
>   sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's up to
the applications to conform to Unicode or not and the application also
defines what "exchange" means in the above context.

Python itself needs to be able to deal with assigned non-character code
points as well as unassigned code points or code points that are part of
special ranges such as the surrogate ranges.

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because we have a
way to optionally allow these via an error handler, but -1 on making changes
that cause full range round-trip safety of the UTF encodings to be lost
without a way to turn the functionality back on.

--
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
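A quick Python 3 illustration of the error-handler escape hatch mentioned
above (a sketch; surrogatepass is the stdlib handler intended for exactly
this purpose):

```python
# Strict UTF-8 rejects lone surrogates ...
s = "\ud800"                      # a lone surrogate code point
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    pass                          # rejected by the strict codec

# ... but the optional error handler keeps full-range round-trips possible.
data = s.encode("utf-8", "surrogatepass")   # b'\xed\xa0\x80'
assert data.decode("utf-8", "surrogatepass") == s
```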
[issue12508] Codecs Anomaly
Marc-Andre Lemburg added the comment:

The final parameter is an extension to the decoder API signature, so it's not
surprising that not all codecs implement it. The ones that do should use it
for all calls, since that way the actual consumed number of bytes is
correctly reported back to the StreamReader instance.

Note: The parameter name "final" is a bit misleading. What happens is that
the number of bytes consumed by the decoder were previously always reported
as len(buffer), since the C API for decoders did not provide a way to report
back the number of bytes consumed. This was changed when stateful decoders
were added to the C API, since these do allow reporting back the consumed
bytes. A more appropriate name for the parameter would have been
"report_bytes_consumed".
[issue13136] speed-up conversion between unicode widths
Marc-Andre Lemburg added the comment:

Antoine Pitrou wrote:
> New submission from Antoine Pitrou :
>
> This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
>
> Example micro-benchmark:
>
> ./python -m timeit -s "a='x'*1;b='\u0102'*1000;c='\U0010'" "a+b+c"
>
> -> before:
> 10 loops, best of 3: 14.9 usec per loop
> -> after:
> 10 loops, best of 3: 9.19 usec per loop

Before going further with this, I'd suggest you have a look at your compiler
settings. Such optimizations are normally performed by the compiler and don't
need to be implemented in C, making maintenance harder.

The fact that Windows doesn't exhibit the same performance difference
suggests that the optimizer is not using the same level or feature set as on
Linux. MSVC is at least as good at optimizing code as gcc, often better.

I tested using memchr() when writing those "naive" loops. It turned out that
using memchr() was slower than using the direct loops. memchr() is inlined by
the compiler just like the direct loop and the generated code for the direct
version is often easier to optimize for the compiler than the memchr() one,
since it receives more knowledge about the used data types.

--
nosy: +lemburg
[issue13134] speed up finding of one-character strings
Marc-Andre Lemburg added the comment:

[Posted the reply to the right ticket; see issue13136 for the original post
to the wrong ticket]

Antoine Pitrou wrote:
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c

Which gcc version are you using ? Is it possible that you have -fno-builtin
enabled ?

>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automically.

When using gcc, the glibc functions are usually not used at all, since gcc
comes with a (rather large) set of builtins which are inlined directly, if
you have optimizations enabled and inlining is found to be more efficient
than calling the glibc function:

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

glibc includes the optimized versions since it has to implement the C library
(obviously) and for cases where inlining does not happen.

>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.
>
>> memchr()
>> is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

See http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html and the assembler
output. If it's not inlined, then something must be preventing this and it
would be good to find out why.

>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

There is: the compiler will have alignment information available and can also
benefit from using registers instead of the stack, knowledge about processor
cache lines, etc. Such information is lost when calling a function. The
function call itself will also create some overhead.

BTW: You should not only test the optimization with long strings, but also
with short ones (e.g. 2-15 chars) - which is a much more common case in
practice.

--
nosy: +lemburg
[issue13136] speed-up conversion between unicode widths
Marc-Andre Lemburg added the comment:

Antoine Pitrou wrote:
>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.

Looks like I posted the comment to the wrong ticket.
[issue12619] Automatically regenerate platform-specific modules
Marc-Andre Lemburg added the comment:

I don't see why these modules should be auto-generated. The constants in the
modules hardly ever change and are also not affected by architecture
differences (e.g. Mac OS X, Solaris, etc.) AFAICT.

If you think they need to be auto-generated, you should make a case by
example.

Note that we cannot simply drop the modules. Some of the constants are needed
for e.g. socket, ctypes or ldd programming.
[issue12619] Automatically regenerate platform-specific modules
Marc-Andre Lemburg added the comment:

STINNER Victor wrote:
>> you should make a case by example
>
> Did you read comments of this issue and my email thread on python-dev?

No.

> There are differents examples:
>
> - LONG_MAX is 9223372036854775807 even on 32 bits system
> - On Mac OS X, FAT programs contains 32 and 64 binaries, whereas constants
>   are changed for 32 or 64 bits

That's because the h2py.py script doesn't know anything about conditional
definitions, e.g.

PTRDIFF_MIN = (-9223372036854775807-1)
PTRDIFF_MAX = (9223372036854775807)
PTRDIFF_MIN = (-2147483647-1)
PTRDIFF_MAX = (2147483647)

Not all constants are really useful because of this, but some are, e.g.
INT16_MAX, INT32_MAX, etc.

>> we cannot simply drop the modules. Some of the constants
>> are needed for e.g. socket, ctypes or ldd programming.
>
> Ah? I removed all plat-* directories and ran the full test suite: no failure.

Right, the modules are not tested at all, AFAIK.

> The Python socket modules contain many constants (SOCK_*, AF_*, SO_*, ...):
> http://docs.python.org/library/socket.html#socket.AF_UNIX

True, but you probably agree that it's easier to parse a header file into a
Python module than to add each and every socket option on the planet to the C
socket module, don't you ? :-)

> Which constants are used by the ctypes modules or can be used by modules
> using ctypes? Can you give examples? I listed usages of plat-* modules in
> the first message of my thread on python-dev.

Not constants used by the ctypes, but constants which can be used with the
ctypes module.

> By "ldd", you mean "ld.so" (dlopen)?

Yes.

> Yes, I agree that we need to expose dl constants. But the other constants
> are not used.

Not in the standard lib, that's true.

Also note that the plat-* directories can contain platform specific code,
e.g. the OS2 dirs have replacements for the pwd and grp modules.

Finally, the list of standard files to include in those directories could be
extended to cover more system level constants such as the ioctl or fcntl
constants (not only the few defined in the C code, but all platform specific
ones).
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
> The error comes from the way Python computes timezone and daylight: it
> queries the tm_gmtoff of two timestamps, one close to the first of January,
> the other close to the first of July. But last January the previous
> definition of the timezone was still in force... and indeed, when I changed
> the code to use *next* January instead, I have the expected values.
>
> Is there an algorithm that gives the correct answer? Taking the 1st of
> January closest to the current date would not work either. Or is there
> another way (in portable C) to approach timezones?

A fairly "correct" way is to query the time zone database at time module
import time by using the DST and GMT offset of that time.

IMO time.timezone and time.daylight should be deprecated since they will give
wrong results around DST changes (both switch times and legal changes such as
the described one) in long running processes such as daemons.
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
>> A fairly "correct" way is to query the time zone database at time module
>> import time by using the DST and GMT offset of that time.
>
> But that does not give the *other* timezone :-(

Which other timezone ? You set time.timezone to the GMT offset of the import
time and then subtract another 3600 seconds in case tm_isdst is set.

>> IMO time.timezone and time.daylight should be deprecated since they
>> will give wrong results around DST changes (both switch times and
>> legal changes such as the described one) in long running processes
>> such as daemons.
>
> time.timezone is the non-DST timezone: this value does not change around
> the DST change date.

No, but time.daylight changes and time.timezone can change in situations like
these where a region decides to change the way DST is dealt with, e.g.
switches to the DST timezone or moves the switchover date. Since both values
are tied to a specific time I don't think it's a good idea to have them as
module globals.

> That's why the current implementation uses "absolute" dates like the 1st of
> January: DST changes are often in March and October.

Such an algorithm can be used as fallback solution in case tm_isdst is -1
(unknown), but not in case the DST information is available.

> What about this algorithm:
> - pick the first of January and the first of July surrounding the current
>   date
> - if both have tm_idst==0, the region has no DST. Use the current GMT
>   offset for both timezone and altzone; daylight=0

Those two steps are not necessary. If tm_isdst == 0, you already know that
the current time zone is not DST.

> - otherwise, use the *current* time and get its DST and GMT offset. This is
>   enough to compute both timezone and altzone (with the relation
>   altzone=timezone-3600)

That's what I suggested above.
[issue13466] new timezones
Marc-Andre Lemburg added the comment:

Amaury Forgeot d'Arc wrote:
>>> But that does not give the *other* timezone :-(
>> Which other timezone ?
> I meant the other timezone *name*.
>
> I think we don't understand each other:
> - time.timezone is the offset of the local (non-DST) timezone.
> - time.altzone is the offset of local DST timezone.

Yes, I know.

> They don't depend on the current date, they depend only on the timezone
> database.
> localtime() only gives the timezone for a given point in time, and the time
> module needs to present two timezones.

Right, but they should only depend on the data in the timezone database at
the time of import of the module and not determine the values by looking at
specific dates in the past.

The only problem is finding out whether the locale uses DST in case the
current import time points to a non-DST time. This can be checked by looking
at Jan 1st and June 1st after the current import time (i.e. in the future)
and then testing tm_isdst. If there is a DST change, then you set
time.altzone = time.timezone - 3600. Otherwise, you set
time.altzone = time.timezone.
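A rough sketch of that import-time approach (this assumes a platform where
struct_time exposes tm_gmtoff; the helper name is made up for illustration):

```python
import time

def probe_offsets():
    """Derive timezone/altzone style values from the current point in time."""
    now = time.localtime()
    if now.tm_isdst > 0:
        altzone = -now.tm_gmtoff            # DST offset is in effect right now
        tz = altzone + 3600
    else:
        tz = -now.tm_gmtoff                 # non-DST offset
        # Look ahead (Jan 1st / Jun 1st of next year) to see whether DST exists.
        year = now.tm_year + 1
        has_dst = False
        for month in (1, 6):
            probe = time.mktime((year, month, 1, 12, 0, 0, 0, 0, -1))
            if time.localtime(probe).tm_isdst > 0:
                has_dst = True
        altzone = tz - 3600 if has_dst else tz
    return tz, altzone
```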
[issue13707] Clarify hash() constancy period
Marc-Andre Lemburg added the comment:

Terry J. Reedy wrote:
> Martin, I do not understand. The default hash is based on id (as is default
> equality comparison), not value. Are you OK with hash values changing if the
> 'value' changes? My understanding is that changing hash values for objects
> in sets and dicts is bad, which is why mutable builtins with value-based
> equality do not have hash values.

Hash values are based on the object values, not their id(). See the various
type implementations as reference. The id() is only used as hash for objects
which don't have a "value" (and thus cannot be compared).

Given that we have the invariant "a==b => hash(a)==hash(b)" in Python, it
immediately follows that hash values for objects with comparison methods
cannot have a lifetime - at least not within the same process and, depending
how you look at it, also not in multi-process applications.

--
nosy: +lemburg
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Some comments:

1. The security implications in all this are being somewhat overemphasized.

There are many ways you can do a DoS attack on web servers. It's the
responsibility of the used web frameworks and servers to deal with the
possible cases.

It's a good idea to provide some way to protect against hash collision
attacks, but that will only solve one possible way of causing a resource
attack on a server. There are other ways you can generate lots of CPU
overhead with little data input (e.g. think of targeting the search feature
on many Zope/Plone sites).

In order to protect against such attacks in general, we'd have to provide a
way to control CPU time and e.g. raise an exception if too much time is being
spent on a simple operation such as a key insertion. This can be done using
timers, signals or even under OS control.

The easiest way to protect against the hash collision attack is by limiting
the POST/GET/HEAD request size. The second best way would be to limit the
number of parameters that a web framework accepts for POST/GET/HEAD requests.

2. Changing the semantics of hashing in a dot release is not allowed.

If randomization of the hash start vector or some other method is enabled by
default in a dot release, this will change the semantics of any application
switching to that dot release.

The hash values of Python objects are not only used by the Python dictionary
implementation, but also by other storage mechanisms such as on-disk
dictionaries, inter-process object exchange via shared memory, memcache, etc.

Hence, if changed, the hash change should be disabled per default for dot
releases and enabled for 3.3.

3. Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g. take integers
which use a trivial hash function. We'd have to adapt all hash functions of
the basic types in Python or come up with a generic solution using e.g.
double-hashing in the dictionary/set implementations.

4. By just using a random start vector you change the absolute hash values
for specific objects, but not the overall hash sequence or its period.

An attacker only needs to create many hash collisions, not specific ones.
It's the period of the hash function that's important in such attacks and
that doesn't change when moving to a different start vector.

5. Hashing needs to be fast.

It's one of the most used operations in Python. Please get experts into the
boat like Tim Peters and Christian Tismer, who both have worked on the dict
implementation and the hash functions, before experimenting with ad-hoc
fixes.

6. Counting collisions could solve the issue without having to change
hashing.

Another idea would be counting the collisions and raising an exception if the
number of collisions exceeds a certain threshold. Such a change would work
for all hashable Python objects and protect against the attack without
changing any hash function.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

--
nosy: +lemburg
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
> 3. Changing the way strings are hashed doesn't solve the problem.
>
> Hash values of other types can easily be guessed as well, e.g.
> take integers which use a trivial hash function.

Here's an example for integers on a 64-bit machine:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict(g)

This takes ages to complete and only uses very little memory. The input data
has some 32MB if written down in decimal numbers - not all that much data
either.

32397634
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

The email interface ate part of my reply:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> s = ''.join(str(x) for x in g)
>>> len(s)
32397634
>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict(g)

... lots of time for coffee, pizza, taking a walk, etc. :-)
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
> 1. The security implications in all this is being somewhat overemphasized.
>
> There are many ways you can do a DoS attack on web servers. It's the
> responsibility of the used web frameworks and servers to deal with
> the possible cases.
>
> It's a good idea to provide some way to protect against hash
> collision attacks, but that will only solve one possible way of
> causing a resource attack on a server.
>
> There are other ways you can generate lots of CPU overhead with
> little data input (e.g. think of targeting the search feature on
> many Zope/Plone sites).
>
> In order to protect against such attacks in general, we'd have to
> provide a way to control CPU time and e.g. raise an exception if too
> much time is being spent on a simple operation such as a key insertion.
> This can be done using timers, signals or even under OS control.
>
> The easiest way to protect against the hash collision attack is by
> limiting the POST/GET/HEAD request size.

For GET and HEAD, web servers normally already apply such limitations at
rather low levels:

http://stackoverflow.com/questions/686217/maximum-on-http-header-values

So only HTTP methods which carry data in the body part of the HTTP request
are affected, e.g. POST and various WebDAV methods.

> The second best way would be to limit the number of parameters that a
> web framework accepts for POST/GET/HEAD request.

Depending on how parsers are implemented, applications taking
XML/JSON/XML-RPC/etc. as data input may also be vulnerable, e.g.
non-validating XML parsers which place element attributes into a dictionary
or a JSON parser that has to read the JSON version of the dict I generated
earlier on.
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Paul McMillan wrote:
> This is not something that can be fixed by limiting the size of POST/GET.
>
> Parsing documents (even offline) can generate these problems. I can create
> books that calibre (a Python-based ebook format shifting tool) can't
> convert, but are otherwise perfectly valid for non-python devices. If I'm
> allowed to insert usernames into a database and you ever retrieve those in
> a dict, you're vulnerable. If I can post things one at a time that
> eventually get parsed into a dict (like the tag example), you're
> vulnerable. I can generate web traffic that creates log files that are
> unparsable (even offline) in Python if dicts are used anywhere. Any
> application that accepts data from users needs to be considered.
>
> Even if the web framework has a dictionary implementation that randomizes
> the hashes so it's not vulnerable, the entire python standard library uses
> dicts all over the place. If this is a problem which must be fixed by the
> framework, they must reinvent every standard library function they hope to
> use.
>
> Any non-trivial python application which parses data needs the fix. The
> entire standard library needs the fix if is to be relied upon by
> applications which accept data. It makes sense to fix Python.

Agreed: Limiting the size of POST requests only applies to *web*
applications. Other applications will need other fixes.

Trying to fix the problem in general by tweaking the hash function to
(apparently) make it hard for an attacker to guess a good set of colliding
strings/integers/etc. is not really a good solution. You'd only be making it
harder for script kiddies, but as soon as someone cryptanalyzes the hash
algorithm used, you're lost again.

You'd need to use crypto hash functions or universal hash functions if you
want to achieve good security, but that's not an option for Python objects,
since the hash functions need to be as fast as possible (which rules out
crypto hash functions) and cannot easily drop the invariant "a=b =>
hash(a)=hash(b)" (which rules out universal hash functions, AFAICT).

IMO, the strategy to simply cap the number of allowed collisions is a better
way to achieve protection against this particular resource attack. The
probability of having valid data reach such a limit is low and, if
configurable, can be made 0.

> Of course we must fix all the basic hashing functions in python, not just
> the string hash. There aren't that many.

... not in Python itself, but if you consider all the types in Python
extensions and classes implementing __hash__ in user code, the number of
hash functions to fix quickly becomes unmanageable.

> Marc-Andre:
> If you look at my proposed code, you'll notice that we do more than simply
> shift the period of the hash. It's not trivial for an attacker to create
> colliding hash functions without knowing the key.

Could you post it on the ticket ?

BTW: I wonder how long it's going to take before someone figures out that
our merge sort based list.sort() is vulnerable as well... its worst-case
performance is O(n log n), making attacks somewhat harder. The popular
quicksort which Python used for a long time has O(n²), making it much easier
to attack, but fortunately, we replaced it with merge sort in Python 2.3,
before anyone noticed ;-)
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment:

Before continuing down the road of adding randomness to hash functions,
please have a good read of the existing dictionary implementation:

"""
Major subtleties ahead: Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness. Python doesn't: its most
important hash functions (for strings and ints) are very regular in common
cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. The
same is approximately true when keys are "consecutive" strings. So this gives
better-than-random behavior in common cases, and that's very desirable.
...
"""

There's also a file called dictnotes.txt which has more interesting details
about how the implementation is designed.

Please note that the term "collision" is used in a slightly different way: it
refers to trying to find an empty slot in the dictionary table. Having a
collision implies that the hash values of two distinct objects are the same,
but you also get collisions in case two distinct objects with different hash
values get mapped to the same table entry.

An attack can be based on trying to find many objects with the same hash
value, or trying to find many objects that, as they get inserted into a
dictionary, very often cause collisions due to the collision resolution
algorithm not finding a free slot.

In both cases, the (slow) object comparisons needed to find an empty slot is
what makes the attack practical, if the application puts too much trust into
large blobs of input data - which is the actual security issue we're trying
to work around here...

Given the dictionary implementation notes, I'm even less certain that the
randomization change is a good idea. It will likely introduce a performance
hit due to both the added complexity in calculating the hash as well as the
reduced cache locality of the data in the dict table.

I'll upload a patch that demonstrates the collision counting strategy to show
that detecting the problem is easy. Whether just raising an exception is a
good idea, is another issue.

It may be better to change the tp_hash slot in Python 3.3 to take an
argument, so that the dict implementation can use the hash function as a
universal hash family function (see
http://en.wikipedia.org/wiki/Universal_hash). The dict implementation could
then alter the hash parameter and recreate the dict table in case the number
of collisions exceeds a certain limit, thereby actively taking action instead
of just relying on randomness solving the issue in most cases.
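As a rough, pure-Python illustration of the counting idea (this is a toy
open-addressing probe, not CPython's actual lookdict and not the attached
patch, and the limit of 1000 is arbitrary):

```python
MAX_COLLISIONS = 1000

def probe(slots, key):
    """Find the slot index for key in a power-of-two sized table of (key, value) pairs."""
    mask = len(slots) - 1
    h = hash(key)
    i = h & mask
    perturb = h
    collisions = 0
    while slots[i] is not None and slots[i][0] != key:
        collisions += 1
        if collisions > MAX_COLLISIONS:
            # Same idea as the demo patch: refuse to keep probing forever.
            raise KeyError("too many hash collisions")
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask
    return i
```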
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Demo patch implementing the collision limit idea for Python 2.7. -- Added file: http://bugs.python.org/file24151/hash-attack.patch ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
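The patch itself is attached to the tracker issue; purely as an illustration of the counting idea (a toy pure-Python model, not the actual C change in hash-attack.patch):

    # Toy open-addressing insert with a cap on collisions per lookup.  Linear
    # probing is used for brevity; the real dict lookup perturbs the index.
    MAX_COLLISIONS = 1000

    def toy_insert(table, key, value):
        mask = len(table) - 1
        i = hash(key) & mask
        collisions = 0
        while table[i] is not None and table[i][0] != key:
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise KeyError('too many hash collisions')
            i = (i + 1) & mask
        table[i] = (key, value)

    table = [None] * 8
    toy_insert(table, 'a', 1)
    toy_insert(table, 'b', 2)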
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: The hash-attack.patch solves the problem for the integer case I posted earlier on and doesn't cause any problems with the test suite. Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'too many hash collisions' It also doesn't change the hashing or dict repr in existing applications. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Stupid email interface again... here's the full text: The hash-attack.patch solves the problem for the integer case I posted earlier on and doesn't cause any problems with the test suite. >>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100)) >>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000)) Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'too many hash collisions' It also doesn't change the hashing or dict repr in existing applications. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > hash-attack.patch does never decrement the collision counter. Why should it ? It's only used as local variable in the lookup function. Note that the limit only triggers on a per-key basis. It's not a limit on the total number of collisions in the table, so you don't need to keep the number of collisions stored on the object. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Here's an example of hash-attack.patch finding an on-purpose programming error (hashing all objects to the same value): http://stackoverflow.com/questions/4865325/counting-collisions-in-a-python-dictionary (see the second example on the page for @Winston Ewert's solution) With the patch you get: Traceback (most recent call last): File "testcollisons.py", line 20, in <module> d[o] = 1 KeyError: 'too many hash collisions' -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
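The pattern behind that example is easy to reproduce; something along these lines (not the exact code from the linked page) is what the limit is meant to catch:

    # Every instance hashes to the same value, so each insert has to probe
    # past all previously inserted instances -- the degenerate quadratic case.
    class BadKey(object):
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42

    d = {}
    for i in range(2000):
        d[BadKey(i)] = 1   # with the collision limit patch, this raises KeyError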
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Paul McMillan wrote: > >> I'll upload a patch that demonstrates the collision counting >> strategy to show that detecting the problem is easy. Whether >> just raising an exception is a good idea is another issue. > > I'm in cautious agreement that collision counting is a better > strategy. The dict implementation performance would suffer from > randomization. > >> The dict implementation could then alter the hash parameter >> and recreate the dict table in case the number of collisions >> exceeds a certain limit, thereby actively taking action >> instead of just relying on randomness solving the issue in >> most cases. > > This is clever. You basically neuter the attack as you notice it but > everything else is business as usual. I'm concerned that this may end > up being costly in some edge cases (e.g. look up how many collisions > it takes to force the recreation, and then aim for just that many > collisions many times). Unfortunately, each dict object has to > discover for itself that it's full of offending hashes. Another > approach would be to neuter the offending object by changing its hash, > but this would require either returning multiple values, or fixing up > existing dictionaries, neither of which seems feasible. I ran some experiments with the collision counting patch and could not trigger it in normal applications, not even in cases that are documented in the dict implementation to have a poor collision resolution behavior (integers with zeros in the low bits). The probability of having to deal with dictionaries that create over a thousand collisions for one of the key objects in a real life application appears to be very very low. Still, it may cause problems with existing applications for the Python dot releases, so it's probably safer to add it in a disabled-per-default form there (using an environment variable to adjust the setting). For 3.3 it could be enabled per default and it would also make sense to allow customizing the limit using a sys module setting. The idea with adding a parameter to the hash method/slot in order to have objects provide a hash family function instead of a fixed unparametrized hash function would probably have to be implemented as an additional hash method, e.g. .__uhash__() and tp_uhash ("u" for universal). The builtin types should then grow such methods in order to make hashing safe against such attacks. For objects defined in 3rd party extensions, we would need to encourage implementing the slot/method as well. If it's not implemented, the dict implementation would have to fall back to raising an exception. Please note that I'm just sketching things here. I don't have time to work on a full-blown patch, just wanted to show what I meant with the collision counting idea and demonstrate that it actually works as intended. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
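There is no .__uhash__() in CPython; just to sketch what a hash family means here, a parametrized string hash could look like this (the 1000003 multiplier is the one from the classic string hash, the rest is purely illustrative):

    # Each value of `param` selects a different member of the hash family, so
    # a dict under attack could pick a new parameter and rebuild its table.
    def uhash(s, param):
        h = param
        for ch in s:
            h = ((h * 1000003) ^ ord(ch)) & (2**64 - 1)
        return h ^ len(s)

    print(uhash("namea", 0), uhash("namea", 12345))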
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Tim Peters wrote: > > Tim Peters added the comment: > > [Marc-Andre] >> BTW: I wonder how long it's going to take before >> someone figures out that our merge sort based >> list.sort() is vulnerable as well... its worst- >> case performance is O(n log n), making attacks >> somewhat harder. > > I wouldn't worry about that, because nobody could stir up anguish > about it by writing a paper ;-) > > 1. O(n log n) is enormously more forgiving than O(n**2). > > 2. An attacker need not be clever at all: O(n log n) is not only > sort()'s worst case, it's also its _expected_ case when fed randomly > ordered data. > > 3. It's provable that no comparison-based sorting algorithm can have > better worst-case asymptotic behavior when fed randomly ordered data. > > So if anyone whines about this, tell 'em to go do something useful instead :-) Right on all accounts :-) -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Christian Heimes wrote: > Marc-Andre: > Have you profiled your suggestion? I'm interested in the speed implications. > My gut feeling is that your idea could be slower, since you have added more > instructions to a tight loop, that is execute on every lookup, insert, update > and deletion of a dict key. The hash modification could have a smaller > impact, since the hash is cached. I'm merely speculating here until we have > some numbers to compare. I haven't done any profiling on this yet, but will run some tests. The lookup functions in the dict implementation are optimized to make the first non-collision case fast. The patch doesn't touch this loop. The only change is in the collision case, where an increment and comparison is added (and then only after the comparison which is the real cost factor in the loop). I did add a printf() to see how often this case occurs - it's a surprisingly rare case, which suggests that Tim, Christian and all the others that have invested considerable time into the implementation have done a really good job here. BTW: I noticed that a rather obvious optimization appears to be missing from the Python dict initialization code: when passing in a list of (key, value) pairs, the implementation doesn't make use of the available length information and still starts with an empty (small) dict table and then iterates over the pairs, increasing the table size as necessary. It would be better to start with a table that is presized to O(len(data)). The dict implementation already provides such a function, but it's not being used in the case dict(pair_list). Anyway, just an aside. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
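The resizing behaviour described in the aside is easy to observe from Python (the exact sizes and thresholds differ between versions and builds, so the numbers printed are not meant to be exact):

    import sys

    # Building a dict item by item -- which is effectively what dict(pair_list)
    # does without the presizing optimization -- grows the table several times.
    d = {}
    last = sys.getsizeof(d)
    for i in range(10000):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last:
            print("resized at %d items: %d -> %d bytes" % (len(d), last, size))
            last = size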
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Marc-Andre Lemburg wrote: > > Marc-Andre Lemburg added the comment: > > Christian Heimes wrote: >> Marc-Andre: >> Have you profiled your suggestion? I'm interested in the speed implications. >> My gut feeling is that your idea could be slower, since you have added more >> instructions to a tight loop, that is execute on every lookup, insert, >> update and deletion of a dict key. The hash modification could have a >> smaller impact, since the hash is cached. I'm merely speculating here until >> we have some numbers to compare. > > I haven't done any profiling on this yet, but will run some > tests. I ran pybench and pystone: neither shows a significant change. I wish we had a simple to run benchmark based on Django to allow checking such changes against real world applications. Not that I expect different results from such a benchmark... To check the real world impact, I guess it would be best to run a few websites with the patch for a week and see whether the collision exception gets raised. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > Patch version 5 fixes test_unicode for 64-bit system. Victor, I don't think the randomization idea is going anywhere. The code has many issues: * it is exceedingly complex * the method would need to be implemented for all hashable Python types * it causes startup time to increase (you need urandom data for every single hashable Python data type) * it causes run-time to increase due to changes in the hash algorithm (more operations in the tight loop) * causes different processes in a multi-process setup to use different hashes for the same object * doesn't appear to work well in embedded interpreters that regularly restarted interpreters (AFAIK, some objects persist across restarts and those will have wrong hash values in the newly started instances) The most important issue, though, is that it doesn't really protect Python against the attack - it only makes it less likely that an adversary will find the init vector (or a way around having to find it via crypt analysis). OTOH, the collision counting patch is very simple, doesn't have the performance issues and provides real protection against the attack. Even better still, it can detect programming errors in hash method implementations. IMO, it would be better to put efforts into refining the collision detection patch (perhaps adding support for the universal hash method slot I mentioned) and run some real life tests with it. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> * it is exceedingly complex > > Which part exactly? For hash(str), it just add two extra XOR. I'm not talking specifically about your patch, but the whole idea and the needed changes in general. >> * the method would need to be implemented for all hashable Python types > > It was already discussed, and it was said that only hash(str) need to > be modified. Really ? What about the much simpler attack on integer hash values ? You only have to send a specially crafted JSON dictionary with integer keys to a Python web server providing JSON interfaces in order to trigger the integer hash attack. The same goes for the other Python data types. >> * it causes startup time to increase (you need urandom data for >> every single hashable Python data type) > > My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do > you have a benchmark showing a difference? > > I didn't try my patch on Windows yet. Your patch only implements the simple idea of adding an init vector and a fixed suffix vector (which you don't need since it doesn't prevent hash collisions). I don't think that's good enough, since it doesn't change how the hash algorithm works on the actual data, but instead just shifts the algorithm to a different sequence. If you apply the same logic to the integer hash function, you'll see that more clearly. Paul's algorithm is much more secure in this respect, but it requires more random startup data. >> * it causes run-time to increase due to changes in the hash >> algorithm (more operations in the tight loop) > > I posted a micro-benchmark on hash(str) on python-dev: the overhead is > nul. Did you have numbers showing that the overhead is not nul? For the simple solution, that's an expected result, but if you want more safety, then you'll see a hit due to the random data getting XOR'ed in every single loop. >> * causes different processes in a multi-process setup to use different >> hashes for the same object > > Correct. If you need to get the same hash, you can disable the > randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g. > PYTHONHASHSEED=42). So you have the choice of being able to work in a multi-process environment and be vulnerable to the attack or not. I think we can do better :-) Note that web servers written in Python tend to be long running processes, so an attacker has lots of time to test various seeds. >> * doesn't appear to work well in embedded interpreters that >> regularly restarted interpreters (AFAIK, some objects persist across >> restarts and those will have wrong hash values in the newly started >> instances) > > test_capi runs _testembed which restarts a embedded interpreters 3 > times, and the test pass (with my patch version 5). Can you write a > script showing the problem if there is a real problem? > > In an older version of my patch, the hash secret was recreated at each > initiliazation. I changed my patch to only generate the secret once. Ok, that should fix the case. 
Two more issues that I forgot: * enabling randomized hashing can make debugging a lot harder, since it's rather difficult to reproduce the same state in a controlled way (unless you record the hash seed somewhere in the logs) and even though applications should not rely on the order of dict repr()s or str()s, they do often enough: * randomized hashing will result in repr() and str() of dictionaries being random as well >> The most important issue, though, is that it doesn't really >> protect Python against the attack - it only makes it less >> likely that an adversary will find the init vector (or a way >> around having to find it via crypt analysis). > > I agree that the patch is not perfect. As written in the patch, it > just makes the attack more complex. I consider that it is enough. Wouldn't you rather see a fix that works for all hash functions and Python objects ? One that doesn't cause performance issues ? The collision counting idea has this potential. > Perl has a simpler protection than the one proposed in my patch. Is > Perl vulnerable to the hash collision vulnerability? I don't know what Perl did or how hashing works in Perl, so cannot comment on the effect of their fix. FWIW, I don't think that we should use Perl or Java as reference here. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
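With the proposed PYTHONHASHSEED variable, the repr/order instability is easy to demonstrate by spawning interpreters with different seeds (assumes an interpreter that honours PYTHONHASHSEED):

    import os
    import subprocess
    import sys

    # The same set of strings typically prints in a different order for
    # different hash seeds, which is what makes debug output non-repeatable.
    code = "print(list({'namea', 'nameb', 'namec', 'named'}))"
    for seed in ("1", "2"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        out = subprocess.check_output([sys.executable, "-c", code], env=env)
        print(seed, out.decode().strip())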
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Mark Shannon wrote: > > Mark Shannon added the comment: > >>>> * the method would need to be implemented for all hashable Python types >>> It was already discussed, and it was said that only hash(str) need to >>> be modified. >> >> Really ? What about the much simpler attack on integer hash values ? >> >> You only have to send a specially crafted JSON dictionary with integer >> keys to a Python web server providing JSON interfaces in order to >> trigger the integer hash attack. > > JSON objects are decoded as dicts with string keys, integers keys are > not possible. > > >>> json.loads(json.dumps({1:2})) > {'1': 2} Thanks for the correction. Looks like XML-RPC also doesn't accept integers as dict keys. That's good :-) However, as Paul already noted, such attacks can also occur in other places or parsers in an application, e.g. when decoding FORM parameters that use integers to signal a line or parameter position (example: value_1=2&value_2=3...) which are then converted into a dictionary mapping the position integer to the data. marshal and pickle are vulnerable, but then you normally don't expose those to untrusted data. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
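For instance, a parser along these lines (a hypothetical helper, only to illustrate the FORM-parameter case) ends up with an attacker-controlled dict of integer keys:

    # Turn "value_1=2&value_2=3&..." into {1: '2', 2: '3', ...}: the integer
    # positions are chosen by the client, so they can be chosen to collide.
    try:
        from urllib.parse import parse_qsl   # Python 3
    except ImportError:
        from urlparse import parse_qsl       # Python 2

    def positional_values(query):
        result = {}
        for key, value in parse_qsl(query):
            if key.startswith("value_"):
                result[int(key.split("_", 1)[1])] = value
        return result

    print(positional_values("value_1=2&value_2=3"))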
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> OTOH, the collision counting patch is very simple, doesn't have >> the performance issues and provides real protection against the >> attack. > > I don't know about real protection: you can still slow down dict > construction by 1000x (the number of allowed collisions per lookup), > which can be enough combined with a brute-force DOS. On my slow dev machine 1000 collisions run in around 22ms: python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))" 100 loops, best of 3: 22.4 msec per loop Using this for a DOS attack would be rather noisy, much unlike sending a single POST. Note that the choice of 1000 as limit is rather arbitrary. I just chose it because it's high enough because it's very unlikely to be hit by an application that is not written to trigger it and it's low enough to still provide a good run-time behavior. Perhaps an even lower figure would be better. > Also, how about false positives? Having legitimate programs break > because of legitimate data would be a disaster. Yes, which is why the patch should be disabled by default (using an env var) in dot-releases. It's probably also a good idea to make the limit configurable to adjust to ones needs. Still, it is *very* unlikely that you run into real data causing more than 1000 collisions for a single insert. For full protection the universal hash method idea would have to be implemented (adding a parameter to the hash methods, so that they can be parametrized). This would then allow switching the dict to an alternative hash implementation resolving the collision problem, in case the implementation detects high number of collisions. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Mark Dickinson wrote: > > Mark Dickinson added the comment: > > [Antoine] >> Also, how about false positives? Having legitimate programs break >> because of legitimate data would be a disaster. > > This worries me, too. > > [MAL] >> Yes, which is why the patch should be disabled by default (using >> an env var) in dot-releases. > > Are you proposing having it enabled by default in Python 3.3? Possibly, yes. Depends on whether anyone comes up with a problem in the alpha, beta, RC release cycle. It would be great to have the universal hash method approach for Python 3.3. That way Python could self-heal itself in case it finds too many collisions. My guess is that it's still better to raise an exception, though, since it would uncover either attacks or programming errors. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> On my slow dev machine 1000 collisions run in around 22ms: >> >> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, >> 1000))" >> 100 loops, best of 3: 22.4 msec per loop >> >> Using this for a DOS attack would be rather noisy, much unlike >> sending a single POST. > > Note that sending one POST is not enough, unless the attacker is content > with blocking *one* worker process for a couple of seconds or minutes > (which is a rather tiny attack if you ask me :-)). Also, you can combine > many dicts in a single JSON list, so that the 1000 limit isn't > overreached for any of the dicts. Right, but such an approach only scales linearly and doesn't exhibit the quadratic nature of the collision resolution. The above with 10,000 items takes 5 seconds on my machine. The same with 100,000 items is still running after 16 minutes. > So in all cases the attacker would have to send many of these POST > requests in order to overwhelm the target machine. That's how DOS > attacks work AFAIK. Depends :-) Hiding a few tens of such requests in the input stream of a busy server is easy. Doing the same with thousands of requests is a lot harder. FWIW: The above dict string version just has some 263kB for the 10,000 case, 114kB if gzip compressed. >> Yes, which is why the patch should be disabled by default (using >> an env var) in dot-releases. It's probably also a good idea to >> make the limit configurable to adjust to ones needs. > > Agreed if it's disabled by default then it's not a problem, but then > Python is vulnerable by default... Yes, but at least the user has an option to switch on the added protection. We'd need some field data to come to a decision. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
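The quadratic growth is easy to check by timing a few sizes (Python 2 shown to match the timeit call quoted above; it relies on multiples of 2**64 - 1 colliding on a 64-bit build):

    import timeit

    # Doubling the number of colliding keys should roughly quadruple the time.
    for n in (1000, 2000, 4000):
        t = timeit.timeit(
            "dict((x*(2**64 - 1), 1) for x in xrange(1, %d))" % n, number=3)
        print(n, t / 3)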
[issue13703] Hash collision security issue
Marc-Andre Lemburg added the comment: Frank Sievertsen wrote: > > I don't want my software to stop working because someone managed to enter > 1000 bad strings into it. Think of a software that handles names of customers > or filenames. We don't want it to break completely just because someone > entered a few clever names. Collision counting is just a simple way to trigger an action. As I mentioned in my proposal on this ticket, raising an exception is just one way to deal with the problem in case excessive collisions are found. A better way is to add a universal hash method, so that the dict can adapt to the data and modify the hash functions for just that dict (without breaking other dicts or changing the standard hash functions). Note that raising an exception doesn't completely break your software. It just signals a severe problem with the input data and a likely attack on your software. As such, it's no different than turning on DOS attack prevention in your router. In case you do get an exception, a web server will simply return a 500 error and continue working normally. For other applications, you may see a failure notice in your logs. If you're sure that there are no possible ways to attack the application using such data, then you can simply disable the feature to prevent such exceptions. > Randomization fixes most of these problems. See my list of issues with this approach (further up on this ticket). > However, it breaks the steadiness of hash(X) between two runs of the same > software. There's probably code out there that assumes that hash(X) always > returns the same value: database- or serialization-modules, for example. > > There might be good reasons to also have a steady hash-function available. > The broken code is hard to fix if no such a function is available at all. > Maybe it's possible to add a second steady hash-functions later again? This is one of the issues I mentioned. > For the moment I think the best way is to turn on randomization of hash() by > default, but having a way to turn it off. -- ___ Python tracker <http://bugs.python.org/issue13703> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: Ezio Melotti wrote: > > Ezio Melotti added the comment: > > I suggest to: > 1) remove the alias for tactis; > 2) add the aliases for latin_* and the tests for the aliases; > 3) fix the email.charset to use the new aliases instead of its own dict. > > 2) and 3) should go on 3.3 only, 1) could be considered a bug and fixed on > 2.7/3.2 too, but since the codec is already missing, removing the alias won't > change anything (i.e. it will raise a LookupError with or without alias). +1 -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12100] Incremental encoders of CJK codecs reset the codec at each call to encode()
Marc-Andre Lemburg added the comment: I think it's better to use a StringIO instance for the tests. Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this. -- ___ Python tracker <http://bugs.python.org/issue12100> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > euc_jp and euc_kr seem to be backward (that is, codecs translates them to the > _ version, instead of translating the _ version to the - version). I worry > that there might be other deviations from the standard email names. I would > suggest we pull the list of preferred MIME names from the IANA charset > registry and make a test out of them in the email package. If changing the > name returned by codecs is determined to not be acceptable, then those > entries will need to remain in the charset module ALIASES table and the > codecs-check logic adjusted accordingly. > > Unfortunately the IANA registry does not list MIME names for all of the > charsets in common use, and the canonical names are not always the ones > commonly used in email. Hopefully the codecs registry is using the most > common name for those, and hopefully if there are differences it won't break > any user code, since any reasonable email code should be coping with the > aliases in any case. The way I understand the patch was that the email package will start to use the encoding aliases for determining the codec name instead of its own list. That is: only for decoding the input data, not for creating a correct MIME encoding name in output data. -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > New submission from STINNER Victor : > > Sometimes, we need to know the version of the Linux kernel. Recent examples: > test if SOCK_CLOEXEC or O_CLOEXEC are supported by the kernel or not. Linux < > 2.6.23 *silently* ignores O_CLOEXEC flag of open(). > > linux_version() is already implemented in test_socket, but it looks like > test_posix does also need it. > > Attached patch adds platform.linux_version(). It returns (a, b, c) (integers) > or None (if not Linux). > > It raises an error if the version string cannot be parsed. The APIs in platform generally try not to raise errors, but instead return a default value you pass in as parameter in case the data cannot be fetched from the system. The returned value should be a version string in a fixed format, not a tuple. I'd suggest to use _norm_version() for this. Please also check whether this works on a few Linux systems. I've checked it on openSUSE, Ubuntu. Thanks, -- Marc-Andre Lemburg eGenix.com 2011-06-20: EuroPython 2011, Florence, Italy 28 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> The returned value should be a version string in a fixed format, >> not a tuple. I'd suggest to use _norm_version() for this. > > How do you compare version strings? I prefer tuples, as sys.version_info, > because the comparaison is more natural: > >>>> '2.6.9' > '2.6.20' > True >>>> (2, 6, 9) > (2, 6, 20) > False The APIs are mostly used for creating textual representations of system information, hence the use of strings. You can add an additional linux_version_info() API if you want to have tuples. -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > Use "%s.%s.%s" % linux_version() if you would like to format the version. The > format is well defined. (You should only do that under Linux.) No, please follow the API conventions in that module and use a string. You can then use linux_version().split('.') in code that want to do version comparisons. Thanks, -- Marc-Andre Lemburg eGenix.com 2011-06-20: EuroPython 2011, Florence, Italy 28 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Closing the ticket again. We still need codecs.open() to support applications that target Python 2.x and 3.x. You can reopen it after Python 2.x has been end-of-life'd. -- resolution: -> postponed status: open -> closed ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Correcting the title: this ticket is about codecs.open(), not StreamRead and StreamWriter, both of which are essential parts of the Python codec machinery and are needed to be able to implement per-codec implementations of codecs which read from and write to streams. TextIOWrapper() is conceptually something completely different. It's more something like StreamReaderWriter(). The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncode/Decoder per default. Please open a new ticket for this. Thanks. -- ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12160] codecs doc: what is StreamCodec?
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > New submission from STINNER Victor : > > Codec.encode() and Codec.decode() refer to StreamCode, but I cannot find this > class in the doc nor in the code. > > I suppose that it should be replaced by IncrementalEncoder and > IncrementalDecoder. If I'm correct, see attached patch. Thanks for spotting this. It should read StreamReader/StreamWriter, since these were designed to keep state. -- ___ Python tracker <http://bugs.python.org/issue12160> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12158] platform: add linux_version()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> You can then use linux_version().split('.') in code that want >> to do version comparisons. > > It doesn't give the expected result: > >>>> ('2', '6', '9') < ('2', '6', '20') > False >>>> ('2', '6', '9') > ('2', '6', '20') > True Sorry, I forgot the tuple(int(x) for ...) part. > By the way, if you would like to *display* the Linux version, it's better to > use release() which gives more information: No, release() doesn't have any defined format. >>>> platform.linux_version() > (2, 6, 38) >>>> platform.release() > '2.6.38-2-amd64' > > About the name convention: mac_ver() and win32_ver() do return tuples. If you > prefer linux_version_tuple(), it's as you want. But return a tuple of strings > is useless: if you would like a string, use release() and parse the string > yourself. Please look again: they both return the version and other infos as strings. > Note: "info" suffix is not currently used, whereas there are python_version() > and python_version_tuple(). Good point. I was thinking of the sys module function to return the Python version as tuple. >> Do we really need to expose a such Linux-centric and sparingly >> used function to the platform module? > > The platform module has already 2 functions specific to Linux: > linux_distribution() and libc_ver(). But if my proposed API doesn't fit > platform conventions, yeah, we can move the function to test.support. Indeed and in retrospect, adding linux_distribution() was a mistake, since it causes too much maintenance. The linux_version() is likely going to cause similar issues, since on the systems I checked, some return three part versions, others four parts and then again other add a distribution specific revision counter to it. Then you have pre-releases, release candidates and development versions: http://en.wikipedia.org/wiki/Linux_kernel#Version_numbering Reconsidering, I think it's better not to add the API to prevent opening up another can of worms. -- ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
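Code that needs such a comparison can do the parsing itself; a sketch along these lines (not part of the platform module, and only meaningful on Linux) looks at the leading digits only:

    import platform
    import re

    def linux_version_tuple(default=(0, 0, 0)):
        # '2.6.38-2-amd64' -> (2, 6, 38); anything unparsable -> default
        m = re.match(r"(\d+)\.(\d+)\.(\d+)", platform.release())
        return tuple(int(x) for x in m.groups()) if m else default

    print(linux_version_tuple() >= (2, 6, 23))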
[issue12158] platform: add linux_version()
Changes by Marc-Andre Lemburg : -- resolution: -> rejected status: open -> closed ___ Python tracker <http://bugs.python.org/issue12158> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter
Marc-Andre Lemburg added the comment: Antoine Pitrou wrote: > > Antoine Pitrou added the comment: > >> TextIOWrapper() is conceptually something completely different. It's >> more something like StreamReaderWriter(). > > That's a rather strange assertion. Can you expand? > TextIOWrapper supports read-only, write-only, read-write, unseekable and > seekable streams. StreamReader and StreamWriter classes provide the base codec implementations for stateful interaction with streams. They define the interface and provide a working implementation for those codecs that choose not to implement their own variants. Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance. Both are essential parts of the codec interface. TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference. -- title: Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter -> Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8796] Deprecate codecs.open()
Changes by Marc-Andre Lemburg : -- title: Deprecate codecs.open(), codecs.StreamReader and codecs.StreamWriter -> Deprecate codecs.open() ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12100] Incremental encoders of CJK codecs reset the codec at each call to encode()
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > >> I think it's better to use a StringIO instance for the tests. > > For which test excatly? An encoder produces bytes, I don't the relation with > StringIO. Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO. -- title: Incremental encoders of CJK codecs reset the codec at each call to encode() -> Incremental encoders of CJK codecs reset the codec at each call to encode() ___ Python tracker <http://bugs.python.org/issue12100> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
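In Python 3 terms, a test can drive a codec's StreamWriter against an in-memory buffer; for example (euc_jp chosen arbitrarily):

    import codecs
    import io

    buf = io.BytesIO()
    writer = codecs.getwriter("euc_jp")(buf)
    writer.write("\u3042")    # HIRAGANA LETTER A
    print(buf.getvalue())     # b'\xa4\xa2'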
[issue12171] Reset method of the incremental encoders of CJK codecs calls the decoder reset function
Marc-Andre Lemburg added the comment: Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc added the comment: > > Do we need an additional method? It seems that this reset() could also be > written encoder.encode('', final=True) +1 I think that's a much more natural way to implement "finalize the encoding output without adding any data to it". -- title: Reset method of the incremental encoders of CJK codecs calls the decoder reset function -> Reset method of the incremental encoders of CJK codecs calls the decoder reset function ___ Python tracker <http://bugs.python.org/issue12171> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
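For a stateful codec the effect of final=True is visible; for example with iso2022_jp (the byte values in the comments are what the codec is expected to produce, shown only for illustration):

    import codecs

    enc = codecs.getincrementalencoder("iso2022_jp")()
    part = enc.encode("\u3042")         # switches into a JIS X 0208 segment
    tail = enc.encode("", final=True)   # flushes the escape back to ASCII
    print(part, tail)                   # e.g. b'\x1b$B$"' b'\x1b(B'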
[issue12171] Reset method of the incremental encoders of CJK codecs calls the decoder reset function
Marc-Andre Lemburg added the comment: STINNER Victor wrote: > > STINNER Victor added the comment: > > On Wednesday 25 May 2011 at 08:23 +, Marc-Andre Lemburg wrote: >>> Do we need an additional method? It seems that this reset() could >>> also be written encoder.encode('', final=True) >> >> +1 >> >> I think that's a much more natural way to implement "finalize the >> encoding output without adding any data to it". > > And so, reset() should discard the output? I can easily adapt my patch > to discard the output (but still call encreset() instead of decreset()). I'm not sure what you mean by "discard the output". Calling .reset() should still add the closing sequence to the output buffer, if needed. The purpose of .reset() is to flush all data and put the codec into a clean state (comparable to the state you get when you start using it). -- ___ Python tracker <http://bugs.python.org/issue12171> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9561] distutils: set encoding to utf-8 for input and output files
Marc-Andre Lemburg added the comment: Éric Araujo wrote: > > Éric Araujo added the comment: > > Definitely. We can fix real bugs in distutils, but sometimes it’s best to > avoid disruptive changes and let distutils with its buggy behavior and let > the packaging module have the best behavior. This is a real bug, since we agreed long ago that distutils should read and write files using the UTF-8 encoding. -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue9561> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > What is not-a-charset? > > I apparently misunderstood what normalize_encodings does. It isn't doing a > lookup in the codecs registry and returning the canonical name for the codec. > Does that mean we actually have to fetch the codec in order to get the > canonical name? I suspect so, and that is probably OK, since in most cases > the codec is eventually going to get called while processing the email that > triggered the ALIASES lookup. > > I also notice that there is a table of aliases in the codec module > documentation, so that will need to be updated as well. As far as the aliases.py part of the patch goes, I'm fine with that since it corrects a few real bugs and adds the missing Latin-N codec names. Regarding using this table in the email package, I'm not really clear on what you want to achieve. If you are looking for a way to determine whether Python has a codec installed for a certain charset name, then codecs.lookup() will tell you this (and it also applies all the aliasing and normalization needed). If you want to avoid the actual codec module import (codecs.lookup() imports the module), you can mimic the logic used by the lookup function of the encodings package. Not sure, whether that's worth it, though, since it is rather likely that you're going to use the codec you've just looked up soon after the test and codecs.lookup() caches the found codecs. If you want to convert an arbitrary encoding name to a registered standard IANA MIME charset name, then the aliases.py module is not going to be of much help, since we are using our own canonical names which do not necessarily map to the MIME charset names. You'd have to add a new mime_alias map to the email package for that. I'd suggest to use the same approach as for the aliases.py module, which is to first normalize the encoding name using normalize_encoding() and then running that through the mime_alias map. Hope that helps. -- ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
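For example, the lookup-based test is a one-liner (the canonical name shown in the comment is just what current CPython happens to return):

    import codecs

    print(codecs.lookup("latin-1").name)   # canonical codec name, e.g. 'iso8859-1'
    try:
        codecs.lookup("no-such-charset")
    except LookupError:
        print("unknown charset")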
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: R. David Murray wrote: > > R. David Murray added the comment: > > Well, my thought was to avoid having multiple charset alias lists in the > stdlib, and reusing the one in codecs, which is larger than the one in email, > seemed to make sense. This came up because a bug was reported where email > (silently) failed to encode a string because the charset alias, while present > in codecs, wasn't present in the email ALIASES table. > > I suppose that as an alternative I could add full support for the IANA > aliases list to email. Email is the most likely place to run in to variant > charset aliases anyway. > > If that's the way we go, then this issue should be changed over to covering > just updating codecs with the missing aliases, and a new issue opened for > adding full IANA alias support to email. I think it would be useful to have a mapping from the Python canonical name (the one the encodings package uses) to the "preferred MIME name" as referenced in the IANA list: http://www.iana.org/assignments/character-sets This mapping could also be added to the encodings package together with a function that translates a given encoding name to its canonical Python name (codec_module_name()) and another one to translate it to the "preferred MIME name" according to the above list (encoding_mime_name()). Note that we don't support all the aliases mentioned in the IANA list because many of them are outdated and some have proved to be wrong (the aliased encodings are actually different in a few places). There are also a few encodings in the list which we don't support at all. Since we only rarely get requests for supporting new aliases or encodings, I think it's safe to say that the existing set is fairly complete from a practical point of view. -- title: The email package should defer to the codecs module for all aliases -> The email package should defer to the codecs module for all aliases ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
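A sketch of that mapping approach (the table below is a made-up excerpt, not an existing stdlib table; encoding_mime_name() is the proposed function name):

    import codecs

    PREFERRED_MIME_NAME = {
        # canonical codec name -> preferred MIME name (illustrative entries)
        "iso8859-1": "ISO-8859-1",
        "utf-8": "UTF-8",
        "euc_jp": "EUC-JP",
    }

    def encoding_mime_name(name, default=None):
        try:
            canonical = codecs.lookup(name).name
        except LookupError:
            return default
        return PREFERRED_MIME_NAME.get(canonical, default)

    print(encoding_mime_name("latin1"))    # ISO-8859-1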
[issue8796] Deprecate codecs.open()
Marc-Andre Lemburg added the comment: Roundup Robot wrote: > > Roundup Robot added the comment: > > New changeset 3555cf6f9c98 by Victor Stinner in branch 'default': > Issue #8796: codecs.open() calls the builtin open() function instead of using > http://hg.python.org/cpython/rev/3555cf6f9c98 Viktor, could you please back out this change again. I am -1 on deprecating the StreamReader/Writer parts of the codec API as I've mentioned numerous times and *don't* want to see these deprecated in the code or the documentation. I'm -0 on the change to codecs.open(). Have you checked whether the returned objects are compatible ? Thanks, -- Marc-Andre Lemburg eGenix.com 2011-05-23: Released eGenix mx Base 3.2.0 http://python.egenix.com/ 2011-05-25: Released mxODBC 3.1.1 http://python.egenix.com/ 2011-06-20: EuroPython 2011, Florence, Italy 24 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ -- ___ Python tracker <http://bugs.python.org/issue8796> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue8898] The email package should defer to the codecs module for all aliases
Marc-Andre Lemburg added the comment: Michele Orrù wrote: > > Michele Orrù added the comment: > > Any idea about how to unittest mime.aliases? Test the APIs you probably created for accessing it. > Also, since I've just created a new file, are there some buracratic issues? I > mean, do I have to add something at the top of the file? > (I'm just signing the Contributor Agreement) You just need to put the usual copyright line at the top of the file, together with the sentence from the agreement. Apart from that, you also need to make sure that the other build setups include the new file (PCbuild, Makefile.pre.in, etc.). If you don't know how to do this, you can ask someone else to take care of this, since it usually requires domain knowledge (e.g. to add the file to the Windows builds). -- ___ Python tracker <http://bugs.python.org/issue8898> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12204] str.upper converts to title
Marc-Andre Lemburg added the comment: Ezio Melotti wrote: > > Ezio Melotti added the comment: > > '\u1ff3'.upper() returns '\u1ffc', so we have: > U+1FF3 (ῳ - GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI) > U+1FFC (ῼ - GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI) > The first belongs to the Ll (Letter, lowercase) category, whereas the second > belongs to the Lt (Letter, titlecase) category. > > The entries for these two chars in the UnicodeData.txt[0] files are: > 1FF3;GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI;Ll;0;L;03C9 > 0345;;;;N;;;1FFC;;1FFC > 1FFC;GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI;Lt;0;L;03A9 > 0345;;;;N;;;;1FF3; > > U+1FF3 has U+1FFC in both the third last and last field > (Simple_Uppercase_Mapping and Simple_Titlecase_Mapping respectively -- see > [1]), so .upper() is doing the right thing here. > U+1FFC has U+1FF3 in the second last field (Simple_Lowercase_Mapping), but > since it's category is not Lu, but Lt, .isupper() returns False. > > The Unicode Standard Annex #44[2] defines the Lt category as: > Lt Titlecase_Letter a digraphic character, with first part uppercase > > I'm not sure there's anything to fix here, both function behave as > documented, and it might indeed be the case that .upper() returns chars with > category Lt, that then return False with .isupper() > > [0]: http://unicode.org/Public/UNIDATA/UnicodeData.txt > [1]: http://www.unicode.org/reports/tr44/#UnicodeData.txt > [2]: http://www.unicode.org/reports/tr44/#GC_Values_Table I think there's a misunderstanding here: title cased characters are ones typically used in titles of a document. They don't necessarily have to be upper case, though, since some characters are never used as first letters of a word. Note that .upper() also does not guarantee to return an upper case character. It just applies the mapping defined in the Unicode standard and if there is no such mapping, or Python does not support the mapping, the method returns the original character. The German ß is such a character (U+00DF). It doesn't have an uppercase mapping in actual use and only received such a mapping in Unicode 5.1 based on rather controversial grounds (see http://en.wikipedia.org/wiki/ẞ). The character is normally mapped to 'SS' when converting it to upper case or title case. This multi-character mapping is not supported by Python, so .upper() just returns U+00DF. I suggest to close this ticket as invalid or to add a note to the documentation explaining how the mapping is applied (and when not). -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue12204> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
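The properties quoted above can be checked directly with the unicodedata module (the output depends on the Unicode database version bundled with the interpreter):

    import unicodedata

    for ch in ("\u1ff3", "\u1ffc", "\u00df"):
        print("U+%04X %-45s %s" % (ord(ch), unicodedata.name(ch),
                                   unicodedata.category(ch)))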
[issue6490] os.popen documentation in 2.6 is probably wrong
Marc-Andre Lemburg added the comment: Chris Rebert wrote: > > Chris Rebert added the comment: > > Per msg129958, attached is my stab at a patch to replace most uses of > os.popen() with the subprocess module. The test suite passes on my Mac, but > the patch does touch some specific-to-other-platform code, so further testing > is obviously needed. > This is my first non-docs patch, please be gentle. :) [Those patches were to > subprocess' docs though!] > > Stuff still using os.popen() that the patch doesn't fix: > - multiprocessing > - platform.popen() [which is itself deprecated] > - subprocess.check_output() > - Lib/test/test_poll.py > - Lib/test/test_select.py > - Lib/distutils/tests/test_cygwinccompiler.py > > Also, I suppose Issue 9382 should be marked as a dupe of this one? Thanks, but I still don't understand why os.popen() wasn't removed from the list of deprecated APIs as per Guido's message further up on the ticket. If you look at the amount of code you need to add in order to support the os.popen() functionality directly using subprocess instead of going the indirect way via the existing os.popen() wrapper around the subprocess functionality, I think this shows that the wrapper is indeed a good thing to have and something you'd otherwise implement anyway as part of standard code refactoring. So instead of applying such a patch, I think we should add back the documentation for os.popen() and remove the deprecation notices. The deprecations for os.popenN() are still fine, since those APIs are not used all that much, and I'm sure that no one can really remember what all the different versions do anyway :-) os.popen() OTOH is often used and implements a very common use: running an external command and getting the stdout results back for further processing. -- ___ Python tracker <http://bugs.python.org/issue6490> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
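The common use case the comment refers to, in both spellings ('echo hello' is just a stand-in command):

    import os
    import subprocess

    # os.popen(): run a command and read its stdout.
    out1 = os.popen("echo hello").read()

    # The same thing spelled out via subprocess.
    out2 = subprocess.check_output("echo hello", shell=True,
                                   universal_newlines=True)

    print(out1 == out2)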
[issue7511] msvc9compiler.py: ValueError: [u'path']
Marc-Andre Lemburg added the comment: mike bayer wrote: > > > > mike bayer added the comment: > > > > regarding "hey this is an MS bug not Python", projects which feature > > optional C extensions are starting to apply workarounds for the issue on > > their end (I will need to commit a specific catch for this to SQLAlchemy) - > > users need to install our software and we need to detect compilation > > failures as a sign to move on without it. I think it's preferable for > > Python distutils to work around an MS issue rather than N projects having > > to work around an MS issue exposed through distutils. Seems like this bug > > has been out there a real long time...bump ? This is not really an MS issue. Setting up the environment to be able to compile extensions is a prerequisite on most platforms and with most compilers. MS VC++ supports having multiple compiler versions on the same machine and allow compiling to x86, x64 and ia64 (at least in more recent VC++ versions). I think it's fair to ask the user to setup the environment correctly before running "python setup.py install", since distutils doesn't really know which compiler to use - e.g. you could be cross-compiling for x64 on an x86 machine, or you may want to use VC 2008 instead of a more recently installed VC 2010. Wouldn't it be better to have distutils tell the user about the possible options, instead of guessing and then possibly compiling extensions which later on don't import or import, but don't work as expected ? Regarding the latest patch: This is not the right approach, since find_vcvarsall() is supposed to return the path to the vcvarsall.bat file and not an architecture specific setup file. It is later called with the arch identifier, which the arch specific setup files don't check or use. Also note that vcvarsall.bat can take these options: x86 (default), x64, amd64, x86_amd64, ia64, x86_ia64 The x86_* options setup the cross compilers. -- nosy: +lemburg ___ Python tracker <http://bugs.python.org/issue7511> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue7511] msvc9compiler.py: ValueError when trying to compile with VC Express
Marc-Andre Lemburg added the comment: Stefan Krah wrote: > > > > Stefan Krah added the comment: > > > > Marc-Andre Lemburg wrote: >> >> Wouldn't it be better to have distutils tell the user about the >> >> possible options, instead of guessing and then possibly compiling >> >> extensions which later on don't import or import, but don't work >> >> as expected ? > > > > That would be an option, yes. > > > > >> >> Regarding the latest patch: This is not the right approach, since >> >> find_vcvarsall() is supposed to return the path to the vcvarsall.bat >> >> file and not an architecture specific setup file. It is later >> >> called with the arch identifier, which the arch specific setup files >> >> don't check or use. > > > > The patch does not change anything for Visual Studio Pro. In Visual Studio > > Express (+SDK) vcvarsall.bat is broken, so the architecture specific setup > > files have to be used (they also work with a superfluous parameter). I guess what I wanted to say is that find_vcvarsall() should return None for VC Express and code using it should then revert to using a new find_vcvars() function, which takes the architecture as parameter and returns the path to the correct architecture setup file. Hacking the support into find_vcvarsall() is not the right approach. You have to add this support one level further up. >> >> Also note that vcvarsall.bat can take these options: >> >> >> >>x86 (default), x64, amd64, x86_amd64, ia64, x86_ia64 >> >> >> >> The x86_* options setup the cross compilers. > > > > I think the patch covers all architecture specific files that are > > present in the Visual Studio Express + SDK setup. Right, but it doesn't cover the ones available in VS Pro (see above), which it should for completeness. > > Visual Studio Pro is protected from all changes by checking for > > the presence of the file bin\amd64\vcvarsamd64.bat. This > > could probably be done more elegantly by using some obscure > > registry value. > > > > > > > > As Thorsten mentioned, another option would be to copy bin\vcvars64.bat > > to bin\amd64\vcvarsamd64.bat if the latter is not present. > > > > This is harmless, but it is perhaps not really the business of Python > > to mess with existing installs. Not a good idea :-) PS: Changing the title, since I keep getting the following error messages from the email interface: There were problems handling your subject line argument list: - not of form [arg=value,value,...;arg=value,value,...] Subject was: "Re: [issue7511] msvc9compiler.py: ValueError: [u'path']" -- title: msvc9compiler.py: ValueError: [u'path'] -> msvc9compiler.py: ValueError when trying to compile with VC Express ___ Python tracker <http://bugs.python.org/issue7511> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue12326] Linux 3: tests should avoid using sys.platform == 'linux2'
Marc-Andre Lemburg added the comment:

Martin v. Löwis wrote:
>
> Martin v. Löwis added the comment:
>
>> The change to sys.platform=='linux' would break code even on current
>> platforms.
>
> Correct. Compared to introducing 'linux3', I consider this the better
> change - it likely breaks earlier (i.e. when porting to Python 3.3).
>
>> OTOH, we have sys.platform=='win32' even on Windows 64bit; would this
>> favor keeping 'linux2' on all versions of Linux as well?
>
> While this has better compatibility, it's also a constant source of
> irritation. Introducing 'win64' would have been a worse choice (just
> as introducing 'linux3' would: incompatibility for no gain, since
> the distinction between win32 and win64, from a Python POV, is
> irrelevant). Plus, Microsoft dislikes the term Win64 somewhat, and
> rather wants people to refer to the "Windows API".
>
> I personally disliked 'linux2' when it was introduced, for its
> incompatibilities. Anticipating that, some day, we may have 'Linux 4',
> and so on, I still claim it is better to fix this now. We could even
> come up with a 2to3 fixer for people who dual-source their code.

I think we should keep the old mechanism for determining sys.platform in place (letting it break on Linux to raise awareness) and add a new, better attribute along the lines of what Martin suggested:

    sys.system == 'linux', 'freebsd', 'openbsd', 'windows', etc. (without version)

and a new

    sys.system_info == system release info (named) tuple, similar to sys.version_info, to query a specific system version.

As already noted, direct sys.platform testing already breaks for OpenBSD, FreeBSD and probably a few other OSes as well with every major OS release, so the Linux breakage is not really new in any way.

--
___ Python tracker <http://bugs.python.org/issue12326> ___
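Until something like this exists, application code can approximate the proposed attributes with the platform module; sys.system and sys.system_info do not exist, so the helpers below are only a sketch of the idea, with SystemInfo mirroring the shape of sys.version_info:

    import platform
    from collections import namedtuple

    SystemInfo = namedtuple("SystemInfo", "major minor micro")

    def get_system():
        # Version-less system name, e.g. 'linux', 'freebsd', 'windows', 'darwin'
        return platform.system().lower()

    def get_system_info():
        # Parse the OS release string, e.g. '3.0.0-13-generic' -> SystemInfo(3, 0, 0)
        parts = []
        for field in platform.release().split(".")[:3]:
            digits = "".join(ch for ch in field if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        while len(parts) < 3:
            parts.append(0)
        return SystemInfo(*parts)

    # Code can then test the OS without tripping over kernel version bumps:
    #     if get_system() == "linux":
    #         ...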
[issue12266] str.capitalize contradicts oneself
Marc-Andre Lemburg added the comment:

I think it would be better to use this code:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

since this actually implements what the doc-string says.

Note that title case is not the same as upper case. Title case is a special case that gets applied when using a string as the title of a text and may well include characters that are lower case but which are only used in titles.

--
___ Python tracker <http://bugs.python.org/issue12266> ___
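A quick way to see that the two mappings really differ is the Latin dz digraph, which has separate upper-case and title-case forms:

    # U+01C6 has distinct upper-case (U+01C4) and title-case (U+01C5) mappings.
    import unicodedata

    ch = u"\u01c6"                       # LATIN SMALL LETTER DZ WITH CARON
    print(repr(ch.upper()))              # the upper-case digraph, U+01C4
    print(repr(ch.title()))              # the title-case digraph, U+01C5
    print(unicodedata.name(ch.title()))  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON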
[issue12266] str.capitalize contradicts oneself
Marc-Andre Lemburg added the comment:

Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Do you mean "if (!Py_UNICODE_ISLOWER(*s)) {" (with the '!')?

Sorry, here's the correct version:

    if (!Py_UNICODE_ISUPPER(*s)) {
        *s = Py_UNICODE_TOUPPER(*s);
        status = 1;
    }
    s++;
    while (--len > 0) {
        if (!Py_UNICODE_ISLOWER(*s)) {
            *s = Py_UNICODE_TOLOWER(*s);
            status = 1;
        }
        s++;
    }

> This sounds fine to me, but with this approach all the uncased characters
> will go through a Py_UNICODE_TO* macro, whereas with the current code only
> the cased ones are converted. I'm not sure this matters too much though.
>
> OTOH if the non-lowercase cased chars are always either upper or titlecased,
> checking for both should be equivalent.

AFAIK, there are characters that don't have a case mapping at all. It may also be the case that a non-cased character still has a lower/upper case mapping, e.g. for typographical reasons. Someone would have to check this against the current Unicode database.

--
___ Python tracker <http://bugs.python.org/issue12266> ___
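A rough scan of the kind suggested in the last paragraph can be done with the unicodedata module; this only walks the BMP and is meant as a quick check, not an exhaustive analysis:

    # Find characters that are not cased letters according to the database but
    # still map to something different under upper()/lower().
    import unicodedata

    hits = []
    for cp in range(0x10000):
        ch = chr(cp)  # use unichr() on Python 2
        cat = unicodedata.category(ch)
        if cat in ("Lu", "Ll", "Lt"):
            continue  # cased letters are not interesting here
        if ch.upper() != ch or ch.lower() != ch:
            hits.append((hex(cp), cat, unicodedata.name(ch, "<unnamed>")))

    for hit in hits[:20]:
        print(hit)   # e.g. combining marks and enclosed letters show up here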
[issue9528] Add pure Python implementation of time module to CPython
Marc-Andre Lemburg added the comment:

Alan Justino wrote:
>
> I am having a hard time trying to do some BDD with the C-based datetime because
> I cannot mock it easily to force datetime.datetime.now() to return a desired
> value, making it almost impossible to test time-based code, like the accounting
> system that I am refactoring right now.

It's usually better to use a central helper get_current_time() in the application than to use datetime.now() and others directly.

--
___ Python tracker <http://bugs.python.org/issue9528> ___
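A minimal sketch of that indirection; the module and helper names (clock.py, get_current_time) are illustrative, not something the stdlib provides:

    # clock.py
    import datetime

    def get_current_time():
        """The one place where the application asks for 'now'."""
        return datetime.datetime.now()

    # In a test, the helper can simply be swapped out, without touching the
    # C-implemented datetime type:
    #
    #     import clock, datetime
    #     fixed = datetime.datetime(2011, 8, 31, 23, 59, 0)
    #     original, clock.get_current_time = clock.get_current_time, lambda: fixed
    #     try:
    #         ...  # exercise the time-based code
    #     finally:
    #         clock.get_current_time = original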
[issue2857] add codec for java modified utf-8
Marc-Andre Lemburg added the comment:

Tom Christiansen wrote:
>
> Tom Christiansen added the comment:
>
> Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:
>
> http://unicode.org/reports/tr26/
>
> CESU-8 is *not* a valid Unicode Transformation Format and should not be called
> UTF-8. It is a real pain in the butt, caused by people who misunderstand
> Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need
> to be able to read it, but call it what it is, please.
>
> Despite the talk about Lucene, I note that the Perl port of Lucene uses real
> UTF-8, not CESU-8.

CESU-8 is a different encoding than the one we are talking about. The only difference between UTF-8 and the modified one is the different encoding for the U+0000 code point, so that the output does not contain any NUL bytes.

--
___ Python tracker <http://bugs.python.org/issue2857> ___
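The difference is easy to show: standard UTF-8 encodes U+0000 as a single NUL byte, while the Java-modified form uses the overlong two-byte sequence 0xC0 0x80. The encode_modified() helper below is a throwaway illustration, not a codec that ships with Python:

    text = u"a\x00b"

    print(text.encode("utf-8"))          # b'a\x00b' -- contains a NUL byte

    def encode_modified(s):
        # Encode each character as UTF-8, but map U+0000 to the 0xC0 0x80 form.
        return b"".join(
            b"\xc0\x80" if ch == u"\x00" else ch.encode("utf-8")
            for ch in s
        )

    print(encode_modified(text))         # b'a\xc0\x80b' -- no NUL bytes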
[issue2857] Add "java modified utf-8" codec
Marc-Andre Lemburg added the comment:

Corrected the title again. See my comment.

--
title: Add CESU-8 codec ("java modified utf-8") -> Add "java modified utf-8" codec
versions: +Python 3.3 -Python 2.7, Python 3.2

___ Python tracker <http://bugs.python.org/issue2857> ___
[issue2857] Add "java modified utf-8" codec
Marc-Andre Lemburg added the comment:

Marc-Andre Lemburg wrote:
>
> Corrected the title again. See my comment.

Please open a new ticket if you want to add a CESU-8 codec.

Looking at the relevant use cases, I'm at most +0 on adding the modified UTF-8 codec. I think such codecs can well live outside the stdlib on PyPI.

--
___ Python tracker <http://bugs.python.org/issue2857> ___
[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Marc-Andre Lemburg added the comment:

> Keep in mind that we should be able to access and use lone surrogates too,
> therefore:
>
> s = '\ud800'  # should be valid
> len(s)        # should this raise an error? (or return 0.5 ;)?
> s[0]          # error here too?
> list(s)       # here too?
>
> p = s + '\udc00'
> len(p)        # 1?
> p[0]          # '\U00010000' ?
> p[1]          # IndexError?
> list(p + 'a') # ['\ud800\udc00', 'a']?
>
> We can still decide that strings with lone surrogates work only with a
> limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
>
> Another thing I noticed is that (at least on wide builds) surrogate pairs are
> not joined "on the fly":
>
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1

Hi Tom, welcome to Python land :-) Here's some more background information on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in Unicode objects and ones encoded in transfer formats by codecs. We generally do allow lone surrogates, unassigned code points, lone combining code points, etc. in Unicode objects, since Python needs to be able to work on all Unicode code points and build strings with them.

The transfer format codecs do try to combine surrogates on decoding data on UCS4 builds. On UCS2 builds they create surrogate pairs as necessary. On output, those pairs will again be joined to get round-trip safety.

It helps if you think of Python's Unicode objects as using UCS2 and UCS4 instead of UTF-16/32. Python does try to make working with UCS2 easy and in many cases behaves as if it were using UTF-16 internally, but there are, of course, limits to this. In practice, you only rarely get to see any of these special cases, since non-BMP code points are usually not found in everyday use. If they do become a problem for you, you have the option of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the terminology of those versions, some of which has changed in more recent versions of Unicode.

For more background information, you might want to take a look at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened: you'll also find that collation and compression were already on the plate back then, but since no one stepped forward, they weren't implemented.

Cheers,
--
Marc-Andre Lemburg
eGenix.com

2011-10-04: PyCon DE 2011, Leipzig, Germany ... 50 days to go

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

--
nosy: +lemburg
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug

___ Python tracker <http://bugs.python.org/issue12729> ___
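For reference, this is how the build difference described above shows up on the narrow and wide interpreters of that era; sys.maxunicode tells the two builds apart:

    import sys

    print(sys.maxunicode)   # 65535 on a narrow (UCS2) build, 1114111 on a wide (UCS4) build

    s = u"\U00010000"       # a single non-BMP code point
    print(len(s))           # 2 on a narrow build (stored as a surrogate pair), 1 on a wide build
    print(repr(s[0]))       # u'\ud800' on a narrow build, the full character on a wide build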
[issue12752] locale.normalize does not take unicode strings
Marc-Andre Lemburg added the comment:

Julian Taylor wrote:
>
> New submission from Julian Taylor:
>
> using unicode strings for locale.normalize gives the following traceback with
> python2.7:
>
> ~$ python2.7 -c 'import locale; locale.normalize(u"en_US")'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/usr/lib/python2.7/locale.py", line 358, in normalize
>     fullname = localename.translate(_ascii_lower_map)
> TypeError: character mapping must return integer, None or unicode
>
> with python2.6 it works and it also works with non-unicode strings in 2.7

This looks like a side-effect of the change Antoine made to the locale module when trying to make the case mapping work in a locale-independent way.

--
nosy: +lemburg

___ Python tracker <http://bugs.python.org/issue12752> ___
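Until the regression is fixed, a caller on 2.7 can sidestep it by handing locale.normalize() a byte string; this is a sketch of a workaround at the call site, not a fix for the module itself:

    import locale

    name = u"en_US"
    normalized = locale.normalize(name.encode("ascii"))  # byte strings still work on 2.7
    print(normalized)                                     # e.g. 'en_US.ISO8859-1'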