[issue45120] Windows cp encodings "UNDEFINED" entries update
New submission from Rafael Belo:

There is a mismatch between specification and behavior in some Windows encodings. Some older Windows code page specifications list certain bytes as "UNDEFINED", whereas in practice those bytes follow a different behavior, which is documented in a separate set of mappings named "bestfit". For example, CP1252 has a corresponding bestfit1252:

CP1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
bestfit1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

From these files, in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", whereas in bestfit1252 they map to \u0081 \u008d \u008f \u0090 \u009d respectively.

In the Windows API, the function 'MultiByteToWideChar' exhibits the bestfit1252 behavior.

This issue and PR propose a correction for this behavior, updating the Windows code pages where some code points were defined as "UNDEFINED" to the corresponding bestfit mapping.

Related issue: https://bugs.python.org/issue28712

--
components: Demos and Tools, Library (Lib), Unicode, Windows
messages: 401181
nosy: ezio.melotti, lemburg, paul.moore, rafaelblsilva, steve.dower, tim.golden, vstinner, zach.ware
priority: normal
severity: normal
status: open
title: Windows cp encodings "UNDEFINED" entries update
type: behavior

___
Python tracker <https://bugs.python.org/issue45120>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
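For readers who want to reproduce the mismatch described in the submission, here is a minimal sketch. It assumes a CPython whose bundled cp1252 codec still leaves these five bytes undefined; the helper name `decode_or_none` is made up for illustration and is not part of the PR.

```python
# The five bytes that CP1252.TXT marks as UNDEFINED.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_or_none(byte_value, encoding):
    """Decode a single byte, returning None if the codec rejects it."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


# What Python's cp1252 codec does with each byte (None means "error").
results = {b: decode_or_none(b, "cp1252") for b in UNDEFINED_IN_CP1252}

# What bestfit1252 maps each byte to: the C1 control with the same number.
bestfit_targets = {b: chr(b) for b in UNDEFINED_IN_CP1252}

for b in UNDEFINED_IN_CP1252:
    print(f"0x{b:02x}: cp1252 -> {results[b]!r}, bestfit1252 -> U+{ord(bestfit_targets[b]):04X}")
```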
[issue45120] Windows cp encodings "UNDEFINED" entries update
Rafael Belo added the comment:

As encodings are indeed a complex topic, debating this seems like a necessity. I researched this topic when I found an encoding issue regarding a MySQL connector: https://github.com/PyMySQL/mysqlclient/pull/502

In MySQL itself, "latin1" is a mislabel for "cp1252": what MySQL calls "latin1" presents the behavior of cp1252. As Inada Naoki pointed out:

"""
See this: https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html

MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

So latin1 in MySQL is actually cp1252.
"""

You can verify this by passing the byte 0x80 through it to get the string representation. A quick test I find useful:

On MySQL:
select convert(unhex('80') using latin1); -- -> returns "€"

On PostgreSQL:
select convert_from(E'\\x80'::bytea, 'WIN1252'); -- -> returns "€"
select convert_from(E'\\x80'::bytea, 'LATIN1'); -- -> returns the C1 control character U+0080 (UTF-8 bytes "0xc2 0x80")

I decided to try to fix this behavior in Python because I always found it a little odd to receive errors for those code points.
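The same 0x80 check can be run from Python. This is only a sketch using the standard "cp1252" and "latin-1" codecs, and the variable names are illustrative:

```python
euro_byte = b"\x80"

# cp1252 assigns the Euro sign to 0x80.
as_cp1252 = euro_byte.decode("cp1252")

# ISO 8859-1 (latin-1) maps 0x80 straight to the C1 control U+0080.
as_latin1 = euro_byte.decode("latin-1")

# Encoded as UTF-8, U+0080 is the two-byte sequence 0xc2 0x80,
# which matches what the PostgreSQL LATIN1 query above returns.
latin1_as_utf8 = as_latin1.encode("utf-8")

print(repr(as_cp1252), repr(as_latin1), latin1_as_utf8)
```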
A discussion I find particularly useful is this one: https://comp.lang.python.narkive.com/C9oHYxxu/latin1-and-cp1252-inconsistent

I think the participants there didn't notice the "WindowsBestFit" folder at: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/

Digging through the commits to look for dates, I realized that Amaury Forgeot d'Arc created a tool to generate the Windows encodings based on calls to "MultiByteToWideChar", which indeed generates the same mapping available on the Unicode website. I've attached the file generated by it.

Since there might be legacy systems which rely on this "specific" behavior, I don't think back-porting this update to older Python versions is a good idea. That is why I think this should come in new versions and be treated as a "new behavior".

The benefit I see in updating this is to prevent further confusion about the expected behavior when dealing with those encodings.

--
Added file: https://bugs.python.org/file50282/cp1252_from_genwincodec.py
[issue45120] Windows cp encodings "UNDEFINED" entries update
Rafael Belo added the comment:

Eryk,

Regarding codecsmodule.c, I don't really know its inner workings or how it is connected to other modules, so changes at that level are not critical for this use case. But it is good to think about and evaluate that level too, since there might be some tricky situations on Windows systems because of that grey zone.

My proposal really aims to enhance the Lib/encodings/ module and, as Marc-Andre Lemburg advised, to change those mappings only in case of official corrections to the standard itself. I now think that following those standards strictly is a good idea.

On top of that, adding the bestfit mappings under different names seems like a better idea anyway, since those standards can be seen as different if you take a strict look at the Unicode definitions. Adding them would suffice for the needs that might arise, would still allow for catching mismatched encodings, and could even be back-ported to older Python versions.

I will adjust the PR according to these comments. Thanks for the feedback!

--
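To illustrate what "adding them under different names" could look like from user code, here is a rough sketch that registers a separate charmap codec next to the unchanged cp1252. The codec name "bestfit1252_sketch" and every helper below are hypothetical and are not taken from the PR. The sketch also assumes the bundled cp1252 codec still rejects the five "UNDEFINED" bytes, and it covers only the byte-to-character side of the bestfit file.

```python
import codecs

# Hypothetical codec name, chosen for this sketch only.
CODEC_NAME = "bestfit1252_sketch"

# Bytes marked UNDEFINED in CP1252.TXT; bestfit1252 maps them to C1 controls.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def _build_decoding_table():
    """Start from cp1252 and fill the undefined slots with U+0081 etc."""
    chars = []
    for byte in range(256):
        if byte in UNDEFINED_IN_CP1252:
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


DECODING_TABLE = _build_decoding_table()
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    """Codec search function: answer only for the sketch codec name."""
    if name == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)

print(repr(b"\x80\x81".decode(CODEC_NAME)))
```

Because the strict cp1252 codec is left untouched, code that decodes with "cp1252" still raises on the undefined bytes, so mismatched encodings can still be caught, while callers that need the Windows API behavior opt in by name.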