[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-06 Thread Rafael Belo

New submission from Rafael Belo :

There is a mismatch between specification and behavior in some Windows 
encodings.

Some older Windows code page specifications mark certain bytes as "UNDEFINED", 
whereas in practice Windows maps those bytes to other code points. The actual 
behavior is documented in a separate set of mappings named "bestfit".

For example, CP1252 has a corresponding bestfit1252: 
CP1252: 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
bestfit1252: 
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt


In CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", 
whereas in bestfit1252, they map to \u0081 \u008d \u008f \u0090 \u009d 
respectively. 

In the Windows API, the function 'MultiByteToWideChar' exhibits the bestfit1252 
behavior.
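
For comparison, here is a minimal sketch of how the "UNDEFINED" entries show up 
on the Python side. It assumes only the built-in cp1252 codec and lists the 
byte values it refuses to decode. The variable name `undefined` is just for 
illustration.

```python
# Collect every byte value that Python's built-in cp1252 codec rejects.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        # The codec's table marks this byte as UNDEFINED.
        undefined.append(value)

print([hex(v) for v in undefined])
```

On a current CPython this should list exactly the five bytes named above.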


This issue and PR propose a correction for this behavior, updating the Windows 
code pages where some code points were defined as "UNDEFINED" to the 
corresponding bestfit mapping. 


Related issue: https://bugs.python.org/issue28712

--
components: Demos and Tools, Library (Lib), Unicode, Windows
messages: 401181
nosy: ezio.melotti, lemburg, paul.moore, rafaelblsilva, steve.dower, 
tim.golden, vstinner, zach.ware
priority: normal
severity: normal
status: open
title: Windows cp encodings "UNDEFINED" entries update
type: behavior

___
Python tracker 
<https://bugs.python.org/issue45120>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-16 Thread Rafael Belo

Rafael Belo  added the comment:

As encodings are indeed a complex topic, debating this seems like a necessity. 
I researched this topic when I found an encoding issue in a MySQL 
connector: https://github.com/PyMySQL/mysqlclient/pull/502

In MySQL itself, "latin1" is a mislabel for "cp1252": what MySQL calls 
"latin1" presents the behavior of cp1252. As Inada Naoki pointed out:

"""
See this: https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html

MySQL's latin1 is the same as the Windows cp1252 character set. This means it 
is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers 
Authority) latin1, except that IANA latin1 treats the code points between 0x80 
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign 
characters for those positions. For example, 0x80 is the Euro sign. For the 
“undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 
0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

So latin1 in MySQL is actually cp1252.
"""

You can verify this by passing the byte 0x80 through each encoding and checking 
the resulting string, a quick test I find useful:

On MySQL: 
select convert(unhex('80') using latin1); -- -> returns "€"

On PostgreSQL: 
select convert_from(E'\\x80'::bytea, 'WIN1252'); -- -> returns "€"
select convert_from(E'\\x80'::bytea, 'LATIN1'); -- -> returns the C1 control character U+0080 (UTF-8 bytes 0xc2 0x80)
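
The same check can be sketched in Python, assuming only the built-in "cp1252" 
and "latin-1" codecs. The variable names are illustrative.

```python
raw = b"\x80"

# cp1252 assigns the Euro sign to 0x80.
as_cp1252 = raw.decode("cp1252")

# latin-1 maps every byte to the code point with the same value,
# so 0x80 becomes the C1 control character U+0080.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
# Re-encoding U+0080 as UTF-8 gives the two bytes PostgreSQL reports.
print(as_latin1.encode("utf-8"))
```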

I decided to try to fix this behavior in Python because I always found it 
a little odd to receive errors for those code points. A discussion I find 
particularly useful is this one: 
https://comp.lang.python.narkive.com/C9oHYxxu/latin1-and-cp1252-inconsistent

I think the participants there didn't notice the "WindowsBestFit" folder at:
https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/

Digging through the commits to look for dates, I realized that Amaury Forgeot 
d'Arc created a tool that generates the Windows encodings from calls to 
"MultiByteToWideChar", and it indeed generates the same mapping available on 
the Unicode website. I've attached the file generated by it. 


Since there might be legacy systems that rely on this specific behavior, I 
don't think backporting this update to older Python versions is a good idea. 
That is why I think this should ship only in new versions and be treated as 
new behavior.

The benefit I see in updating this is to prevent further confusion about the 
expected behavior when dealing with those encodings.

--
Added file: https://bugs.python.org/file50282/cp1252_from_genwincodec.py




[issue45120] Windows cp encodings "UNDEFINED" entries update

2021-09-17 Thread Rafael Belo


Rafael Belo  added the comment:

Eryk 

Regarding codecsmodule.c, I don't really know its inner workings or how it is 
connected to other modules, so changes at that level are not critical for this 
use case. It is still worth evaluating things at that level, though, since that 
grey zone might cause some tricky situations on Windows systems. 

My proposal really aims to enhance the Lib/encodings/ module. As Marc-Andre 
Lemburg advised, those mappings should only change in case of official 
corrections to the standard itself. I now think that following those standards 
strictly is a good idea. 

On top of that, adding the bestfit mappings under different names seems like a 
better idea anyway, since they can be seen as different standards if you take a 
strict look at the Unicode definitions. Adding them would cover the needs that 
might arise, would still allow mismatched encodings to be caught, and could 
even be backported to older Python versions.
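
As a rough illustration of the "different naming" idea, the sketch below 
registers a separate codec next to the strict cp1252 one. The codec name 
"cp1252_bestfit" and the helper names are hypothetical and are not what the PR 
uses. The sketch only fills the five "UNDEFINED" bytes and does not reproduce 
the other best-fit encoding rules in bestfit1252.txt.

```python
import codecs
from encodings import cp1252

# Start from the strict table and replace each UNDEFINED slot (U+FFFE)
# with the code point that has the same value as the byte.
_decoding_table = "".join(
    chr(index) if char == "\ufffe" else char
    for index, char in enumerate(cp1252.decoding_table)
)
_encoding_table = codecs.charmap_build(_decoding_table)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _encoding_table)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _decoding_table)


def _search(name):
    # Hypothetical codec name, chosen for this sketch only.
    if name == "cp1252_bestfit":
        return codecs.CodecInfo(
            name="cp1252_bestfit", encode=_encode, decode=_decode
        )
    return None


codecs.register(_search)

print(repr(b"\x81".decode("cp1252_bestfit")))
```

With this arrangement the strict "cp1252" codec still raises on those bytes, so 
mismatched encodings can still be caught by callers that want that.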

I will adjust the PR according to these comments. Thanks for the feedback!

--
