Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Tue, 28 Apr 2009 12:08:47 -0700

On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight:

On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
James Y Knight wrote:
Hopefully it can be assumed that your locale encoding really is a
non-overlapping superset of ASCII, as is required by POSIX...
Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?
I can't find it...I would've thought it would be on this page:
http://opengroup.org/onlinepubs/007908775/xbd/charset.html
but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though.
However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system.

It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents).

A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent.
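
To make the locking-shift point concrete, here is a minimal sketch, assuming a Python 3 interpreter with its standard iso2022_jp codec (the thread itself uses Python 2 syntax). It only illustrates that the same two bytes mean different things depending on an earlier escape sequence; the variable names are made up for the illustration.

```python
# Bytes 0x25 0x2F read as the two ASCII characters "%/" in the initial state.
plain = b"%/"

# After ESC $ B (switch to JIS X 0208) the same two bytes form one character;
# ESC ( B switches back to ASCII at the end.
shifted = b"\x1b$B%/\x1b(B"

ascii_state = plain.decode("iso2022_jp")
kanji_state = shifted.decode("iso2022_jp")

print(repr(ascii_state))   # two ASCII characters
print(repr(kanji_state))   # one katakana character, U+30AF

# Consequence for byte-oriented code: the encoded form of U+30AF contains
# a byte that looks exactly like the path separator "/".
print(b"/" in "\u30af".encode("iso2022_jp"))
```

A program that scans such a byte string for "/" without tracking the shift state would split in the middle of a character, which seems to be the kind of overlap the quoted paragraph leaves implementation-dependent.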

From http://en.wikipedia.org/wiki/EUC-JP:
"To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code."
Also:
http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
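
The "add 128 to each 7-bit byte" rule quoted above can be sketched as follows. This is a small illustration under the assumption of Python 3 and its standard euc_jp codec; iso2022_pair_to_euc is a hypothetical helper name, not part of any library, and it only covers two-byte JIS X 0208 codes.

```python
def iso2022_pair_to_euc(pair: bytes) -> bytes:
    """Set the high bit of each 7-bit byte (hypothetical helper)."""
    return bytes(b | 0x80 for b in pair)

# HIRAGANA LETTER A (U+3042) has the 7-bit JIS X 0208 code 0x24 0x22.
jis_pair = b"\x24\x22"
euc_bytes = iso2022_pair_to_euc(jis_pair)

# Compare against what the standard codec produces for the same character.
matches_codec = euc_bytes == "\u3042".encode("euc_jp")

# Every byte of the multibyte form is >= 0x80, so none can be mistaken
# for an ASCII byte such as "/".
all_high = all(b >= 0x80 for b in euc_bytes)

print(euc_bytes.hex(), matches_codec, all_high)
```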
I'm a bit scared at the prospect that U+DCAF could turn into "/", that
just screams security vulnerability to me.  So I'd like to propose that
only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.
It would be actually U+DC2f that would turn into /.
Yes, I meant to say DC2F, sorry for the confusion.
I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.
I think it has to be excluded from mapping in order to not introduce security issues.
However...
There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life.
https://bugzilla.redhat.com/show_bug.cgi?id=136290
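
One commonly cited example of that brokenness is a trailing byte that collides with ASCII. The sketch below assumes Python 3 and its standard shift_jis and euc_jp codecs; the variable names are invented for the illustration.

```python
# KATAKANA LETTER SO (U+30BD) encodes in Shift-JIS as a lead byte
# followed by 0x5C, the same byte value as an ASCII backslash.
encoded = "\u30bd".encode("shift_jis")
trail_is_backslash = encoded[1:] == b"\\"

# A byte-oriented split on backslash cuts the character in half.
naive_parts = encoded.split(b"\\")

# EUC-JP keeps every byte of the same character at 0x80 or above.
euc_all_high = all(b >= 0x80 for b in "\u30bd".encode("euc_jp"))

print(encoded, trail_is_backslash, naive_parts, euc_all_high)
```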

So, I'd like to propose this:
The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them.
This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".
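
The rules proposed above could be sketched roughly like this. It is an approximation in Python 3 using codecs.register_error, not the PEP's actual implementation; the handler name "python-escape-sketch" and the functions python_escape_sketch and roundtrip are hypothetical, chosen only for this illustration.

```python
import codecs

def python_escape_sketch(exc):
    """Hypothetical handler approximating the proposed rules."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # 0x80-0xFF become U+DC80-U+DCFF; 0x00-0x7F become U+0000-U+007F.
        text = "".join(chr(0xDC00 + b) if b >= 0x80 else chr(b) for b in bad)
        return text, exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not 0xDC80 <= cp <= 0xDCFF:
                # Anything else (including U+DC2F) is left to fail as usual.
                raise exc
            out.append(cp - 0xDC00)
        return bytes(out), exc.end
    raise exc

codecs.register_error("python-escape-sketch", python_escape_sketch)

def roundtrip(raw, encoding):
    """Decode and re-encode with the sketch handler (hypothetical helper)."""
    text = raw.decode(encoding, "python-escape-sketch")
    return text, text.encode(encoding, "python-escape-sketch")

text, back = roundtrip(b"abc\xff/def", "utf-8")
print(repr(text), back)
```

With this sketch the undecodable byte 0xFF survives a round trip, while a string containing U+DC2F is rejected on encoding rather than silently becoming "/". The same roundtrip helper can be tried with other codecs such as "shift_jis", though how many bytes a given codec reports per error is up to that codec.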
The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix),

Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify.

Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to:

If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified.

iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline).
If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)."
It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately.




--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking