On approximately 4/28/2009 10:53 AM, came the following characters from
the keyboard of James Y Knight:
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
James Y Knight wrote:
Hopefully it can be assumed that your locale encoding really is a
non-overlapping superset of ASCII, as is required by POSIX...
Can you please point to the part of the POSIX spec that says that
such overlapping is forbidden?
I can't find it...I would've thought it would be on this page:
http://opengroup.org/onlinepubs/007908775/xbd/charset.html
but it's not (at least, not obviously). That does say (effectively) that
all encodings must be supersets of ASCII and use the same codepoints,
though.
However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire
reason why EUC-JP was created, so I'm pretty sure that it is in fact
inappropriate, and I cannot find any evidence of it ever being used on
any system.
It would seem from the definition of ISO-2022 that what it calls "escape
sequences" is in your POSIX spec called "locking-shift encoding".
Therefore, the second bullet item under the "Character Encoding" heading
prohibits use of ISO-2022, for whatever uses that document defines
(which, since you referenced it, I assume means locales, and possibly
file system encodings, but I'm not familiar with the structure of all
the POSIX standards documents).
A locking-shift encoding (where the state of the character is determined
by a shift code that may affect more than the single character following
it) cannot be defined with the current character set description file
format. Use of a locking-shift encoding with any of the standard
utilities in the XCU specification or with any of the functions in the
XSH specification that do not specifically mention the effects of
state-dependent encoding is implementation-dependent.
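To make the locking-shift problem concrete, here is a minimal Python 3 sketch. It assumes CPython's bundled iso2022_jp codec as the example encoding (my choice, not something the spec names). The same two bytes decode differently depending on an escape sequence seen earlier in the stream.

```python
# The two bytes 0x24 0x22 are '$' and '"' in ASCII.
PAIR = b'$"'

# In the initial (ASCII) state they decode as themselves.
plain = PAIR.decode("iso2022_jp")

# After ESC $ B (shift to JIS X 0208) the same two bytes form a single
# character; ESC ( B shifts back to ASCII afterwards.
shifted = (b"\x1b$B" + PAIR + b"\x1b(B").decode("iso2022_jp")

print(repr(plain), repr(shifted))
```

So a byte that looks like ASCII cannot be interpreted without knowing the shift state, which is the overlap being discussed.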
From http://en.wikipedia.org/wiki/EUC-JP:
"To get the EUC form of an ISO-2022 character, the most significant bit
of each 7-bit byte of the original ISO 2022 codes is set (by adding 128
to each of these original 7-bit codes); this allows software to easily
distinguish whether a particular byte in a character string belongs to
the ISO-646 code or the ISO-2022 (EUC) code."
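The high-bit rule in that quote can be sketched in a few lines of Python 3. The helper name to_euc is made up for illustration, and the sample character (HIRAGANA LETTER A, JIS X 0208 bytes 0x24 0x22) is my own choice of example.

```python
# JIS X 0208 code for HIRAGANA LETTER A, as two 7-bit bytes.
JIS_7BIT = b"\x24\x22"

def to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of every byte (i.e. add 128)."""
    return bytes(b | 0x80 for b in seven_bit)

euc = to_euc(JIS_7BIT)

# Every resulting byte is >= 0x80, so none can be mistaken for ASCII.
print(euc, euc.decode("euc_jp"))
```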
Also:
http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
I'm a bit scared at the prospect that U+DCAF could turn into "/", that
just screams security vulnerability to me. So I'd like to propose that
only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.
It would actually be U+DC2F that would turn into /.
Yes, I meant to say DC2F, sorry for the confusion.
I'm happy to exclude that range from the mapping if POSIX really
requires an encoding not to be overlapping with ASCII.
I think it has to be excluded from mapping in order to not introduce
security issues.
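A rough Python 3 sketch of the concern, assuming a hypothetical encoder that maps all of U+DC00-U+DCFF back to bytes, compared with one restricted to U+DC80-U+DCFF. All function names here are invented for illustration and are not part of any proposal text.

```python
def unescape_any(ch: str) -> bytes:
    """Hypothetical unrestricted mapping: U+DC00..U+DCFF -> 0x00..0xFF."""
    return bytes([ord(ch) - 0xDC00])

def unescape_high_only(ch: str) -> bytes:
    """Restricted mapping: only U+DC80..U+DCFF -> 0x80..0xFF."""
    if not 0xDC80 <= ord(ch) <= 0xDCFF:
        raise ValueError("refusing to unescape U+%04X" % ord(ch))
    return bytes([ord(ch) - 0xDC00])

def encode_name(name: str, unescape) -> bytes:
    """Encode ASCII text, passing escaped code points to `unescape`."""
    out = bytearray()
    for ch in name:
        if 0xDC00 <= ord(ch) <= 0xDCFF:
            out += unescape(ch)
        else:
            out += ch.encode("ascii")
    return bytes(out)

name = "upload\udc2fetc"  # contains no "/" character as a string
print(encode_name(name, unescape_any))  # a path separator appears in the bytes
try:
    encode_name(name, unescape_high_only)
except ValueError as err:
    print("rejected:", err)
```

A check for "/" performed on the string would pass, yet the unrestricted encoder still emits a path separator; the restricted one refuses.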
However...
There's also Shift-JIS to worry about...which apparently some people
actually want to use as their default encoding, despite it being broken
to do so. Red Hat apparently refuses to provide it as a locale charset
(due to its brokenness), and it's also not available by default on my
Debian system. People do unfortunately seem to actually use it in real
life.
https://bugzilla.redhat.com/show_bug.cgi?id=136290
So, I'd like to propose this:
The "python-escape" error handler when given a non-decodable byte from
0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a
non-decodable byte from 0x00 to 0x7F, it will be converted to
U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are
encoded into 0x80 to 0xFF, and all other characters are treated in
whatever way the encoding would normally treat them.
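For concreteness, the rules above could be sketched as a Python 3 error handler along these lines. This is only an illustration under my own assumptions: it is registered under the made-up name "python-escape-sketch", it is written against Python 3's codecs.register_error rather than the Python 2 syntax used elsewhere in this thread, and it is not anyone's actual implementation.

```python
import codecs

def python_escape_sketch(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Bytes 0x80..0xFF become U+DC80..U+DCFF; bytes 0x00..0x7F
        # become U+0000..U+007F.
        text = "".join(chr(0xDC00 + b) if b >= 0x80 else chr(b) for b in bad)
        return text, exc.end
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        # Only U+DC80..U+DCFF are turned back into bytes; anything else is
        # left to fail the way the encoding would normally fail.
        if all(0xDC80 <= ord(ch) <= 0xDCFF for ch in bad):
            return bytes(ord(ch) - 0xDC00 for ch in bad), exc.end
    raise exc

codecs.register_error("python-escape-sketch", python_escape_sketch)

raw = b"caf\xe9/\xff"
text = raw.decode("ascii", "python-escape-sketch")
back = text.encode("ascii", "python-escape-sketch")
print(repr(text), back == raw)
```

Note that the encode branch never produces a byte below 0x80, so U+DC2F cannot become "/".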
This proposal obviously works for all non-overlapping ASCII supersets,
where 0x00 to 0x7F always decode to U+0000 to U+007F. But it also works for
Shift-JIS and other similar ASCII supersets with overlaps in the trailing
bytes of a multibyte sequence. So, a sequence like
"\x81\xFD".decode("shift-jis", "python-escape") will turn into
u"\uDC81\uDCFD", which will then properly encode back into "\x81\xFD".
The character sets this *doesn't* work for are: EBCDIC code pages
(obviously completely unsuitable for a locale encoding on Unix),
Why is that obvious? The only thing I saw that could exclude EBCDIC
would be the requirement that the codes be positive in a char, but on a
system where the C compiler treats char as unsigned, EBCDIC would qualify.
Of course, the use of EBCDIC would also restrict the other possible code
pages to those derived from EBCDIC (rather than the bulk of code pages
that are derived from ASCII), due to:
If the encoded values associated with each member of the portable
character set are not invariant across all locales supported by the
implementation, the results achieved by an application accessing those
locales are unspecified.
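A small Python 3 sketch of both points, assuming cp037 as a representative EBCDIC code page (my choice) and an 8-bit two's-complement signed char. The helper as_signed_char is invented for illustration.

```python
def as_signed_char(b: int) -> int:
    """Value of byte b when read through an 8-bit signed char."""
    return b - 256 if b >= 0x80 else b

a_ebcdic = "a".encode("cp037")[0]
slash_ebcdic = "/".encode("cp037")[0]

# 'a' lies above 0x7F in EBCDIC, so it is negative in a signed char.
print(hex(a_ebcdic), as_signed_char(a_ebcdic))

# The portable character set is not invariant between the two families:
# '/' has a different code in EBCDIC than in ASCII.
print(hex(slash_ebcdic), hex(ord("/")))
```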
iso2022-* (covered above), and shift-jisx0213 (because it has replaced \
with yen, and - with overline).
If it's desirable to work with shift_jisx0213, a modification of the
proposal can be made: Change the second sentence to: "When given a
non-decodable byte from 0x00 to 0x7F, that byte must be the second or
later byte in a multibyte sequence. In such a case, the error handler
will produce the decoding of that byte as if it were standing alone (thus
in most encodings, \x00-\x7f turn into U+0000-U+007F)."
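The decode side of that modified rule might look roughly like this in Python 3. It is a sketch under assumptions of mine: the function name is invented, and "second or later byte in a multibyte sequence" is approximated as "not the first byte of the span the codec reported as undecodable", which may not match how a given codec reports errors.

```python
def python_escape_variant(exc):
    """Decode-side sketch of the modified rule (hypothetical name)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    out = []
    for offset, b in enumerate(bad):
        if b >= 0x80:
            out.append(chr(0xDC00 + b))
        elif offset == 0:
            # A low byte may only appear as a trailing byte.
            raise exc
        else:
            # Decode the byte as if it stood alone in this encoding.
            out.append(bytes([b]).decode(exc.encoding))
    return "".join(out), exc.end

# Exercised directly with a hand-built error, using "ascii" as the
# encoding so the standalone decoding of 0x5C is unambiguous.
demo = UnicodeDecodeError("ascii", b"\x81\x5c", 0, 2, "demo")
print(python_escape_variant(demo))
```

With an encoding that decodes a lone 0x5C to something other than a backslash, the second element would follow that encoding instead.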
It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like
some people do actually use shift_jisx0213, unfortunately.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking