Indeed, POSIX has made a mess of things, hasn't it?

That said, I don't think applying PEP-383 here would make things better for
Haskell. Please bear with me through some background:

*Background*
Haskell 98 and Haskell 2010 both define the type Char this way:

> The character type Char is an enumeration whose values represent Unicode
> characters.

The footnote references the Unicode standard. However, Unicode doesn't
formally define the concept of a "Unicode character". It has several
definitions that might seem like obvious interpretations, including (from
Ch. 3, §3.4):

   - definition D7: *Abstract character* — "A unit of information used for
   the organization, control, or representation of textual data."
   - definition D10: *Code point* — any integer in the range 0 to 10FFFF₁₆,
   but not all of these are characters, and some are explicitly "Noncharacters"
   - definition D11: *Encoded character* — an abstract character assigned to
   a code point

But none of these are quite the right match for either what seems to have
been intended by the standard, or what is actually implemented in GHC. For
that, there is quite a good definition in Unicode (from Ch. 3, §3.9):

   - definition D76: *Unicode scalar value* — "Any Unicode code point except
   high-surrogate and low-surrogate code points." Equivalently, any integer
   in the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆.

This choice for Char fits well because it is immune to the specifics of
abstract character assignment (past or future), it can be efficiently
implemented, and it is precisely the set of values that can be encoded and
decoded from any Unicode encoding form (see definition D79). It has one
other property, which seems crucial (from Ch. 3, §3.2):

   - conformance requirement C1: "A process shall not interpret a
   high-surrogate code point or a low-surrogate code point as an abstract
   character."

Note that this is important enough to the Unicode committee to make it the
very first conformance requirement!

GHC behaves as if Char were defined this way. It is an excellent choice, as
it makes Haskell's Char and String values extremely well behaved for
character data in a modern world. It is by far the best character data type
of any modern programming language in use.


*The Proposal*
Back to the proposal: PEP-383 is viable in Python because the "unicode" data
type isn't really Unicode.[1] Python's type is really either a broken UTF-16
or a broken UTF-32, depending on how the interpreter was compiled on your
system. PEP-383 only muddies the type further, relying on a non-standard
encoding (UTF-8b). In Python they can almost get away with this because
Python users generally must be keenly aware of encoding issues and have to
use the encoding machinery when handling strings. The PEP extends the
encoders to handle the strange values.
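
For concreteness, the PEP-383 scheme smuggles each undecodable byte through
as a lone low surrogate. Roughly, in Haskell terms (names are mine, and note
that GHC's chr will happily construct these values, which is exactly the
hole the scheme exploits):

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- PEP-383 style escape: an undecodable byte b (always >= 0x80 under
    -- an ASCII-compatible encoding) is carried as the lone low surrogate
    -- U+DC00 + b, landing in U+DC80..U+DCFF.
    surrogateEscape :: Word8 -> Char
    surrogateEscape b = chr (0xDC00 + fromIntegral b)

    -- The inverse, applied at encode time to recover the original byte.
    surrogateUnescape :: Char -> Maybe Word8
    surrogateUnescape c
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c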

Introducing "surrogate escape" values into GHC's Char and String data types
would significantly undermine their integrity. For example, any code that
wanted to use these strings with any other Unicode-conformant process
(whether in Haskell, in libraries, or external to the process) would be
unable to do so. Imagine, for example, a build tool that manages large
collections of file paths by storing them temporarily in an SQLite
database. Strings with "surrogate escapes" would fail.
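
As a small demonstration of that failure mode, consider what happens if
such a string merely passes through the text package (assuming its
documented behavior of replacing invalid scalar values): the smuggled byte
is silently destroyed before it ever reaches the database.

    import qualified Data.Text as T

    -- Data.Text.pack replaces surrogate code points with U+FFFD, so a
    -- PEP-383 escaped byte is silently lost in transit.
    escapeLost :: Bool
    escapeLost = T.unpack (T.pack "caf\xDCE9") == "caf\xFFFD"  -- True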

If we want to round-trip characters that don't decode under the inferred
encoding, then we should use the private use area instead. In particular,
I'd suggest F700 through F7FF (Apple uses some code points in the F800
through F8FF range). It seems highly unlikely that any non-Unicode encoding
for a file system would use these values, so no collision is possible. If
this approach were taken with a file system that does have a Unicode
encoding, and used to round-trip paths containing illegal encoding
sequences, then those sequences would not round-trip, but I don't think
that is a realistic situation to worry about.
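
A sketch of that private-use-area mapping (again, hypothetical names):

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- An undecodable byte b is carried as the private-use code point
    -- U+F700 + b; every such value is a legitimate Unicode scalar value.
    puaEscape :: Word8 -> Char
    puaEscape b = chr (0xF700 + fromIntegral b)

    -- The inverse, applied at encode time to recover the original byte.
    puaUnescape :: Char -> Maybe Word8
    puaUnescape c
      | n >= 0xF700 && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
      | otherwise                  = Nothing
      where n = ord c

Unlike the surrogate-escape scheme, these values pass through conformant
encoders and decoders unharmed.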

Lastly, I'm curious how the proposed code infers the encoding from the
locale. Is that OS-dependent? I don't think the concept of locale in POSIX
actually includes encoding information explicitly.

- Mark

[1] See, for example, the library for manipulating Unicode strings that I
built for the IETF PRECIS working group: http://hg.secondlife.com/newprep ,
specifically the file newprep/codepoint.py