Indeed, POSIX has made a mess of things, hasn't it? That said, I don't think applying PEP-383 here would make things better for Haskell. Please bear with some background first:
*Background*

Haskell 98 and Haskell 2010 both define the type Char this way:

> The character type Char is an enumeration whose values represent
> Unicode characters.

The footnote references the Unicode standard. However, Unicode doesn't formally define a concept called "Unicode character". It has several definitions that might seem like obvious interpretations, including (from Ch. 3, §3.4):

- definition D7: *Abstract character* — "A unit of information used for the organization, control, or representation of textual data."
- definition D10: *Code point* — any integer in the range 0 to 10FFFF₁₆, though not all of these are characters, and some are explicitly "noncharacters"
- definition D11: *Encoded character* — an abstract character assigned to a code point

But none of these quite matches either what the standard seems to have intended or what GHC actually implements. For that, there is quite a good definition in Unicode (from Ch. 3, §3.9):

- definition D76: *Unicode scalar value* — "Any Unicode code point except high-surrogate and low-surrogate code points." Equivalently, any integer in the range 0 to D7FF₁₆ or E000₁₆ to 10FFFF₁₆.

This choice for Char fits well because it is immune to the specifics of abstract character assignment (past or future), it can be implemented efficiently, and it is precisely the set of values that can be encoded to and decoded from any Unicode encoding form (see definition D79). (A small predicate capturing D76 is sketched below.) It has one other property, which seems crucial (from Ch. 3, §3.2):

- conformance requirement C1: "A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character."

Note that this is important enough to the Unicode committee to be the very first conformance requirement!

GHC behaves as if Char were defined this way. It is an excellent choice, as it makes Haskell's Char and String values extremely well behaved for character data in a modern world. It is by far the best character data type of any modern programming language in use.

*The Proposal*

Back to the proposal: PEP-383 is viable in Python because the "unicode" data type isn't really Unicode.[1] Python's type is really either a broken UTF-16 or a broken UTF-32, depending on how the interpreter was compiled on your system. PEP-383 only muddies the type further, relying on a non-standard encoding (UTF-8b). Python can almost get away with this because Python users generally have to be keenly aware of encoding issues already and must work through the encoding machinery whenever they handle strings; the PEP simply extends the encoders to handle the strange values. (For concreteness, the byte-to-surrogate mapping is sketched below.)

To introduce "surrogate escape" values into GHC's Char and String data types would significantly undermine their integrity. For example, any code that wanted to use these strings with any other Unicode-conformant process (whether in Haskell, in libraries, or external to the process) would be unable to do so. Imagine, for example, a build tool that manages large collections of file paths by storing them temporarily in an SQLite database. Strings with "surrogate escapes" would fail.

If we want to round-trip characters that don't decode under the inferred encoding, then we should use the private use area instead. In particular, I'd suggest F700₁₆ through F7FF₁₆ (Apple uses some code points in the F800₁₆ through F8FF₁₆ range). It seems highly unlikely that any non-Unicode encoding for a file system would use these values, and hence no collision is possible. (A sketch of this mapping also appears below.)
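As promised, D76 is easy to capture in code. A minimal sketch (the name isScalarValue is mine, for illustration; it isn't from any library):

    import Data.Char (ord)

    -- D76: a Unicode scalar value is any code point outside the
    -- surrogate block D800..DFFF. Char's range already ends at
    -- 10FFFF, so only the surrogate gap needs excluding.
    isScalarValue :: Char -> Bool
    isScalarValue c = n < 0xD800 || n > 0xDFFF
      where n = ord c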
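And here is roughly what the PEP-383 scheme amounts to, as I read the PEP: an undecodable byte b in 80₁₆..FF₁₆ is smuggled through as the lone low surrogate DC00₁₆ + b. The function names here are mine, purely for illustration:

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- PEP-383 ("surrogateescape"): represent an undecodable byte as
    -- a lone low surrogate in DC80..DCFF. GHC's chr accepts these
    -- code points, even though they are not scalar values.
    escapeAsSurrogate :: Word8 -> Char
    escapeAsSurrogate b = chr (0xDC00 + fromIntegral b)

    -- The inverse, applied when re-encoding.
    unescapeSurrogate :: Char -> Maybe Word8
    unescapeSurrogate c
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c

Every Char produced this way runs afoul of C1 as soon as any conformant process interprets it, which is exactly the objection above.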
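The private-use alternative would look almost identical, except the escaped values are ordinary scalar values, so the resulting Strings remain well formed. The exact byte-to-code-point mapping below is my assumption, and the names are again only illustrative:

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- Map an undecodable byte into the suggested private use block
    -- F700..F7FF. These are legal scalar values per D76.
    escapeToPUA :: Word8 -> Char
    escapeToPUA b = chr (0xF700 + fromIntegral b)

    unescapeFromPUA :: Char -> Maybe Word8
    unescapeFromPUA c
      | n >= 0xF700 && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
      | otherwise                  = Nothing
      where n = ord c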
If this approach were taken with file systems that do have a Unicode encoding, and used to round-trip paths containing illegal encoding sequences, then those sequences would not round-trip - but I don't think that is a realistic situation to worry about.

Lastly, I'm curious how the proposed code infers the encoding from the locale. Is that OS dependent? I don't think the concept of locale in POSIX includes encoding information explicitly.

- Mark

[1] See, for example, the library for manipulating Unicode strings that I built for the IETF precis working group: http://hg.secondlife.com/newprep , specifically the file newprep/codepoint.py
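P.S. One quick way to see what GHC inferred on a given system: base exposes the locale-derived encoding as a TextEncoding (System.IO.localeEncoding, available since GHC 6.12), and its Show instance prints the encoding's name. A minimal probe:

    import System.IO (localeEncoding)

    main :: IO ()
    main = do
      -- Prints the name of the encoding GHC derived from the
      -- process locale, e.g. "UTF-8" under an en_US.UTF-8 locale.
      print localeEncoding

On POSIX, nl_langinfo(CODESET) does report a codeset name for the current locale, so presumably that is the source - but I haven't verified how the proposed code does it.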