Hi Mark,

Thanks for your detailed response.
(A minor point: I think your definition D10, rather than D76, is closest to what GHC implements as Char, since you can, for example, evaluate (length "\xD800") with no complaints - this comes back to Bryan's earlier reply to this thread. Of course, you can very well argue that D76 is a better choice, and perhaps what the Haskell standard intended.)

On 12 April 2011 19:18, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> Indeed, POSIX has made a mess of things, hasn't it?
>
> To introduce "surrogate escape" values into GHC's Char and String data types
> would significantly undermine their integrity. For example, any code that
> wanted to use these strings with any other Unicode-conformant process
> (whether in Haskell, in libraries, or external to the process) would be
> unable to do so. Imagine, for example, a build tool that manages large
> collections of file paths by storing them temporarily in an SQLite
> database. Strings with "surrogate escapes" would fail.

It is true that if a String contains a surrogate escape (say, 0xDC80) then:

1. Encoding it with, say, mkTextEncoding "UTF-8//SURROGATE" will result in a byte sequence containing the byte 0x80
2. Encoding it with mkTextEncoding "UTF-8" will result in an error

This is definitely unfortunate because, as well as the problem you describe, it means that e.g. printing Strings to a UTF-8 encoded terminal may cause an exception to be raised.

> If we want to round trip characters that don't decode using the inferred
> encoding, then we should use the private use area. In particular, I'd
> suggest F700 through F7FF (Apple uses some code points in the F800 through
> F8FF range).

So, to summarise, your proposal is to:

1. Use 0xF700 to 0xF7FF instead
2.
When encoding these private-use Chars, do *not* throw an exception, but instead simply encode them as you would any other Unicode code point.

This gives us the benefit that we do not bake in Char behaviour that explicitly contravenes your D76, thus leading to better interoperability with other Unicode implementations.

I would be perfectly happy with this modified proposal. The only possible issue I can see is that it might arguably be less confusing for the user of Haskell to get an exception (as currently happens) rather than to see the private-use characters occurring in their output. But this is certainly arguable and I don't feel strongly about either viewpoint, so in light of the benefits you outline I support the change.

> Lastly, I'm curious how the proposed code infers the encoding from the
> locale. Is that OS dependent? I don't think the concept of locale in POSIX
> actually includes encoding information explicitly.

This code already exists in GHC. The behaviour at the moment is platform dependent, as follows:

1. On OS X/Linux it uses locale_charset() [1] or nl_langinfo(CODESET) [2]. Typically the encoding method is specified as a dot-separated suffix on the LANG/LC_* environment variables. That said, on OS X we could probably just assume UTF-8, since I've never seen any other setting and the file system forces filenames to be UTF-8 encoded -- but this is not part of my proposal.

2. On Windows, it uses the current code page as returned by GetACP [3]. However, the locale encoding is not really relevant on Windows, since wherever possible we go via the Windows wide API, which explicitly uses UTF-16, and use of the code page mechanism is deprecated.
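To make the private-use escaping concrete, here is a minimal sketch of the idea. The names escapeByte and unescapeChar are hypothetical helpers for illustration only, not part of any proposed API: an undecodable byte b is represented as the Char at 0xF700 + b, and mapped back to the original byte on re-encoding.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: represent an undecodable byte as a Char in the
-- private-use range 0xF700..0xF7FF, per the proposal under discussion.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xF700 + fromIntegral b)

-- Hypothetical helper: recover the original byte from such a Char, or
-- Nothing if the Char is an ordinary code point.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xF700 <= n && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
  | otherwise                  = Nothing
  where n = ord c

main :: IO ()
main = do
  -- A lone surrogate is nonetheless a valid GHC Char (the D10 reading):
  print (length "\xD800")
  -- Round-tripping an undecodable byte through the private-use area:
  print (unescapeChar (escapeByte 0x80))
```

Note that under this scheme the encoder needs no special error path: a private-use Char either maps back to its raw byte (with a roundtripping encoding) or is encoded like any other code point.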
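As a rough illustration of where the codeset comes from on POSIX systems, here is a simplified sketch. The real lookup goes through locale_charset()/nl_langinfo(CODESET) and the system's locale database; codesetFromLang is a hypothetical helper that only handles the common LANG=en_US.UTF-8 form.

```haskell
import System.Environment (lookupEnv)
import Data.Maybe (fromMaybe)

-- Hypothetical helper: extract the codeset from a LANG/LC_* value such
-- as "en_US.UTF-8". Real code must use nl_langinfo(CODESET) or
-- locale_charset() instead of parsing the variable directly.
codesetFromLang :: String -> String
codesetFromLang lang = case dropWhile (/= '.') lang of
  ('.':rest) -> takeWhile (/= '@') rest  -- drop any "@modifier" suffix
  _          -> "ASCII"                  -- no codeset given (e.g. the C locale)

main :: IO ()
main = do
  lang <- fromMaybe "C" <$> lookupEnv "LANG"
  putStrLn (codesetFromLang lang)
```

So for LANG=en_US.UTF-8 this yields "UTF-8", and for a bare LANG=C it falls back to ASCII.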
Cheers, Max

[1] http://www.haible.de/bruno/packages-libcharset.html
[2] IEEE Std 1003.1, 2004: http://pubs.opengroup.org/onlinepubs/009695399/functions/nl_langinfo.html and http://pubs.opengroup.org/onlinepubs/009695399/basedefs/langinfo.h.html
[3] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc