Hi Mark,

Thanks for your detailed response.

(A minor point: I think your definition D10, rather than D76, is
closest to what GHC implements as Char, since you can for example
evaluate (length "\xD800") with no complaints - this comes back to
Bryan's earlier reply to this thread. Of course, you can very well
argue that D76 is a better choice and perhaps what the Haskell
standard intended.)
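For reference, the D10 observation above can be checked directly in GHC: a lone surrogate is an ordinary Char value, and list functions are happy to walk over it.

```haskell
-- GHC's Char ranges over all code points 0x0..0x10FFFF, including the
-- surrogate block D800-DFFF, so a String holding a lone surrogate is
-- a perfectly legal value:
main :: IO ()
main = do
  print (length "\xD800")            -- the surrogate counts as one Char
  print (fromEnum (head "\xD800"))   -- its code point, 0xD800 = 55296
```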

On 12 April 2011 19:18, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> Indeed, POSIX has made a mess of things, hasn't it?
> To introduce "surrogate escape" values into GHC's Char and String data types
> would significantly undermine their integrity. For example, any code that
> wanted to use these strings with any other Unicode conformal process,
> (whether in Haskell, in libraries, or external to the process) would be
> unable to do so. Imagine, for example, a build tool that manages large
> collections of file paths by storing them temporarily in an SQLite
> database. Strings with "surrogate escapes" would fail.

It is true that if a String contains a surrogate escape (say, 0xDC80) then:

  1. Encoding it with, say, mkTextEncoding "UTF-8//SURROGATE" will
result in a byte sequence containing 0x80
  2. Encoding it with mkTextEncoding "UTF-8" will result in an error

This is definitely unfortunate because, as well as the problem you
describe, it means that e.g. printing Strings to a UTF-8 encoded
terminal may cause an exception to be raised.
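A minimal sketch of that exception behaviour, writing to a temporary file rather than a terminal (the file name and the "encoding error" message are mine, not part of any API):

```haskell
import Control.Exception (SomeException, try)
import System.Directory (removeFile)
import System.IO

main :: IO ()
main = do
  (path, h) <- openTempFile "." "surrogate-test.txt"
  hSetEncoding h utf8
  -- The strict utf8 encoder rejects lone surrogates, so encoding
  -- the buffer on flush should raise an error.
  r <- try (hPutStr h "\xDC80" >> hFlush h) :: IO (Either SomeException ())
  -- hClose may try to flush the bad buffer again, so guard it too.
  _ <- try (hClose h) :: IO (Either SomeException ())
  removeFile path
  putStrLn (either (const "encoding error") (const "no error") r)
```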

> If we want to round trip characters that don't decode using the inferred
> encoding, then we should use the private use area. In particular, I'd
> suggest F700 through F7FF (Apple uses some code points in the F800 through
> F8FF range).

So to summarise, your proposal is to:

  1. Use 0xF700 to 0xF7FF instead
  2. When encoding these private-use Chars, do *not* throw an
exception, but instead simply encode them as you would any other
Unicode code point

This gives the benefit that we do not bake in Char behaviour that
explicitly contravenes the D76 definition you cite, thus leading to
better interoperability with other Unicode implementations.
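To make the remapping concrete, here is a sketch of what the byte-to-private-use mapping might look like. The helper names escapeByte/unescapeChar are hypothetical, invented for illustration; the only thing taken from the proposal is the 0xF700-0xF7FF range.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: map an undecodable byte into the proposed
-- private-use range 0xF700-0xF7FF.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xF700 + fromIntegral b)

-- Hypothetical inverse: recover the original byte when re-encoding,
-- or Nothing for a Char outside the escape range.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xF700 <= n && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
  | otherwise                  = Nothing
  where n = ord c

main :: IO ()
main = do
  print (ord (escapeByte 0x80))   -- byte 0x80 lands at 0xF780 = 63360
  print (unescapeChar '\xF780')   -- round-trips back to byte 128
  print (unescapeChar 'a')        -- ordinary Chars are left alone
```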

I would be perfectly happy with this modified proposal. The only
possible issue I can see is that it might arguably be less confusing
for the user of Haskell to get an exception (as currently happens)
than to see private-use characters appearing in their output. But this
is certainly arguable and I don't feel strongly about either
viewpoint, so in light of the benefits you outline I support the
change.

> Lastly, I'm curious how the proposed code infers the encoding from the
> locale. Is that OS dependent? I don't think the concept of locale in POSIX
> actually includes encoding information explicitly.

This code already exists in GHC. The behaviour is currently
platform-dependent, as follows:

  1. On OS X/Linux it uses locale_charset() [1] or
nl_langinfo(CODESET) [2]. Typically the encoding method is specified
as a dot-separated suffix on the LANG/LC_* environment variables. That
said, on OS X we could probably just assume UTF-8 since I've never
seen any other setting, and the file system forces filenames to be
UTF-8 encoded -- but this is not part of my proposal.

  2. On Windows, it uses the current code page as returned by GetACP
[3]. However, the locale encoding is not really relevant on Windows
since wherever possible we go via the Windows wide API, which
explicitly uses UTF-16, and use of the code page mechanism is
deprecated.
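For the curious, the inferred encoding can be inspected from Haskell. (An assumption on my part: the getLocaleEncoding accessor below is the API exposed by later base versions, not necessarily what the GHC of this thread shipped.)

```haskell
import GHC.IO.Encoding (getLocaleEncoding, textEncodingName)

main :: IO ()
main = do
  -- Ask GHC which encoding it inferred from the locale,
  -- e.g. "UTF-8" on a typical Linux or OS X setup.
  enc <- getLocaleEncoding
  putStrLn (textEncodingName enc)
```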

Cheers,
Max

[1] http://www.haible.de/bruno/packages-libcharset.html
[2] IEEE Std 1003.1, 2004
http://pubs.opengroup.org/onlinepubs/009695399/functions/nl_langinfo.html
and http://pubs.opengroup.org/onlinepubs/009695399/basedefs/langinfo.h.html
[3] http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
