On 12/04/2011 22:04, Max Bolingbroke wrote:
Hi Mark,
Thanks for your detailed response.
(A minor point: I think your definition D10, rather than D76, is
closest to what GHC implements as Char, since you can for example
evaluate (length "\xD800") with no complaints - this comes back to
Bryan's earlier reply to this thread. Of course, you can very well
argue that D76 is a better choice and perhaps what the Haskell
standard intended.)
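(For reference, the observation above is easy to check directly; a minimal sketch, which any recent GHC should accept, since Char ranges over all code points including lone surrogates:)

```haskell
-- GHC's Char covers U+0000..U+10FFFF, including lone surrogates such as
-- U+D800, so this string is accepted and evaluates without complaint:
main :: IO ()
main = print (length "\xD800")  -- prints 1
```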
On 12 April 2011 19:18, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
Indeed, POSIX has made a mess of things, hasn't it?
To introduce "surrogate escape" values into GHC's Char and String data types
would significantly undermine their integrity. For example, any code that
wanted to use these strings with any other Unicode-conformant process
(whether in Haskell, in libraries, or external to the process) would be
unable to do so. Imagine, for example, a build tool that manages large
collections of file paths by storing them temporarily in an SQLite
database. Strings with "surrogate escapes" would fail.
It is true that if a String contains a surrogate escape (say, 0xDC80) then:
1. Encoding it with, say, mkTextEncoding "UTF-8//SURROGATE" will
result in a byte sequence containing 0x80
2. Encoding it with mkTextEncoding "UTF-8" will result in an error
This is definitely unfortunate because, in addition to the problem you
describe, it means that e.g. printing Strings to a UTF-8-encoded
terminal may cause an exception to be raised.
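(The exception in case 2 can be demonstrated with the strict utf8 encoding that ships with GHC; a sketch, with buffering disabled so the failure surfaces at the write itself rather than at a later flush:)

```haskell
import Control.Exception (SomeException, try)
import System.IO

main :: IO ()
main = do
  hSetEncoding stdout utf8          -- strict UTF-8: rejects lone surrogates
  hSetBuffering stdout NoBuffering  -- fail at the write, not at a later flush
  r <- try (putStr "\xDC80") :: IO (Either SomeException ())
  putStrLn $ case r of
    Left _  -> "encoding error, as described"
    Right _ -> "unexpectedly succeeded"
```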
If we want to round-trip characters that don't decode using the inferred
encoding, then we should use the Private Use Area. In particular, I'd
suggest U+F700 through U+F7FF (Apple uses some code points in the U+F800
through U+F8FF range).
So to summarise, your proposal is to:
1. Use 0xF700 to 0xF7FF instead
2. When encoding these private-use Chars, do *not* throw an
exception, but instead simply encode them as you would any other
Unicode code point
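(The summary above can be sketched as a pair of pure helpers; the names are hypothetical, and the real implementation would live in GHC's I/O encoding machinery rather than user code. A byte that fails to decode maps into U+F700..U+F7FF, and a Char in that range maps back to its original byte on encode:)

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: a byte that fails to decode becomes a Char in the
-- Private Use Area range U+F700..U+F7FF.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xF700 + fromIntegral b)

-- Hypothetical helper: on encoding, Chars in that range turn back into the
-- original byte; everything else is encoded normally (hence Nothing).
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xF700 && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
  | otherwise                  = Nothing
  where n = ord c

main :: IO ()
main = do
  print (unescapeChar (escapeByte 0x80))  -- Just 128: the byte round-trips
  print (unescapeChar 'a')                -- Nothing: ordinary Chars untouched
```

The scheme relies on U+F700..U+F7FF being a Private Use Area, so no legitimately decoded text produces those code points; the caveat, noted above, is clashing with other private uses such as Apple's nearby range.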
This gives the benefit that we do not bake in Char behaviour that
explicitly contravenes your D76, thus leading to better
interoperability with other Unicode implementations.
I would be perfectly happy with this modified proposal. The only
possible issue I can see is that it might arguably be less confusing
for the user of Haskell to get an exception (as currently happens),
rather than see the private-use characters occurring in their output.
But this is certainly arguable and I don't feel strongly about either
viewpoint, so in light of the benefits you outline I support the
change.
Suffice to say, this conversation is now over my head :-) So I defer to
you guys; I'm happy with whatever solution you come up with.
2. On Windows, it uses the current code page as returned by GetACP
[3]. However, the locale encoding is not really relevant on Windows
since wherever possible we go via the Windows wide API, which
explicitly uses UTF-16, and use of the code page mechanism is
deprecated.
The notable exceptions are console and file output, which are probably
also the most visible aspects. For console output at least it would be
nice if we used the wide console API (indeed, it would be really nice if
we didn't go through the horrible msvcrt/mingw layers for I/O at all...)
Cheers,
Simon
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc