Re: [PATCH] Better encoding/decoding for GHC

Mark Lentczner Mon, 18 Apr 2011 13:46:40 -0700

>
> (A minor point: I think your definition D10, rather than D76, is closest to
> what GHC implements as Char, since you can for example evaluate (length
> "\xD800") with no complaints


Yikes - I thought earlier versions of GHC wouldn't evaluate "\xD800". So you
are right - GHC seems to be D10, but yes, I do believe it would be best if
Haskell (and GHC) defined Char in terms of D76.
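
For readers following along, the D10/D76 distinction can be sketched in a few lines. This is only an illustration, not code from the patch: D10 is any code point, while D76 (a Unicode scalar value) excludes the surrogates U+D800 to U+DFFF. The name isScalarValue is hypothetical.

```haskell
import Data.Char (ord)

-- | True when the Char is a Unicode scalar value (D76), i.e. any code
-- point except the surrogates U+D800..U+DFFF.
-- Hypothetical helper name, used here only for illustration.
isScalarValue :: Char -> Bool
isScalarValue c = n < 0xD800 || n > 0xDFFF
  where
    n = ord c

-- | A String is "D76-clean" when every Char in it is a scalar value.
allScalarValues :: String -> Bool
allScalarValues = all isScalarValue
```

With this, allScalarValues "\xD800" is False even though GHC accepts the literal as a String of length 1.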

> So to summarise, your proposal is to:

I want to make sure that we all agree on the "stance" the code should take:

   1. The system infers, to the best it can, the encoding used for file
   paths. This encoding might be wrong, though on modern systems, if it is
   inferred as a Unicode encoding, it is almost certainly right. Nonetheless,
   there is no guarantee that file paths are valid encodings.
   2. The system presents to user code file paths that were valid encodings
   as valid Strings, and user code can present such Strings back with perfect
   round-trip fidelity.
   3. The system presents to user code file paths that are not valid
   encodings as valid Strings, by mapping the invalid encodings onto the
   private use area U+F700 to U+F7FF. These will of course be indistinguishable
   from valid file paths that contained such characters (only possible if the
   encoding is a Unicode encoding), and thus are not round-trippable.
   4. If user code presents file paths as Strings that do not encode into
   the inferred encoding, an exception is thrown. This includes the case
   where the inferred encoding cannot encode the private use area. When the
   inferred encoding is a Unicode encoding (UTF-*), the private use
   characters will be encoded normally (and thus will not reproduce the
   original bytes if they were generated from an illegally encoded file
   path).
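
To make the four points concrete, here is a minimal sketch under loud assumptions: the inferred encoding is plain ASCII, an invalid byte b is mapped to U+F700 + b (the thread only gives the range, not the exact mapping), and failure is modelled with Either rather than a real exception. The names decodePath and encodePath are hypothetical and are not the functions in the patch.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Points 2 and 3: valid bytes decode to themselves; invalid bytes are
-- escaped into the private use area.
-- Assumption: the inferred encoding is ASCII and byte b maps to U+F700 + b.
decodePath :: [Word8] -> String
decodePath = map decodeByte
  where
    decodeByte b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = chr (0xF700 + fromIntegral b)

-- Point 4: a String that does not encode in the inferred encoding is
-- rejected. ASCII cannot encode the private use area, so escaped
-- characters are rejected too (Left stands in for the exception).
encodePath :: String -> Either String [Word8]
encodePath = mapM encodeChar
  where
    encodeChar c
      | ord c < 0x80 = Right (fromIntegral (ord c))
      | otherwise    = Left ("cannot encode " ++ show c ++ " as ASCII")
```

Under these assumptions a valid path such as the bytes of "foo" round-trips, while a path containing the byte 0xFF decodes to a String containing U+F7FF and then fails to encode.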

The crux of the issue is the handling in #4. If we believe our inferred
encoding is generally right, and that invalid encodings are rare to
non-existent (and perhaps indicative of bigger problems on the whole) - then
the behaviour as stated above is the way to go.

> Lastly, I'm curious how the proposed code infers the encoding from
> the locale.
> This code already exists in GHC. The behaviour at the moment is platform
> dependent and as follows:
>
Thanks for those details! It looks good to me.

- Mark

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
