On 15 May 2011 18:08, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> other hand, Haskell software generally does presume valid Unicode, and the
> broken surrogates will break things, for example the Text package. PUA
> characters will work with all Haskell software.

This is a key point - I wonder whether you have in mind a particular
bit of code using the "text" package that will fail if we use lone
surrogates as escapes?

I definitely sympathise with your arguments, but there is also a
practical argument against using the private-use area. Currently, I
rely on the fact that iconv will signal an error if I try to encode a
lone surrogate, so that I can spot it and encode it as the appropriate
raw byte instead. However, if we escape using the private-use area
then iconv will encode the escapes without error, which doesn't give
me the chance to replace them with the correct raw byte.
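
To make the difference concrete, here is a minimal sketch in Python rather than GHC code. It assumes Python's strict UTF-8 encoder as a stand-in for iconv, and U+EF80 as a purely hypothetical private-use escape for the byte 0x80 (that code point is not taken from this thread). The lone surrogate is rejected, which gives the caller a hook; the private-use character goes through without any error.

```python
# Stand-in for iconv: Python's strict UTF-8 encoder (an assumption for illustration).
LONE_SURROGATE = "\udc80"  # escape for raw byte 0x80 under the surrogate scheme
PRIVATE_USE = "\uef80"     # hypothetical escape for the same byte under a PUA scheme


def strict_encode(text):
    """Return (encoded bytes, None) on success, or (None, index of the bad char)."""
    try:
        return text.encode("utf-8"), None
    except UnicodeEncodeError as err:
        # The encoder reports where it stopped, so the caller could emit
        # the raw byte for the escape at that position and carry on.
        return None, err.start


surrogate_result = strict_encode("a" + LONE_SURROGATE)
pua_result = strict_encode("a" + PRIVATE_USE)
```

With the surrogate escape the caller learns the offending position; with the private-use escape the encoder just emits the ordinary three-byte UTF-8 form, so the escape is never noticed.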

There are two ways around this that I can see:

 1. Use the private-use characters for the Char data type, but ensure
that all of GHC's internal buffers (which are basically Char arrays)
used for text encoding represent all roundtripping private-use
characters as lone surrogates instead. Encoding these will still be an
error, so we can get control over iconv again. We will need to be
careful to turn these lone surrogates into private-use characters when
extracting a String from the Char array, and conversely we need to map
private-use characters to lone surrogates when building a Char array
from a String.

This option still uses lone surrogates but the user of the String data
type will never see them.

 2. In the iconv encoder, before calling iconv proper, do a pre-pass
that replaces all private-use characters with lone surrogates in the
Char array. Replace them with private-use characters before returning
from the encoder.
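
Both options come down to the same pair of mappings between private-use escapes and lone surrogates; they differ only in where the mapping is applied (at the String/buffer boundary for option 1, around the iconv call for option 2). A rough sketch, again in Python rather than GHC code: the base 0xEF00 is an assumed private-use base, the helper names are hypothetical, and Python's "surrogateescape" error handler stands in for the code that replaces a rejected lone surrogate with its raw byte.

```python
# Hypothetical escape ranges: one code point per undecodable byte 0x80..0xFF.
PUA_BASE = 0xEF00        # assumed private-use base, not taken from the thread
SURROGATE_BASE = 0xDC00  # lone low surrogates U+DC80..U+DCFF


def _shift(text, src_base, dst_base):
    """Move escape characters from one range to the other, leave the rest alone."""
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        out.append(chr(dst_base + offset) if 0x80 <= offset <= 0xFF else ch)
    return "".join(out)


def to_buffer(s):
    """User-visible string -> buffer form (private-use escapes become lone surrogates)."""
    return _shift(s, PUA_BASE, SURROGATE_BASE)


def from_buffer(buf):
    """Buffer form -> user-visible string (lone surrogates become private-use escapes)."""
    return _shift(buf, SURROGATE_BASE, PUA_BASE)


def encode_roundtrip(s):
    """Encode a user-visible string, turning each escape back into its raw byte."""
    return to_buffer(s).encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9"  # not valid UTF-8: 0xE9 is a stray byte
buf = raw.decode("utf-8", errors="surrogateescape")
visible = from_buffer(buf)
```

Here `visible` contains no surrogates, and `encode_roundtrip(visible)` gives back the original bytes. The extra pass over every character in `_shift` is the cost that option 2 would pay on each call.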

Option 1 is bad because it imposes a new complex global invariant, but
will probably have minimal performance impact. Option 2 is more
localised but will kill performance.

Need to think about this one...

Max
