I'll push back... and apologize for perhaps making this all seem more
complicated than it probably is.... :-)

I think, all things considered, the use of private use area (PUA) characters
is far preferable. With the exception of small ranges used by Apple &
Microsoft, PUA characters exchanged in the wild are pretty rare in my
experience. We can make a collision with them even less likely by using
a known unused range. I've looked at relevant online sources (1,2) and
suggest:

U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!"

We can (I'll be happy to) register this with the unofficial registry (2).

The reason broken surrogates work for Python is that Python's strings aren't
really Unicode. Such values aren't likely to break Python software. On the
other hand, Haskell software generally does presume valid Unicode, and the
broken surrogates will break things, for example the Text package. PUA
characters will work with all Haskell software.
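
To make the distinction concrete, here is a small sketch using only base. The
helper names isSurrogate and isPrivateUse are mine, purely for illustration.
GHC's Char will hold a lone surrogate, but it is not a Unicode scalar value,
whereas a code point in the proposed range is an ordinary private use
character:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | True for code points U+D800 ~ U+DFFF, which are not Unicode scalar
-- values and so cannot appear in well-formed UTF-8/16/32 text.
-- (Hypothetical helper name, for illustration only.)
isSurrogate :: Char -> Bool
isSurrogate c = generalCategory c == Surrogate

-- | True for private use characters, which are valid scalar values that
-- any conforming Unicode library has to carry through unchanged.
-- (Hypothetical helper name, for illustration only.)
isPrivateUse :: Char -> Bool
isPrivateUse c = generalCategory c == PrivateUse
```

With these, isSurrogate '\xDC80' and isPrivateUse '\xF1E80' should both hold,
which is the whole argument in miniature.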

I'm fine with your decoding choice, which I think is: Decode U+F1E80 ~
U+F1EFF back to bytes 0x80 ~ 0xFF and throw an error on U+F1E00 ~ U+F1E7F.
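
A minimal sketch of how I read that scheme, assuming a byte b is escaped as
U+F1E00 + b, so that the 128 high bytes 0x80 ~ 0xFF land on U+F1E80 ~ U+F1EFF.
The names escapeBase, escapeByte, Unescaped and unescapeChar are made up for
this mail and are not taken from GHC:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Start of the proposed escape range U+F1E00 ~ U+F1EFF.
escapeBase :: Int
escapeBase = 0xF1E00

-- | Escape an undecodable byte as a private use character.
-- Assumption: only bytes 0x80 ~ 0xFF ever need this, since 0x00 ~ 0x7F
-- decode as plain ASCII.
escapeByte :: Word8 -> Char
escapeByte b = chr (escapeBase + fromIntegral b)

-- | What a single character means when turning text back into bytes.
data Unescaped
  = NotEscape          -- outside the escape range, encode normally
  | EscapedByte Word8  -- U+F1E80 ~ U+F1EFF, stands for one raw byte
  | InvalidEscape      -- U+F1E00 ~ U+F1E7F, reported as an error
  deriving (Eq, Show)

unescapeChar :: Char -> Unescaped
unescapeChar c
  | n < escapeBase || n > escapeBase + 0xFF = NotEscape
  | n < escapeBase + 0x80                   = InvalidEscape
  | otherwise = EscapedByte (fromIntegral (n - escapeBase))
  where
    n = ord c
```

Under these assumptions, unescapeChar (escapeByte b) gives back EscapedByte b
for every b from 0x80 to 0xFF, and anything in the lower half of the range is
flagged rather than silently turned into a byte.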

U+262E,
  Mark

1) http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters
2) http://www.evertype.com/standards/csur/index.html
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
