I'll push back... and apologize for perhaps making this all seem more complicated that it probably is.... :-)
I think, all things given, the use of private use area (PUA) characters is far preferable. With the exception of small ranges used by Apple & Microsoft, PUA characters exchanging in the wild are pretty rare in my experience. We can increase the unlikeliness of colliding with them by using an known unused range. I've looked at relevant on-line sources(1,2) and suggest: U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!" We can (I'll be happy to) register this with the unofficial registory(2). The reason broken surrogates work for Python is that Python's strings aren't really Unicode. Such values aren't likely to break Python software. On the other hand, Haskell software generally does presume valid Unicode, and the broken surrogates will break things, for example the Text package. PUA characters will work with all Haskell software. I'm fine with your decoding choice, which I think is: Decode U+F1E80 ~ U+F1EFF back to bytes 0x7F ~ 0xFF and throw an error on U+F1E00 ~ U+F1E7F. U+262E, Mark 1) http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters 2) http://www.evertype.com/standards/csur/index.html
_______________________________________________ Cvs-ghc mailing list Cvs-ghc@haskell.org http://www.haskell.org/mailman/listinfo/cvs-ghc