On Wed, May 18, 2011 at 2:28 AM, Max Bolingbroke <batterseapo...@hotmail.com> wrote:
> > U+F1E00 ~ U+F1EFF -- for "Fie! we need to encode bad encodings!"
> >
> > We can (I'll be happy to) register this with the unofficial registry(2).
>

I've prepared a draft for the registry and submitted it.... Only to have it pointed out to me that the registry has a region reserved (rather than allocated) for precisely this use! (I missed it because the main pages only discuss the allocated ranges and don't mention the reserved ranges.)

The range is U+EF80 through U+EFFF, called "Reserved for encoding hacks". See *Roadmap to the ConScript Unicode Registry* <http://www.evertype.com/standards/csur/conscript-table.html>. John Cowan informs me that our use is precisely what this range has been reserved for.

This range is only 128 code points: its designers didn't anticipate needing to escape octets 0x00 through 0x7F. That holds so long as we restrict ourselves to encodings that are supersets of ASCII. If we want to be more general, we could use U+EF00 through U+EFFF and lobby for reserving the additional 128 points. I've already enquired about this possibility.

On a related note, if we want to be able to round-trip file names that contain proper UTF-8 encoded characters from this range, we can treat the byte sequences 0xEE 0xBE 0x80 through 0xEE 0xBF 0xBF as if they were encoding errors, and replace each octet of such a sequence with its encoding hack character.

In this way, *all* octet sequences are round-trippable, and all map to legal Unicode strings:

    41       -> U+0041                -- ASCII character
    CE B1    -> U+03B1                -- Greek character
    E0 A4 85 -> U+0905                -- Devanagari character
    C0       -> U+EFC0                -- illegal UTF-8 byte
    C2 20    -> U+EFC2 U+0020         -- malformed UTF-8 sequence
    C2 F0    -> U+EFC2 U+EFF0         -- malformed UTF-8 sequence
    EE BE 80 -> U+EFEE U+EFBE U+EF80  -- special handling of encoding hack character

- Mark
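
For illustration only, here is a minimal Python sketch of the scheme described above. It is not GHC's implementation. The names decode_roundtrip, encode_roundtrip and HACK_BASE are hypothetical. It assumes that each offending octet 0xNN (always >= 0x80 for an ASCII-superset encoding) is mapped to U+EFNN, and that "malformed" means whatever Python's strict UTF-8 decoder rejects.

```python
# Hypothetical sketch: escape undecodable octets into U+EF80..U+EFFF.
HACK_BASE = 0xEF00            # assumed mapping: octet 0xNN -> U+EFNN
HACK_LO, HACK_HI = 0xEF80, 0xEFFF


def _expected_len(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, or a byte that never starts a sequence


def decode_roundtrip(data: bytes) -> str:
    """Decode UTF-8, turning every undecodable octet into a hack character."""
    out = []
    i = 0
    while i < len(data):
        n = _expected_len(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("bad lead byte or truncated sequence")
            ch = chunk.decode("utf-8")  # strict; UnicodeDecodeError is a ValueError
            if HACK_LO <= ord(ch) <= HACK_HI:
                # A properly encoded hack character is treated as an error
                # too, so that it cannot collide with an escaped octet.
                raise ValueError("encoded hack character")
        except ValueError:
            out.append(chr(HACK_BASE + data[i]))  # escape one octet, then resync
            i += 1
        else:
            out.append(ch)
            i += n
    return "".join(out)


def encode_roundtrip(text: str) -> bytes:
    """Inverse of decode_roundtrip: hack characters become raw octets again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if HACK_LO <= cp <= HACK_HI:
            out.append(cp - HACK_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With this sketch, the octets C2 20 decode to U+EFC2 U+0020 and EE BE 80 decode to U+EFEE U+EFBE U+EF80, and encode_roundtrip(decode_roundtrip(b)) returns b for any byte string b. The guarantee runs in that direction only: a string that already contains hack characters may not survive an encode followed by a decode.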
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc