Hi,

As you may know, I've been working on improving GHC's support for Unicode. In particular, I have been trying to achieve the following:

  1. Use the locale encoding to decode command line arguments, environment variables and file names from e.g. the System.Directory functions

  2. Implement FFI-specification behaviour for the Foreign.C.String.*CString functions, where they use the locale encoding to interpret the byte sequences they marshal

The problem with doing 1. as it stands is that command line arguments and environment variables often contain file names, and on Unix file names should be treated as byte sequences -- i.e. they do not necessarily have an interpretation in the locale encoding. Thus, if we started doing 1. unilaterally today, Haskell programs would die with an exception when supplied a file name (byte sequence) not valid in the locale encoding. This is upsetting.

I have implemented a solution to this problem using the surrogate-bytes mechanism of Python's PEP-383 [1] and UTF-8b [2]. The idea is that:

  a) When decoding a byte sequence to a String (which in GHC is a list of Char values, each holding a single Unicode codepoint), any bytes in the input which are undecodable are represented in the String as a Unicode codepoint in the range 0x00 to 0x7F (for bytes < 128) or 0xDC80 to 0xDCFF (for other bytes). This is a so-called surrogate escape mechanism because the codepoints in the latter range lie in the region that Unicode reserves for UTF-16 surrogates.

  b) When encoding a String back to a byte sequence, any value in the range 0xDC80 to 0xDCFF is just turned back into a byte in the output in the range 128-255 inclusive.

As long as the locale encoding represents ASCII characters as single bytes with the normal values, this scheme ensures that arbitrary byte sequences can be roundtripped through Strings with no loss of information. This property allows us to implement point 1. above with no fear that Haskell programs will fail if filenames happen to be uninterpretable in the locale encoding.
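
As a minimal sketch of the per-byte mapping described in a) and b): the helper names escapeByte and unescapeChar below are hypothetical, chosen only for illustration, and are not taken from the patches. The sketch covers only the escaping of a single undecodable byte, not a full decoder.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical helper: the Char that stands in for one undecodable byte.
escapeByte :: Word8 -> Char
escapeByte b
  | b < 128   = chr (fromIntegral b)           -- bytes < 128 are passed through, never escaped
  | otherwise = chr (0xDC00 + fromIntegral b)  -- 0x80..0xFF map to 0xDC80..0xDCFF

-- | Hypothetical helper: recover the byte from an escape codepoint, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xDC80 <= n && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where
    n = ord c

main :: IO ()
main = do
  let bytes = [0x80, 0xE9, 0xFF] :: [Word8]
  -- Every byte >= 128 survives the trip through Char and back.
  print (mapM (unescapeChar . escapeByte) bytes == Just bytes)
```

Note that unescapeChar returns Nothing for ordinary characters, which an encoder would hand to the normal encoding path instead.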

The reason that bytes < 128 are not encoded as surrogates is that they may be security-sensitive, and so we wish to disallow "smuggling" them through the surrogate-bytes mechanism. In this I follow the recommendations of PEP-383.

The really nice things about the surrogate-escape mechanism as opposed to some other solutions to 1. that have been floated are that:

  *) We do not have to introduce new APIs taking [Word8] everywhere, or redefine FilePath as an abstract type. Almost all users will be blissfully unaware that the surrogate bytes mechanism even exists.

  *) If the world ever standardises on a single encoding (i.e. UTF-8) for file names, terminals etc. then we can make the surrogate bytes error handler a no-op and all user code will (in the overwhelming majority of cases) continue to work perfectly, with no APIs based around [Word8] hanging around for legacy reasons. Surrogate escapes are *only used* if you attempt to interpret bytes from one encoding in the wrong encoding, so if there is only one encoding in use, there are no surrogate bytes.

Patches
=======

The relevant changes have been validated on both Windows and OS X, and are contained in the encoding branch of the following GHC repos:

  .
  utils/hsc2hs
  utils/haddock
  libraries/base
  libraries/bytestring
  libraries/Cabal
  libraries/directory
  libraries/haskeline
  libraries/unix
  libraries/Win32
  testsuite

I have submitted patches upstream to Cabal [3], Haskeline [4] and bytestring (personal "darcs send" to Don Stewart). These patches can be applied independently of any of my other changes being accepted, and generally serve to improve Unicode compatibility for those libraries by using existing mechanisms (e.g. the *W APIs on Windows). All other patches are for repos "owned" by GHC-HQ and I expect them to be applied simultaneously should they be accepted.

These changes fix (and test) tickets #5061, #1414, #3309, #3308, #3307, #4006, #4855 [5], and fix some other latent bugs that I discovered on the way (e.g.
Unicode environment blocks are not supported by the "process" library on Windows).

Major API changes
=================

The decoder/encoder in GHC's TextEncoding now gets an additional field: "recover". This holds a function that specifies how invalid-sequence errors are handled. The default is still that invalid sequences will cause exceptions, but you are also able to create TextEncodings with different values for this field, like so:

  mkTextEncoding "UTF-8"            -- An encoding that throws an exception upon invalid sequence
  mkTextEncoding "UTF-8//IGNORE"    -- An encoding that ignores invalid sequences
  mkTextEncoding "UTF-8//TRANSLIT"  -- An encoding that replaces invalid sequences with a substitution showing the error
  mkTextEncoding "UTF-8//SURROGATE" -- An encoding that uses the surrogate escape mechanism to roundtrip arbitrary byte sequences

GHC.IO.Encoding also exports these two new predefined TextEncodings:

  1. fileSystemEncoding. This is the locale encoding with the surrogate-escape mechanism used for handling errors. This encoding is used to implement System.Posix.Internals.withFilePath and related functions that marshal byte sequences that are likely to include file names.

  2. foreignEncoding. This is the locale encoding with the transliteration error handling mechanism, used by Foreign.C.String.*CString to implement FFI-spec behaviour.

The other change to the TextEncoding type is that "encode" now returns a reason as to why encoding stopped, which is either because the input or output buffers underflowed or because "encode" encountered an invalid sequence in the input.

The other major change is a new module, GHC.Foreign, that exports generalised versions of the Foreign.C.String.*CString functions that allow you to interpret byte sequences in any TextEncoding, rather than just the locale encoding.
These can easily be used to implement functions to encode Strings into [Word8] and vice-versa -- for example:

{{{
import Control.Exception (IOException, catch)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import GHC.Foreign (peekCStringLen, withCStringLen)
import GHC.IO.Encoding (TextEncoding)

-- On failure, returns the error text.
decode :: TextEncoding -> [Word8] -> IO String
decode enc xs = withArrayLen xs (\sz p -> peekCStringLen enc (castPtr p, sz))
                  `catch` \e -> return (show (e :: IOException))

-- On failure, returns [].
encode :: TextEncoding -> String -> IO [Word8]
encode enc cs = withCStringLen enc cs (\(p, sz) -> peekArray sz (castPtr p))
                  `catch` \e -> return (const [] (e :: IOException))
}}}

Next steps
==========

I'd like to get feedback on my patches, in particular as to whether they are suitable for merging into GHC's master branch. Please comment away!

Cheers,
Max

[1] http://www.python.org/dev/peps/pep-0383/
[2] http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
[3] http://hackage.haskell.org/trac/hackage/ticket/830
[4] http://trac.haskell.org/haskeline/ticket/113
[5] http://hackage.haskell.org/trac/ghc/wiki/Status/Encoding-Tickets

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc