[PATCH] Better encoding/decoding for GHC

Max Bolingbroke Tue, 12 Apr 2011 05:05:56 -0700

Hi,

As you may know, I've been working on improving GHC's support for
Unicode. In particular, I have been trying to achieve the following:


  1. Use the locale encoding to decode command line arguments,
environment variables and file names from e.g. the System.Directory
functions

  2. Implement FFI-specification behaviour for the
Foreign.C.String.*CString functions, where they use the locale
encoding to interpret the byte sequences they marshal

The problem with doing 1. as it stands is that command line arguments
and environment variables often contain file names, and on Unix file
names should be treated as byte sequences -- i.e. they do not
necessarily have an interpretation in the locale encoding. Thus, if we
started doing 1. unilaterally today Haskell programs would die with an
exception when supplied a file name (byte sequence) not valid in the
locale encoding. This is upsetting.

I have implemented a solution to this problem using the
surrogate-bytes mechanism of Python's PEP-383 [1] and UTF-8b [2]. The
idea is that:

  a) When decoding a byte sequence to a String (which in GHC is a
list of Chars, each holding a single Unicode code point), any bytes in
the input which are undecodable are represented in the String as a
Unicode codepoint in the range 0x00 to 0x7F (for bytes < 128) or
0xDC80 to 0xDCFF (for other bytes). This is a so-called surrogate
escape mechanism because the latter codepoints are in the UTF-16
reserved region of surrogate codepoints.

  b) When encoding a String back to a byte sequence, any value in the
range 0xDC80 to 0xDCFF is just turned back into a byte in the output
in the range 128-255 inclusive.
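
The mapping in a) and b) can be sketched as a pair of pure functions.
This is only an illustration of the arithmetic, not the code in the
patches; the names escapeByte and unescapeChar are made up for this
example.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Represent one undecodable byte as a Char (hypothetical helper).
escapeByte :: Word8 -> Char
escapeByte b
  | b < 128   = chr (fromIntegral b)           -- ASCII bytes are passed through as-is
  | otherwise = chr (0xDC00 + fromIntegral b)  -- 0x80..0xFF -> U+DC80..U+DCFF

-- | Recover the original byte from an escape Char, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where
    n = ord c
```

Every byte in the range 128-255 maps to a distinct escape codepoint,
which is what makes step b) an exact inverse of step a) for those
bytes.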

As long as the locale encoding represents ASCII characters as single
bytes with the normal values, this scheme ensures that arbitrary byte
sequences can be roundtripped through Strings with no loss of
information. This property allows us to implement point 1. above with
no fear that Haskell programs will fail if filenames happen to be
uninterpretable in the locale encoding.

The reason that bytes < 128 are not encoded as surrogates is that they
may be security-sensitive, and so we wish to disallow "smuggling" them
through the surrogate-bytes mechanism. In this I follow the
recommendations of PEP-383.

The really nice things about the surrogate-escape mechanism as opposed
to some other solutions to 1. that have been floated are that:

  *) We do not have to introduce new APIs taking [Word8] everywhere,
or redefine FilePath as an abstract type. Almost all users will be
blissfully unaware that the surrogate bytes mechanism even exists.

  *) If the world ever standardises on a single encoding (i.e. UTF-8)
for file names, terminals etc then we can make the surrogate bytes error
handler a no-op and all user code will (in the overwhelming majority of
cases) continue to work perfectly, with no APIs based around [Word8]
hanging around for legacy reasons. Surrogate escapes are *only used*
if you attempt to interpret bytes from one encoding in the wrong
encoding, so if there is only one encoding in use, there are no
surrogate bytes.

Patches
======

The relevant changes have been validated on both Windows and OS X, and
are contained in the encoding branch of the following GHC repos:

. utils/hsc2hs utils/haddock libraries/base libraries/bytestring
libraries/Cabal libraries/directory libraries/haskeline libraries/unix
libraries/Win32 testsuite.

I have submitted patches upstream to Cabal [3], Haskeline [4] and
bytestring (personal "darcs send" to Don Stewart). These patches can
be applied independently of any of my other changes being accepted,
and generally serve to improve unicode compatibility for those
libraries by using existing mechanisms (e.g. the *W APIs on Windows).
All other patches are for repos "owned" by GHC-HQ and I expect them to
be applied simultaneously should they be accepted.

These changes fix (and test) tickets #5061, #1414, #3309, #3308,
#3307, #4006, #4855 [5], and fix some other latent bugs that I
discovered on the way (e.g. unicode environment blocks are not
supported by the "process" library on Windows).

Major API changes
=============

The decoder/encoder in GHC's TextEncoding now gets an additional
field: "recover". This holds a function that specifies how
invalid-sequence errors are handled. The default is still that invalid
sequences will cause exceptions, but you are also able to create
TextEncodings with different values for this field, like so:

mkTextEncoding "UTF-8"            -- An encoding that throws an exception
                                  -- upon invalid sequence
mkTextEncoding "UTF-8//IGNORE"    -- An encoding that ignores invalid
                                  -- sequences
mkTextEncoding "UTF-8//TRANSLIT"  -- An encoding that replaces invalid
                                  -- sequences with a substitution
                                  -- showing the error
mkTextEncoding "UTF-8//SURROGATE" -- An encoding that uses the surrogate
                                  -- escape mechanism to roundtrip
                                  -- arbitrary byte sequences
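
As a rough illustration of how the suffix selects the error handling,
here is a small sketch that decodes a byte list with an encoding
looked up by name. decodeWith is a hypothetical helper, and the sketch
assumes the generalised peekCStringLen from GHC.Foreign described
further down. Only the IGNORE and TRANSLIT suffixes are exercised,
since (as far as I know) released versions of base spell the
round-tripping suffix differently from the name used in this mail.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)
import GHC.Foreign (peekCStringLen)
import System.IO (mkTextEncoding)

-- | Decode raw bytes using the encoding with the given name, e.g.
-- "UTF-8//IGNORE". Hypothetical helper, for illustration only.
decodeWith :: String -> [Word8] -> IO String
decodeWith name bytes = do
  enc <- mkTextEncoding name
  -- Marshal the bytes into a temporary buffer and decode it in place.
  withArrayLen bytes $ \n p -> peekCStringLen enc (castPtr p, n)
```

Calling decodeWith "UTF-8" on bytes that are not valid UTF-8 should
raise an exception, while the suffixed variants should return a String
with the offending bytes dropped or substituted.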

GHC.IO.Encoding also exports these two new predefined TextEncodings:

  1. fileSystemEncoding. This is the locale encoding with the
surrogate-escape mechanism used for handling errors. This encoding is
used to implement System.Posix.Internals.withFilePath and related
functions that marshal byte sequences that are likely to include file
names.

  2. foreignEncoding. This is the locale encoding with the
transliteration error handling mechanism, used by
Foreign.C.String.*CString to implement FFI-spec behaviour.
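
To make the round-trip guarantee of the file system encoding concrete,
here is a minimal sketch that decodes a raw file name and encodes it
again. It assumes the getter getFileSystemEncoding found in released
versions of GHC.IO.Encoding, rather than the fileSystemEncoding
constant named in this mail; roundTripFileName is a hypothetical name.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- | Decode raw file-name bytes to a String and encode them back,
-- both times with the file system encoding (hypothetical helper).
roundTripFileName :: [Word8] -> IO [Word8]
roundTripFileName raw = do
  enc  <- getFileSystemEncoding
  name <- withArrayLen raw $ \n p -> GHC.peekCStringLen enc (castPtr p, n)
  GHC.withCStringLen enc name $ \(p, n) -> peekArray n (castPtr p)
```

If the scheme works as described, the result should equal the input
even when the bytes are not valid in the locale encoding.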

The other change to the TextEncoding type is that "encode" now returns
a reason as to why encoding stopped, which is either because the input
or output buffers underflowed or "encode" encountered an invalid
sequence in the input.
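
The idea of returning a stop reason can be shown with a toy encoder.
This is a sketch under simplifying assumptions: it works on lists
rather than buffers, it encodes to Latin-1 only, and the type and
function names are invented for the example rather than taken from the
patches.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Why the toy encoder stopped (hypothetical type for illustration).
data StopReason
  = InputUnderflow   -- all input was consumed
  | OutputUnderflow  -- no room left in the output
  | InvalidSequence  -- the next Char cannot be encoded
  deriving (Eq, Show)

-- | Encode Chars as Latin-1 into at most 'room' bytes. Returns the
-- reason for stopping, the unconsumed input and the bytes produced.
encodeLatin1 :: Int -> String -> (StopReason, String, [Word8])
encodeLatin1 room = go room []
  where
    go _ acc []      = (InputUnderflow, [], reverse acc)
    go 0 acc rest    = (OutputUnderflow, rest, reverse acc)
    go n acc rest@(c : cs)
      | ord c > 0xFF = (InvalidSequence, rest, reverse acc)
      | otherwise    = go (n - 1) (fromIntegral (ord c) : acc) cs
```

A caller can inspect the reason to decide whether to supply more
input, flush the output, or invoke an error handler such as the
"recover" function on the remaining input.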

The other major change is a new module, GHC.Foreign, that exports
generalised versions of the Foreign.C.String.*CString functions that
allow you to interpret byte sequences in any TextEncoding, rather than
just the locale encoding. These can easily be used to implement
functions to encode Strings into [Word8] and vice-versa -- for
example:

{{{
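import Control.Exception (IOException, catch)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import GHC.Foreign (peekCStringLen, withCStringLen)
import GHC.IO.Encoding (TextEncoding)

decode :: TextEncoding -> [Word8] -> IO String
decode enc xs = withArrayLen xs (\sz p -> peekCStringLen enc (castPtr p, sz))
  `catch` \e -> return (show (e :: IOException)) -- on failure, show the error

encode :: TextEncoding -> String -> IO [Word8]
encode enc cs = withCStringLen enc cs (\(p, sz) -> peekArray sz (castPtr p))
  `catch` \e -> return (const [] (e :: IOException)) -- on failure, no bytes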
}}}

Next steps
========

I'd like to get feedback on my patches, in particular as to whether
they are suitable for merging into GHC's master branch. Please comment
away!

Cheers,
Max

[1] http://www.python.org/dev/peps/pep-0383/
[2] http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
[3] http://hackage.haskell.org/trac/hackage/ticket/830
[4] http://trac.haskell.org/haskeline/ticket/113
[5] http://hackage.haskell.org/trac/ghc/wiki/Status/Encoding-Tickets

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
