Re: [PATCH] Better encoding/decoding for GHC

Max Bolingbroke Wed, 11 May 2011 05:37:05 -0700

On 11 May 2011 00:40, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> That is why the Python approach hides these beasts in a non-legal part of
> the code space.


Naturally. The choice is clear. The escapes should use either:

 1. The surrogate code points, in which case we can roundtrip any
string but we might confuse Unicode processing passes on the Haskell
side, as they won't expect to see these code points
 2. Some region of the private-use plane, in which case we can't
roundtrip those rare strings that contain these private-use
characters, but we stand less chance of confusing software that
expects each Char to be a valid Unicode character
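
To make the trade-off concrete, here is a minimal Python 3 sketch of both
options. Option 1 uses Python's real "surrogateescape" error handler from
PEP 383. Option 2 is purely illustrative: the names PUA_BASE, "puaescape",
pua_decode and pua_encode are hypothetical, the base U+F700 is an arbitrary
assumption, and none of this is taken from the GHC patch.

```python
import codecs

raw = b"caf\xe9"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Option 1: escape undecodable bytes as lone surrogates (PEP 383).
escaped = raw.decode("utf-8", errors="surrogateescape")
surrogate_roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Option 2 (hypothetical scheme): escape into the private-use area.
PUA_BASE = 0xF700  # assumed base, chosen only for illustration


def _pua_escape(exc):
    """Decode-error handler: map each bad byte b to chr(PUA_BASE + b)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("puaescape", _pua_escape)


def pua_decode(data: bytes) -> str:
    return data.decode("utf-8", errors="puaescape")


def pua_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if PUA_BASE + 0x80 <= code <= PUA_BASE + 0xFF:
            out.append(code - PUA_BASE)  # treat as an escaped raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# An undecodable name survives the private-use scheme...
pua_roundtrip = pua_encode(pua_decode(raw))

# ...but a valid name that really contains U+F7E9 does not.
legit = "\uf7e9".encode("utf-8")
legit_roundtrip = pua_encode(pua_decode(legit))
```

The surrogate scheme never collides, because valid UTF-8 cannot contain an
encoded surrogate, but the decoded string holds a code point that is not a
legal character. The private-use scheme keeps every Char legal and pays for
it with the collision shown in the last two lines.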

I thought you were arguing against choice 1 and in favour of 2 in your
initial message?

These alternatives (plus the alternative of using bytestrings for all FS
operations) were discussed by the Python guys at PEP 383 time (see a
summary at e.g.
http://www.rhinocerus.net/forum/lang-python/557978-re-pep-383-non-decodable-bytes-system-character-interfaces.html).

> POSIX certainly *doesn't* make this assumption, and EBCDIC POSIX is
> possible, though may be hard to find in the wild.

As you say, EBCDIC is not *excluded* by POSIX, but this is a rarity even
amongst the already small population of *nix users. I am perfectly
happy to have my solution fail on EBCDIC machines! In any case, the
failure would only manifest when the EBCDIC user uses filenames that
are not decodable in EBCDIC... I cannot believe that this would ever
be a problem in practice.

> I believe that if you change your locale to a different character set, I
> don't think code sees different byte strings from the file system.

I am certain it does not; this is the whole problem prompting the
roundtripping debate.
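
A small Python 3 sketch of the point, under the assumption that a name was
written while a Latin-1 locale was active and is later read under a UTF-8
locale. The variable names are only illustrative.

```python
# The bytes stored by the file system never change; only the decoder does.
raw = b"caf\xe9"  # the name as written under a Latin-1 locale

# Under a Latin-1 locale the name decodes cleanly.
as_latin1 = raw.decode("latin-1")

# Under a UTF-8 locale the very same bytes are not decodable at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Without some escaping scheme, the second case leaves no String that can be
handed back to the file system to name the same file.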

> I'll test
> this tonight… If true, then assuming that the inferred locale is applicable
> to file paths seems equally broken.

I don't see why this should be so, as my understanding is that most
people who use a non-standard encoding do so system-wide. Anything
else would be hard to work with as you would have to be constantly
swapping between locales to get the right interpretation for your
filenames/file contents.

The Python people do this:

 1. On Windows, everything (command line arguments, file system
operations) goes through the Windows APIs, which are defined as using
UTF-16, so there is no problem
 2. On OS X, they assume UTF-8 for everything, which is fine as the
file system invariably uses UTF-8
 3. On Linux, they use the system-configured encoding (which you term
the "inferred locale")
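
For reference, this policy is exposed in the Python 3 standard library
through sys.getfilesystemencoding(), os.fsdecode() and os.fsencode(), none
of which are named in this thread. The sketch below assumes a POSIX system;
the helper name roundtrips is hypothetical.

```python
import os
import sys

# The encoding Python will use for file names on this system.
fs_encoding = sys.getfilesystemencoding()


def roundtrips(raw_name: bytes) -> bool:
    """Check that bytes -> str -> bytes gives back the original name."""
    text_name = os.fsdecode(raw_name)  # may contain escaped bytes
    return os.fsencode(text_name) == raw_name


if os.name == "posix":
    # Holds whether or not 0xE9 is decodable in the current locale.
    print(fs_encoding, roundtrips(b"caf\xe9.txt"))
```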

The Qt guys also use the locale encoding to decode/encode filenames
(see http://bugreports.qt.nokia.com/browse/QTBUG-5832).

In short, there is considerable precedent for the choice I am arguing
for and I *think* it is what most users would expect - though it is so
easy to be mistaken about other people's expectations!

Cheers,
Max
