On 7 May 2011 17:38, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> We have a choice. The current proposal maps the two following classes of
> file paths onto the same string, and so when encoding back to the system
> we must choose which it is -- the other class getting the short end of
> the stick:
>
>   File paths that don't decode.
>   File paths with a small range of private use characters.
It was always my intention to allow roundtripping of arbitrary bytestrings
through String. I don't think that the middle ground (where you can *read
in* a filename without error but cannot write it out correctly) is a good
idea, as this gives the charset decoder weird behaviour without any payoff
that I can see (i.e. no roundtripping).

> If the inferred encoding is one that has invalid encodings in this range
> (for example EBCDIC, though these kinds of encoding are rare), then this
> hack still results in some illegally encoded names failing to be
> encodable back to the system.

The locale encoding is *invariably* an ASCII superset, which is why the
Python hack works. To my knowledge, pretty much all software for *nix
makes this assumption -- without it you can't even call something like
printf on an ASCII string without going through iconv!

> If we encode in favour of all validly encoded strings, then bad encodings
> fail. However, I tried on Mac, and one can't actually create a file name
> with a bad UTF-8 sequence! I bet the same is true for Windows.

Yes, the problem is restricted to systems other than these two -- i.e.
Linux.

> In the end, I don't think it matters much which way we go here. The
> private use characters are highly unlikely to be in use. But then again,
> so are non-UTF-8 file paths from the system. If we presume UTF-8 as the
> encoding,

I would like to assume UTF-8 everywhere, but apparently many people
consider such an assumption broken, which is why we are talking about
doing this roundtripping in the first place.

I'm going to push a version of the patch that uses the 0xED00-0xEDFF
region of the Unicode BMP PUA for escaped characters. I do think this is
the better option (rather than using the surrogate code points), because
we will avoid confusing any Unicode transformations written in Haskell
that expect that Chars are actual Unicode characters.
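To make the proposed scheme concrete, here is a minimal sketch of the
escaping idea described above: undecodable bytes are mapped into the
U+ED00..U+EDFF PUA range on decode, and mapped back to the original bytes
on encode, so arbitrary byte strings roundtrip through String. All the
function names here are illustrative assumptions, and the toy "decoder"
handles only ASCII plus escapes -- this is not the actual patch, which
hooks into GHC's real locale decoding.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map a byte that failed to decode into the PUA escape range.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xED00 + fromIntegral b)

-- Recover the original byte from an escape character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xED00 <= n && n <= 0xEDFF = Just (fromIntegral (n - 0xED00))
  | otherwise                  = Nothing
  where n = ord c

-- Toy stand-in for the locale decoder: ASCII bytes decode directly,
-- anything else (which would fail to decode) is escaped into the PUA.
decodeBytes :: [Word8] -> String
decodeBytes = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = escapeByte b

-- Inverse direction: escape characters become their original bytes,
-- everything else is assumed to be plain ASCII in this sketch.
encodeString :: String -> [Word8]
encodeString = map step
  where
    step c = case unescapeChar c of
      Just b  -> b
      Nothing -> fromIntegral (ord c)

main :: IO ()
main = do
  let bytes = [0x66, 0x6F, 0x6F, 0xFF]  -- "foo" plus an invalid byte
  print (encodeString (decodeBytes bytes) == bytes)  -- prints True
```

Note that because U+ED00..U+EDFF are ordinary assigned-range scalar
values (unlike the lone surrogates Python uses), the intermediate String
is still made of legitimate Chars, which is exactly the point made above
about not confusing Unicode-aware Haskell code.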
Thanks for your input,

Max

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc