Re: [PATCH] Better encoding/decoding for GHC

Mark Lentczner Tue, 10 May 2011 16:41:27 -0700

> File paths that don't decode.
> File paths with a small range of private use characters.


>
> It was always my intention to allow roundtripping of arbitrary
> bytestrings through String. I don't think that the middle ground
> (where you can *read in* a filename without error but not write it out
> correctly) is a good idea, as this gives the charset decoder weird
> behaviour without any payoff that I can see (i.e. no roundtripping).
>

Well, if non-decodable bytes are encoded in the private use area, then
indeed there will be arbitrary byte strings that don't round trip no matter
how you slice it. Either file paths that don't decode *or* file paths that
contain those private use characters (rather, byte sequences that look like
valid UTF-8 encodings of those characters) will fail to round trip.
That is why the Python approach hides these beasts in a non-legal part of
the code space.
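
To make the collision concrete, here is a minimal Python sketch. The
private-use scheme is hypothetical: PUA_BASE, pua_decode and pua_encode are
names made up for illustration, not what the GHC patch does. The
"surrogateescape" error handler is Python's real mechanism (PEP 383).

```python
# Hypothetical scheme: an undecodable byte 0xNN is stored as U+EF00 + 0xNN.
# PUA_BASE is an arbitrary choice for this sketch.
PUA_BASE = 0xEF00


def pua_decode(raw: bytes) -> str:
    """Decode UTF-8, hiding undecodable bytes in the private use area."""
    s = raw.decode("utf-8", errors="surrogateescape")
    # surrogateescape yields U+DC00 + byte for each bad byte; move those to the PUA.
    return "".join(
        chr(PUA_BASE + ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )


def pua_encode(s: str) -> bytes:
    """Inverse of pua_decode: PUA escape characters become raw bytes again."""
    t = "".join(
        chr(0xDC00 + ord(c) - PUA_BASE)
        if PUA_BASE + 0x80 <= ord(c) <= PUA_BASE + 0xFF
        else c
        for c in s
    )
    return t.encode("utf-8", errors="surrogateescape")


def roundtrips_pua(raw: bytes) -> bool:
    return pua_encode(pua_decode(raw)) == raw


def roundtrips_surrogate(raw: bytes) -> bool:
    s = raw.decode("utf-8", errors="surrogateescape")
    return s.encode("utf-8", errors="surrogateescape") == raw


undecodable = b"caf\xe9.txt"             # Latin-1 bytes, not valid UTF-8
looks_escaped = "\uefe9".encode("utf-8")  # valid UTF-8 for a real PUA character
```

Under these assumptions the private-use scheme round-trips `undecodable` but
not `looks_escaped`, while the surrogate-based approach round-trips both,
because lone surrogates can never arrive from well-formed UTF-8 input.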

> The locale encoding is *invariably* an ASCII superset, which is why the
> Python hack works. To my knowledge, pretty much all software for *nix
> makes this assumption -- without it you can't even call something like
> printf of an ASCII string without going through iconv!
>

POSIX certainly *doesn't* make this assumption, and EBCDIC POSIX is
possible, though it may be hard to find in the wild. printf in POSIX and
strings in C are defined in such a way that you can call
printf("Hello World") on an EBCDIC system and see the expected greeting.
But agreed, one is unlikely to see this (though one does wonder about the
Linux variants that run on IBM 370 machines…)
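
As a rough illustration of what "not an ASCII superset" means, the sketch
below uses cp037, an EBCDIC code page that ships with Python's codecs. The
helper is_ascii_superset is a name invented for this example, and the check
only covers single-byte decoding of the 7-bit range.

```python
greeting = "Hello World"
ascii_bytes = greeting.encode("ascii")
ebcdic_bytes = greeting.encode("cp037")  # cp037: an EBCDIC code page


def is_ascii_superset(codec: str) -> bool:
    """True if every 7-bit byte decodes to the same code point as in ASCII."""
    for b in range(128):
        try:
            if bytes([b]).decode(codec) != chr(b):
                return False
        except UnicodeDecodeError:
            return False
    return True
```

The same string has different bytes in the two encodings, which is why a C
program on an EBCDIC system relies on the compiler's execution character set
rather than on ASCII byte values.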

> I would like to assume UTF-8 everywhere but apparently many people
> consider such an assumption broken, which is why we are talking about
> doing this roundtripping in the first place


I believe that if you change your locale to a different character set,
code does not see different byte strings from the file system. I'll test
this tonight… If true, then assuming that the inferred locale is applicable
to file paths seems equally broken. Poking around the web, it seems most
Linux distros have defaulted to UTF-8 file paths for a few years…
I'll look further.
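
For reference, a small Python sketch of the kind of check I have in mind. It
assumes a POSIX system whose file system accepts arbitrary bytes in names
(typical on Linux); raw_names and demo are names made up for this example.

```python
import os
import tempfile


def raw_names(dirpath: bytes) -> list:
    """List a directory through the bytes API, so no locale decoding happens."""
    return sorted(os.listdir(dirpath))


def demo() -> bytes:
    raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
    with tempfile.TemporaryDirectory() as d:
        dirpath = os.fsencode(d)
        # Create an empty file whose name is exactly these bytes.
        with open(os.path.join(dirpath, raw), "wb"):
            pass
        names = raw_names(dirpath)
    return names[0]
```

Passing a bytes path to os.listdir returns bytes names, so the result should
be the stored byte string whatever the locale says. Only the str API applies
the locale-derived encoding, and os.fsdecode / os.fsencode use
surrogateescape on POSIX to keep that lossless.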

- Mark

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc
