On 7 May 2011 17:38, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> We have a choice. The current proposal maps the two following classes of
> file paths onto the same string, and so when encoding back to the system
> we must choose which it is -- the other class getting the short end of
> the stick:
>
>   File paths that don't decode.
>   File paths with a small range of private use characters.
It was always my intention to allow roundtripping of arbitrary bytestrings
through String. I don't think that the middle ground (where you can *read
in* a filename without error but cannot write it out correctly) is a good
idea, as this gives the charset decoder weird behaviour without any payoff
that I can see (i.e. no roundtripping).

> If the inferred encoding is one that has invalid encodings in this range
> (for example EBCDIC, though these kinds of encoding are rare), then this
> hack still results in some illegally encoded names failing to be
> encodable back to the system.

The locale encoding is *invariably* an ASCII superset, which is why the
Python hack works. To my knowledge, pretty much all software for *nix
makes this assumption -- without it you can't even call something like
printf on an ASCII string without going through iconv!

> If we encode in favour of all validly encoded strings, then bad encodings
> fail. However, I tried on Mac, and one can't actually create a file name
> with a bad UTF-8 sequence! I bet the same is true for Windows.

Yes, the problem is restricted to systems other than these two -- i.e.
Linux.

> In the end, I don't think it matters much which way we go here. The
> private use characters are highly unlikely to be in use. But then again,
> so are non-UTF-8 file paths from the system. If we presume UTF-8 as the
> encoding,

I would like to assume UTF-8 everywhere, but apparently many people
consider such an assumption broken, which is why we are talking about
doing this roundtripping in the first place.

I'm going to push a version of the patch that uses the 0xED00-0xEDFF
region of the Unicode BMP PUA for escaped characters. I do think this is
the better option (rather than using the surrogate code points), because
we will avoid confusing any Unicode transformations written in Haskell
that expect that Chars are actual Unicode characters.
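To make the proposed scheme concrete, here is a minimal sketch of the
escaping idea described above: undecodable bytes are mapped into the
U+ED00..U+EDFF PUA range on decode, and mapped back to the original bytes
on encode, so arbitrary byte strings roundtrip through String. All the
function names here are illustrative assumptions, and the toy "decoder"
handles only ASCII plus escapes -- this is not the actual patch, which
hooks into GHC's real locale decoding.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map a byte that failed to decode into the PUA escape range.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xED00 + fromIntegral b)

-- Recover the original byte from an escape character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | 0xED00 <= n && n <= 0xEDFF = Just (fromIntegral (n - 0xED00))
  | otherwise                  = Nothing
  where n = ord c

-- Toy stand-in for the locale decoder: ASCII bytes decode directly,
-- anything else (which would fail to decode) is escaped into the PUA.
decodeBytes :: [Word8] -> String
decodeBytes = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = escapeByte b

-- Inverse direction: escape characters become their original bytes,
-- everything else is assumed to be plain ASCII in this sketch.
encodeString :: String -> [Word8]
encodeString = map step
  where
    step c = case unescapeChar c of
      Just b  -> b
      Nothing -> fromIntegral (ord c)

main :: IO ()
main = do
  let bytes = [0x66, 0x6F, 0x6F, 0xFF]  -- "foo" plus an invalid byte
  print (encodeString (decodeBytes bytes) == bytes)  -- prints True
```

Note that because U+ED00..U+EDFF are ordinary assigned-range scalar
values (unlike the lone surrogates Python uses), the intermediate String
is still made of legitimate Chars, which is exactly the point made above
about not confusing Unicode-aware Haskell code.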
Thanks for your input,

Max

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc