> File paths that don't decode.
> File paths with a small range of private use characters.
>
> It was always my intention to allow roundtripping of arbitrary
> bytestrings through String. I don't think that the middle ground
> (where you can *read in* a filename without error but not write it out
> correctly) is a good idea, as this gives the charset decoder weird
> behaviour without any payoff that I can see (i.e. no roundtripping).

Well, if non-decodable bytes are encoded in the private use area, then
there will indeed be byte strings that don't round trip, no matter how
you slice it. Either file paths that don't decode *or* file paths that
contain those private use characters (or rather, byte sequences that
look like well-formed UTF-8 encodings of those characters) will fail to
round trip. That is why the Python approach hides these beasts in a
non-legal part of the code space.

> The locale encoding is *invariably* an ASCII superset, which is why the
> Python hack works. To my knowledge, pretty much all software for *nix
> makes this assumption -- without it you can't even call something like
> printf of an ASCII string without going through iconv!

POSIX certainly *doesn't* make this assumption, and an EBCDIC POSIX
system is possible, though it may be hard to find in the wild. printf in
POSIX and strings in C are defined in such a way that you can call
printf("Hello World") on an EBCDIC system and see the expected greeting.
But agreed, we are unlikely to see this (though one does wonder about
the Linux variants that run on IBM 370 machines…)

> I would like to assume UTF-8 everywhere but apparently many people
> consider such an assumption broken, which is why we are talking about
> doing this roundtripping in the first place

I believe that if you change your locale to a different character set,
code does not see different byte strings from the file system. I'll
test this tonight… If true, then assuming that the inferred locale is
applicable to file paths seems equally broken.

Poking around the web, it seems most Linux distros have been set up for
UTF-8 file paths for a few years… I'll look further.

- Mark
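
P.S. To make the round-trip point above concrete, here is a rough Python
sketch. It is only an illustration under stated assumptions: the
"puaescape" handler, the PUA_BASE value of 0xEF00, and the helper names
are all made up for this example and are not what any particular
implementation does. The second half uses Python's built-in
"surrogateescape" error handler, which maps undecodable bytes to lone
surrogates.

```python
import codecs

# Hypothetical escape range in the private use area (an assumption
# made for this sketch only).
PUA_BASE = 0xEF00


def _pua_escape(err):
    # Decode error handler: map each undecodable byte to PUA_BASE + byte.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(PUA_BASE + b) for b in bad), err.end
    raise err


codecs.register_error("puaescape", _pua_escape)


def pua_decode(raw: bytes) -> str:
    return raw.decode("utf-8", errors="puaescape")


def pua_encode(text: str) -> bytes:
    # Turn escape characters back into raw bytes; encode the rest as UTF-8.
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def surrogate_roundtrip(raw: bytes) -> bytes:
    # Same idea, but using Python's built-in lone-surrogate escaping.
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# A path that does not decode as UTF-8 (a Latin-1 e-acute).
undecodable = b"caf\xe9"
# A path that legitimately contains the private use character U+EFE9.
literal_pua = "caf\uefe9".encode("utf-8")

# The undecodable path survives the private use scheme...
assert pua_encode(pua_decode(undecodable)) == undecodable
# ...but the literal one collides with it and comes back changed.
assert pua_decode(literal_pua) == pua_decode(undecodable)
assert pua_encode(pua_decode(literal_pua)) != literal_pua

# With surrogate escaping both survive, and so do bytes that merely
# look like an encoded surrogate, because strict UTF-8 rejects them.
for raw in (undecodable, literal_pua, b"\xed\xb3\xa9"):
    assert surrogate_roundtrip(raw) == raw
```

The private use scheme has to give up one of the two kinds of path,
because both decode to the same String. The surrogate scheme avoids the
collision only because well-formed UTF-8 can never produce a lone
surrogate in the first place.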
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc