On 11 May 2011 00:40, Mark Lentczner <mark.lentcz...@gmail.com> wrote:
> That is why the Python approach hides these beasts in a non-legal part of
> the code space.

Naturally. The choice is clear. The escapes should use either:

 1. The surrogate code points, in which case we can roundtrip any string,
    but we might confuse Unicode processing passes on the Haskell side, as
    they won't expect to see these code points

 2. Some region of the private-use plane, in which case we can't roundtrip
    those rare strings that contain these private-use characters, but we
    stand less chance of confusing software that expects each Char to be a
    valid Unicode character

I thought you were arguing against choice 1 and in favour of 2 in your
initial message?

These alternatives (plus the alternative of using bytestrings for all FS
operations) were discussed by the Python guys at PEP 383 time (see a
summary at e.g.
http://www.rhinocerus.net/forum/lang-python/557978-re-pep-383-non-decodable-bytes-system-character-interfaces.html).

> POSIX certainly *doesn't* make this assumption, and EBCDIC POSIX is
> possible, though may be hard to find in the wild.

As you say, it is not *excluded* by POSIX, but this is a rarity even amongst
the already small population of *nix users. I am perfectly happy to have my
solution fail on EBCDIC machines! In any case, the failure would only
manifest when the EBCDIC user uses filenames that are not decodable in
EBCDIC. I cannot believe that this would ever be a problem in practice.

> I believe that if you change your locale to a different character set, I
> don't think code sees different byte strings from the file system.

I am certain it does not; this is the whole problem prompting the
roundtripping debate.

> I'll test
> this tonight… If true, then assuming that the inferred locale is applicable
> to file paths seems equally broken.

I don't see why this should be so, as my understanding is that most people
who use a non-standard encoding do so system-wide. Anything else would be
hard to work with, as you would have to be constantly swapping between
locales to get the right interpretation for your filenames and file
contents.
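
To make the two escaping choices above concrete, here is a minimal Python sketch (not GHC code). Choice 1 uses the `surrogateescape` error handler that PEP 383 added to Python. For choice 2, the handler name `puaescape` and the base code point U+F700 are assumptions made up purely for illustration; neither Python nor GHC defines them.

```python
import codecs

# A filename as raw bytes that is NOT valid UTF-8 (0xE9 is Latin-1 'é').
raw = b"caf\xe9.txt"

# --- Choice 1: surrogate code points (PEP 383) ---
# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
decoded = raw.decode("utf-8", errors="surrogateescape")
roundtrip = decoded.encode("utf-8", errors="surrogateescape")

# The cost: the string is not valid Unicode text, so a strict encoder
# (standing in for a "Unicode processing pass") rejects it.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# --- Choice 2: a private-use region (hypothetical mapping) ---
PUA_BASE = 0xF700  # assumption: illustrative base, not a real convention


def pua_escape(err):
    """Map each undecodable byte 0xNN to the code point PUA_BASE + 0xNN."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(PUA_BASE + b) for b in bad), err.end
    raise err


codecs.register_error("puaescape", pua_escape)

pua_decoded = raw.decode("utf-8", errors="puaescape")

# The cost: a filename that legitimately contains U+F7E9 decodes to the
# same string, so the two byte strings can no longer be told apart.
legit_bytes = "caf\uf7e9.txt".encode("utf-8")
legit_decoded = legit_bytes.decode("utf-8", errors="puaescape")

print(repr(decoded), roundtrip == raw, strict_ok)
print(repr(pua_decoded), legit_decoded == pua_decoded, legit_bytes != raw)
```

With surrogates, the original bytes always come back, but strict Unicode consumers choke on the string. With the private-use mapping, every Char is a valid code point, but two distinct filenames collapse to one string, which is exactly the roundtrip failure described in choice 2.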

The Python people do this:

 1. On Windows, everything (command line arguments, file system operations)
    goes through the Windows APIs, which are defined as using UTF-16, so
    there is no problem

 2. On OS X, they assume UTF-8 for everything, which is fine as the file
    system invariably uses UTF-8

 3. On Linux, they use the system-configured encoding (which you term the
    "inferred locale")

The Qt guys also use the locale encoding to decode/encode filenames (see
http://bugreports.qt.nokia.com/browse/QTBUG-5832).

In short, there is considerable precedent for the choice I am arguing for,
and I *think* it is what most users would expect, though it is so easy to be
mistaken about other people's expectations!

Cheers,
Max

_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc