(Crud - Simon just pointed out that I accidentally sent my reply to just him, not the list. D'oh! -- Sorry for the tardy reply-all, all!)
On Wed, Apr 20, 2011 at 2:59 AM, Simon Marlow <marlo...@gmail.com> wrote:
> So that means filenames that are not legal in the current encoding won't
> round-trip? But wasn't that the problem that Max was originally trying to
> solve?

I think the major issue was mapping file paths to strings, without requiring every application to perform its own decoding/encoding. That, in turn, brought up the issue of what to do about file path octet sequences that don't match the expected (or any) encoding. (That problem existed for every application before; they probably just ignored it!)

We have a choice. The current proposal maps the two following classes of file paths onto the same string, so when encoding back to the system we must choose which one it is, and the other class gets the short end of the stick:

1. File paths that don't decode.
2. File paths with a small range of private-use characters.

If we encode in favor of file paths that don't decode (that is, encode U+F700 ~ U+F7FF as the bytes 0x00 ~ 0xFF), then we incur a raft of security issues, as input that passes various checks ("there is no / in the file name", for example) can be bypassed. The Python hack is to not encode 0x00 ~ 0x7F. If the inferred encoding is one that has invalid sequences in this range (for example EBCDIC, though such encodings are rare), then this hack still leaves some illegally encoded names unable to be encoded back to the system.

If we encode in favor of all validly encoded strings, then bad encodings fail. However, I tried on Mac, and one can't actually create a file name with a bad UTF-8 sequence! I bet the same is true for Windows.

I'd also be in favor of just presuming the encoding is UTF-8 (as it is on Mac and modern Linux, and Windows doesn't matter since we get the paths in Unicode anyway) rather than using the user's locale. In my tests, the file path system calls do not respect the locale setting, nor do I think the stdlib calls do either.
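To make the collision concrete, here is a minimal sketch of the escape scheme under discussion (the function names are hypothetical, not GHC's actual implementation): an undecodable byte is mapped into the private-use range U+F700 ~ U+F7FF, and mapped back when encoding to the system.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Escape a byte that failed to decode into the private-use range
-- U+F700..U+F7FF, so every file path still maps to some String.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xF700 + fromIntegral b)

-- Recover the original byte when encoding back to the system.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xF700 && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
  | otherwise                  = Nothing
  where n = ord c
```

The ambiguity is visible here: `escapeByte 0xFF` yields `'\xF7FF'`, so a file name that genuinely contains the character U+F7FF becomes indistinguishable from one containing the undecodable byte 0xFF — both classes map to the same `Char`, and `unescapeChar` must pick one interpretation.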
In the end, I don't think it matters much which way we go here. The private-use characters are highly unlikely to be in use; but then again, so are non-UTF-8 file paths from the system. If we presume UTF-8 as the encoding, then the security risk is lower, as we only have to worry about bytes 0x7F ~ 0xFF (assuming we detect decoding failures of multi-byte UTF-8 sequences early).

- Mark
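The Python-style mitigation mentioned above can be sketched as follows (a hypothetical helper, assuming the same U+F700 + byte mapping): by refusing to encode private-use characters back to bytes 0x00 ~ 0x7F, the escaped form of '/' (byte 0x2F) can never reappear as a real slash in the system path, so a string that passed a "no '/' in the file name" check stays safe.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Only map private-use chars back to bytes 0x80..0xFF. Characters
-- U+F700..U+F77F are rejected, so ASCII bytes such as '/' (0x2F)
-- cannot be smuggled past string-level checks via the escape.
unescapeCharSafe :: Char -> Maybe Word8
unescapeCharSafe c
  | n >= 0xF780 && n <= 0xF7FF = Just (fromIntegral (n - 0xF700))
  | otherwise                  = Nothing
  where n = ord c
```

Under a presumed-UTF-8 decoder, this restriction costs nothing: bytes 0x00 ~ 0x7F always decode as ASCII, so only 0x80 ~ 0xFF ever need escaping in the first place. It only bites for encodings (like EBCDIC) where low bytes can fail to decode.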
_______________________________________________
Cvs-ghc mailing list
Cvs-ghc@haskell.org
http://www.haskell.org/mailman/listinfo/cvs-ghc