Hi,

Joachim Breitner <nome...@debian.org> writes:
> On Wednesday, 14.05.2014 at 17:00 +0200, Robert Bihlmeyer wrote:
>> I don't think running a program with LC_CTYPE=*.UTF-8 means that all
>> filenames that it encounters have to be valid UTF-8.
>
> the problem is: What else should they be? The "String" type represents
> unicode characters, so using that for a file name requires them to be
> decoded somehow.

I agree, there is no straightforward solution.

Interestingly, most of the invalid UTF-8 I tried survived the round trip
through String. What doesn't work in these cases is outputting this
String -- but I wouldn't expect it to. But getFileStatus accepts the
String and stats the right file (this can be verified with
"strace -fe stat", for example).

Up to now I have found exactly one class of byte sequences that does not
work: illegal (overlong, i.e. non-shortest) encodings of ASCII
characters. The attached tar contains a filename with the two bytes C0
and B7 followed by '.txt'. C0 B7 is an invalid encoding of 0x37, i.e.
'7'. It looks like GHC accepts the invalid encoding and stores the result
as the normal character '7'. The error points in this direction:

  dirtest.hs: 7.txt: getFileStatus: does not exist (No such file or directory)

In contrast, an overlong encoding of 'ö' (U+00F6) as E0 83 B6 works
fine, as do the numerous other illegal combinations of high-bit-set
bytes I tried. So my assumption is that there is special casing if the
result of UTF-8 decoding is an ASCII character.

> I guess the solution, which you have found already, for uses where
> arbitrary filenames need to work is to use a type that is meant for
> that, i.e. ByteString.

Maybe deprecating the interfaces that assume UTF-8 clean filenames is
the solution. One (unfortunately) still can't assume that all the world
is UTF-8.

But most illegal sequences are round-trippable -- e.g. the E0 83 B6 from
above is not re-encoded/corrected to C3 B6. Therefore, my question is
whether the ASCII special case could be removed.
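To make the byte arithmetic above concrete, here is a minimal Haskell
sketch of a lenient decoder that does not reject overlong forms. It only
illustrates how C0 B7 collapses to '7' and E0 83 B6 to 'ö'. The names
decodeLenient, encodeShortest and isOverlong are hypothetical, and this
is not GHC's actual decoder, whose behaviour I have only observed from
the outside.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- | True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- | The six payload bits of a continuation byte.
payload :: Word8 -> Int
payload b = fromIntegral b .&. 0x3F

-- | Hypothetical lenient decoder for ONE sequence of 1 to 3 bytes.
-- It checks the lead/continuation bit patterns but, unlike a strict
-- decoder, does NOT reject overlong (non-shortest) forms.
decodeLenient :: [Word8] -> Maybe Int
decodeLenient [b]
  | b < 0x80 = Just (fromIntegral b)
decodeLenient [b0, b1]
  | b0 .&. 0xE0 == 0xC0, isCont b1
  = Just (((fromIntegral b0 .&. 0x1F) `shiftL` 6) .|. payload b1)
decodeLenient [b0, b1, b2]
  | b0 .&. 0xF0 == 0xE0, isCont b1, isCont b2
  = Just (((fromIntegral b0 .&. 0x0F) `shiftL` 12)
          .|. (payload b1 `shiftL` 6)
          .|. payload b2)
decodeLenient _ = Nothing

-- | Shortest-form UTF-8 encoding, for code points below 0x10000 only.
encodeShortest :: Int -> [Word8]
encodeShortest c
  | c < 0x80  = [fromIntegral c]
  | c < 0x800 = [ 0xC0 .|. fromIntegral (c `shiftR` 6)
                , 0x80 .|. fromIntegral (c .&. 0x3F) ]
  | otherwise = [ 0xE0 .|. fromIntegral (c `shiftR` 12)
                , 0x80 .|. fromIntegral ((c `shiftR` 6) .&. 0x3F)
                , 0x80 .|. fromIntegral (c .&. 0x3F) ]

-- | A sequence is overlong if it decodes leniently but re-encoding the
-- result in shortest form gives different bytes.
isOverlong :: [Word8] -> Bool
isOverlong bytes = case decodeLenient bytes of
  Just c  -> encodeShortest c /= bytes
  Nothing -> False

main :: IO ()
main = do
  print (decodeLenient [0xC0, 0xB7])        -- the '7' case from the tar
  print (decodeLenient [0xE0, 0x83, 0xB6])  -- the overlong 'ö' case
  print (encodeShortest 0xF6)               -- shortest form of U+00F6
  print (isOverlong [0xC0, 0xB7])
```

With this sketch, decodeLenient [0xC0, 0xB7] gives Just 55 (0x37, '7')
and decodeLenient [0xE0, 0x83, 0xB6] gives Just 246 (0xF6, 'ö'), whose
shortest form is C3 B6. If a decoder keeps only the decoded '7', a later
lookup would ask for "7.txt" rather than the name on disk, which would
match the error shown above.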
br,
--
Robert Bihlmeyer
ASSIST
Arrow ECS Internet Security AG <r.bihlme...@arrowecs.at>
A-1100 Wien, Wienerbergstraße 11
Tel: +43 1 370 94 40 Fax: +43 1 370 94 40-333