On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
> On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
> > Detecting non-UTF files is easy:
> > * false positives are impossible
> > * false negatives are extremely unlikely: combinations of letters that would
> >   happen to match a valid utf character don't happen naturally, and even if
> >   they did, every single combination in the file tested would need to match
> >   valid utf.
>
> Not that unlikely, and it is rather annoying that Firefox (and
> therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
> IMHO, in case of ambiguity, UTF-8 should always be preferred by
> default (applications could have options to change the preferences).

That's the opposite of what I'm talking about: it is hard to reliably detect
ancient encodings, because they tend to assign a character to every possible
byte value.  On the other hand, only certain combinations of bytes with the
8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with
good accuracy.  It is obviously trivial to fool such detection deliberately,
but such combinations don't occur in real-language text, and thus if
something validates as UTF-8, it is safe to assume it indeed is UTF-8.

> > On the other hand, detecting text files is hard.
>
> Deciding whether a file is a text file may be hard even for a human.
> What about text files with ANSI control sequences?

Same as, say, a Word97 document: not text for my purposes.  It might be just
coloured plain text, but there is no generic way to handle that.

Binary formats go more into subgoal 1 of my proposal: arbitrary Unicode
input that matches your syntax should be accepted, and go out uncorrupted
(not the same as unmodified).

> > One could use location: like, declaring stuff in /etc/ and
> > /usr/share/doc/ to be text unless proven otherwise, but that's an
> > incomplete hack.  Only hashbangs can be considered reliable, but
> > scripts are not where most documentation goes.
> >
> > Also, should HTML be considered text or not?  Updating http-equiv is not
> > rocket surgery, detecting HTML with fancy extensions can be.
>
> I think better questions could be: why do you want to regard a file as
> text? For what purpose(s)? For the "all shipped text files in UTF-8"
> rule only?

A shipped config file will have some settings the user may edit and comments
they may read.  Being able to see what's going on is a prerequisite here.

A perl/python/etc script is something our kind of folks often edit and/or
read.

A plain text file ships no encoding information, thus it can be neither
rendered nor edited comfortably if its encoding is different from the system
one.

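Since a plain text file carries no encoding label, the check has to work on
the bytes alone. The validation argument above (only certain sequences of
high-bit bytes are well-formed UTF-8) can be sketched roughly as follows.
This is a minimal illustration, assuming Python 3 and its built-in strict
UTF-8 decoder; the function names are made up for this example and do not
belong to any existing Debian tool.

```python
def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte has the 8th bit set (i.e. not pure ASCII)."""
    return any(b >= 0x80 for b in data)


def validates_as_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes) -> str:
    """Hypothetical three-way split: 'ascii', 'utf-8' or 'legacy'."""
    if not has_high_bytes(data):
        # Pure ASCII is valid in UTF-8 and in most legacy charsets alike.
        return "ascii"
    if validates_as_utf8(data):
        return "utf-8"
    # Some 8-bit encoding; which one cannot be told from the bytes alone.
    return "legacy"


if __name__ == "__main__":
    sample = "zażółć"
    print(classify(b"plain ascii"))                # ascii
    print(classify(sample.encode("utf-8")))        # utf-8
    print(classify(sample.encode("iso-8859-2")))   # legacy
    # Deliberately fooling the check: Latin-1 "Ã©" is also valid UTF-8 "é".
    print(classify("Ã©".encode("latin-1")))        # utf-8
```

The last case is the "trivial to fool deliberately" one: the result only says
that the bytes are well-formed UTF-8, not what the author intended. The
"legacy" result likewise says nothing about which legacy charset is in use.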
HTML can include an http-equiv declaration, which takes care of rendering,
but editing is still a problem.  And if you edit it, or, say, fill in some
fields from a database, you risk data loss.  If everything is UTF-8
end-to-end, this risk goes away.

(I do care about plain text more, though.)

> What about examples whose purpose is to have a file in a charset
> different from UTF-8?

Well, we don't convert those :)

I don't expect a package with a test suite that includes charset stuff to
make such an error by itself, but if there's a need, we could add a syntax
for exclusions.  For example, writing "verbatim" in the charset field.

> > 4a. perl and pod
> >
> > Considering perl to be text raises one more issue: pod.  By perl's design,
> > pod without a specified encoding is considered to be ISO-8859-1, even if
> > the file contains "use utf8;".  This is surprising, and many authors use
> > UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
> > example).  Thus, there should be a tool (preferably the one mentioned
> > above) that checks perl files for pod with undeclared encoding, and raises
> > alarm if the file contains any bytes with high bit set.  If a conversion
> > encoding is specified, such a declaration could be added automatically.
>
> Yes, undeclared encoding when not ASCII should be regarded as a bug.

And if it's declared but not UTF-8, I'd convert it at package build time.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20130812131659.ga21...@angband.pl