On Fri, 09 May 2003 02:31:43 +0200, Martin v. Löwis wrote:
> Bob Hilliard wrote:
> > 1. How can I determine what character encoding is used in a
> > document without manually scanning the entire file?
First off, for the examples you mentioned (foldoc and the jargon file),
the iso-8859-1 hypothesis is likely to be correct more than 99% of the
time, and you should ask yourself whether the residue matters much for
the tasks you want to accomplish. Also, the amount of 8-bit data is
likely so small that you can simply review it all, case by case.

The above paragraph was added as an afterthought, +after+ I had written
the rest of this message, and now I've spent too much time on writing
this to simply discard the rest of the text. But you can stop reading
here if you like. :-)

> You can't do that automatically, in general. If you know what text
> you expect, and you know the bytes you have in the file, you can
> try a number of encodings, and see which of the encodings gives the
> characters you expect. As a manual procedure, this is best done with the
> help of /usr/share/i18n/charmaps. This lists the Unicode character
> position, the encoding-specific byte [sequence], and the character name.
> So if you know you have \xe7, and you know it is c-cedilla, it could
> be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16},
> cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.

Actually, language guessers like <http://packages.debian.org/mguesser>
are by nature also coding-system guessers. If you write Italian in
UTF-8, it's going to look different from Italian in ISO-646-it (if such
a thing exists) or Italian in ISO-8859-1. (The difference between
8859-1 and 8859-15 is of course so minor that it is decidable only in
special circumstances, regardless of whether you are a human or a
computer.) So in practice, the guesser program has to guess at the
language and the coding system at the same time. This is often quite
doable, although some very sparsely populated coding systems probably
need a lot of input before they can be learned or categorized. (Dunno
which ones those would be -- I'm guessing some Far East codings might
be problematic.) As ever, there are language pairs which can't be
decided in all circumstances, either. Danish and Norwegian Bokmål are
so closely related as to be indistinguishable, especially in small
samples, occasionally even to native speakers.

Language categorization is typically based on n-gram analysis: you
break up the stream into overlapping fixed-length samples, so that
"strings of character", for instance, becomes str, tri, rin, ing, ngs,
"gs ", "s o", " of", "of ", "f c", " ch", cha, har, ara, rac, act, cte,
ter. The frequency distribution of these samples (in this example,
3-grams, aka trigrams) is often sufficient to make a good guess,
provided you have solid training data to compare against.
<http://odur.let.rug.nl/~vannoord/TextCat/> has some more background
and an on-line demo. The list of supported languages also has samples
of each -- quite instructive to look at. (Write me off-list for more
pointers.)

I have not seen any academic treatment of the coding-system aspect of
this problem, but the systems I've tried would generally cope with it,
more or less. Given some fair assumptions about the coherence of a
file's contents, you would only need to submit the first couple of
lines -- if even that -- to the guesser in order to get pretty accurate
results most of the time. The one big issue which remains to be solved
is to gather a representative amount of training material for each
language/encoding pair you want to be able to recognize.

Oh, and of course, don't expect human-produced text to be anything like
coherent in practice. |^5d k001 |-|4x0r d00dz are only the tip of a
very nonlinear iceberg. Even publishing-quality material is often not
really "quality" when you start to look into it.
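To make the trial-and-error part concrete, here is a rough Python
sketch along the lines of what Martin describes above. The candidate
list and the sample bytes are just placeholders of my own, and all the
code checks is which candidates decode the bytes without errors;
deciding whether the resulting characters are the ones you expect is
still up to you (or the charmaps).

    CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-15", "iso-8859-2", "cp1252"]

    def plausible_encodings(data, candidates=CANDIDATES):
        """Return the candidate encodings that decode `data` without errors."""
        hits = []
        for name in candidates:
            try:
                data.decode(name)
            except UnicodeDecodeError:
                continue
            hits.append(name)
        return hits

    sample = b"fa\xe7ade"  # \xe7 is c-cedilla in iso-8859-1, but not valid UTF-8 here
    print(plausible_encodings(sample))
    # Prints every single-byte Latin candidate, which is exactly the problem:
    # a clean decode only narrows the field, it does not pick a winner.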
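The n-gram part is also only a screenful of code if you just want the
basic idea. The toy sketch below is mine, not mguesser or TextCat; it
builds a ranked trigram profile per training text and compares profiles
with a simple rank-difference ("out-of-place") measure, which is
roughly what TextCat does. The model labels and file names in the
comments are made up. Since it works just as happily on raw bytes as on
decoded strings, you can train one model per language/encoding pair and
get the "guess both at once" behaviour described above.

    from collections import Counter

    def trigrams(text):
        """Overlapping 3-grams of `text` (a str or a bytes object)."""
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def profile(text, size=300):
        """The `size` most frequent trigrams, most frequent first."""
        return [g for g, _ in trigrams(text).most_common(size)]

    def out_of_place(sample_prof, trained_prof):
        """Sum of rank differences; lower means more similar."""
        rank = {g: i for i, g in enumerate(trained_prof)}
        worst = len(trained_prof)        # fixed penalty for unseen trigrams
        return sum(abs(i - rank[g]) if g in rank else worst
                   for i, g in enumerate(sample_prof))

    def guess(sample, models):
        """`models` maps labels such as "italian.iso-8859-1" to profiles."""
        sample_prof = profile(sample)
        return min(models, key=lambda label: out_of_place(sample_prof, models[label]))

    # Hypothetical usage; collecting the training files is the hard part:
    #   models = {"italian.utf-8": profile(open("it-utf8.txt", "rb").read()),
    #             "italian.iso-8859-1": profile(open("it-latin1.txt", "rb").read())}
    #   print(guess(open("mystery.txt", "rb").read()[:200], models))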
Hope this helps,

/* era */

(Sorry, not on the list -- if you have a reply for me personally,
please mail or at least Cc: me.)

-- 
Join the civilized world -- ban spam like we did! <http://www.euro.cauce.org/>
tee -a $HOME/.signature <$HOME/.plan >http://www.iki.fi/era/index.html