Paul Libbrecht wrote:
Clearly, then, something that matches words in a dictionary and decides
on the language based on the language of the majority could do a decent
job to decide the analyzer.
Does such a tool exist?
I once played around with http://ngramj.sourceforge.net/ for language
guessing. It did a good job. It doesn't use dictionaries for language
identification, but a statistical approach based on character n-grams.
I don't have precise numbers, but out of about 10,000 documents in
different languages (mostly English, German and French, a few in other
European languages such as Polish), only about 10 were not identified
correctly.
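
For illustration, the statistical n-gram idea mentioned above can be
sketched roughly as follows. This is only a minimal sketch of one common
variant (ranked character trigram profiles compared with an
"out-of-place" distance); it is not ngramj's code or API, and ngramj may
well work differently in detail. The training sentences, the names
SAMPLES, PROFILE_SIZE and guess_language, and the choice of Python are
all assumptions made up for this example. A real tool would train its
profiles on far more text.

```python
import re
from collections import Counter

# Assumption: cap on profile length, also used as the penalty for unseen trigrams.
PROFILE_SIZE = 300

# Assumption: tiny made-up training samples, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sleeps in the house while it is raining",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze schläft im haus während es regnet",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "le chat dort dans la maison pendant qu'il pleut",
}


def trigrams(text):
    """Yield character trigrams of each word, with the word padded by spaces."""
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f" {word} "
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]


def profile(text):
    """Map the most frequent trigrams to their rank (0 = most frequent)."""
    counts = Counter(trigrams(text))
    ranked = counts.most_common(PROFILE_SIZE)
    return {gram: rank for rank, (gram, _count) in enumerate(ranked)}


def distance(doc_profile, lang_profile):
    """Sum of rank differences; a trigram unknown to the language costs PROFILE_SIZE."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else PROFILE_SIZE
        for gram, rank in doc_profile.items()
    )


# Build one ranked profile per language from the training samples.
LANG_PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language code whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(LANG_PROFILES, key=lambda lang: distance(doc, LANG_PROFILES[lang]))
```

With these toy profiles, guess_language("die katze und der hund")
returns "de". The returned code could then be used to pick the analyzer
for a document. No dictionary is involved: only the frequency ranking of
short character sequences is compared.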
Till
--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de