Hi,

No, Tika doesn't do language identification (LangID). I haven't used ngramj, so I can't speak to its accuracy or speed, though I know the code has been around for years. Another LangID implementation is available at the URL below my name.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

________________________________
From: revathy arun <revas...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?

On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:
>
> Paul Libbrecht wrote:
>
>> Clearly, then, something that matches words in a dictionary and decides on
>> the language based on the language of the majority could do a decent job to
>> decide the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. It doesn't use dictionaries for language
> identification but a statistical approach using ngrams.
> I don't have any precise numbers, but out of about 10000 documents in
> different languages (most in English, German and French, few in other
> European languages like Polish) there were only some 10 not identified
> correctly.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>
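
For anyone unfamiliar with the statistical n-gram approach Till describes, here is a minimal sketch of the general idea: build a ranked list of the most frequent character n-grams per language, then pick the language whose ranking is closest to the document's. This is not ngramj's code or API. The function names (ngram_profile, out_of_place, guess_language), the toy training sentences, and the parameters n_max and top are all assumptions made up for illustration, and real profiles would need far more training text.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the `top` most frequent character n-grams (length 1..n_max),
    most frequent first. Words are padded with '_' to mark boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from the language profile costs the maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house with the children",
    "de": "der schnelle braune fuchs springt nach dem faulen hund und die katze ist in dem haus mit den kindern",
    "fr": "le renard brun rapide saute sur le chien paresseux et le chat est dans la maison avec les enfants",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns one of the keys "en", "de" or "fr". The result could then be used to choose an analyzer per document, which is the use case Paul asks about. No dictionary is involved, only character statistics.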