Hi,

No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak to its
accuracy or speed (though I know the code has been around for years).  Another
LangID implementation is at the URL below my name.

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 




________________________________
From: revathy arun <revas...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of a given document?



On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:
>
> Paul Libbrecht wrote:
>
>> Clearly, then, something that matches words against a dictionary and picks
>> the language of the majority of matches could do a decent job of choosing
>> the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. Rather than dictionaries, it uses a
> statistical approach based on n-grams.
> I don't have precise numbers, but out of about 10,000 documents in
> different languages (mostly English, German and French, a few in other
> European languages such as Polish) only about 10 were misidentified.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>
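
For readers unfamiliar with the statistical n-gram approach Till mentions, here
is a minimal sketch of the general idea. It is not ngramj's code, and I can't
say how closely ngramj follows it. It uses the rank-order profile comparison
known from Cavnar and Trenkle's n-gram text categorization work: build a ranked
list of the most frequent character n-grams per language, do the same for the
document, and pick the language whose ranking is closest. The function names,
the TRAINING sentences, and the parameters n_max and top_k are all made up for
illustration; real profiles would be built from far larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so ranks are stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    cost a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data (hypothetical, far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt auf den faulen hund und die katze sitzt auf der matte mit den anderen tieren",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat est assis sur le tapis avec les autres animaux",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the cat", PROFILES))
```

Once a language has been guessed this way, it could be used to choose the
analyzer for the document, which is what Paul was asking about. With training
data this small the guesses are only illustrative.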
