There are a number of freeware options here; just do some searching on your favorite Internet search engine.

TextCat is one of the more popular ones, as I seem to recall:
http://odur.let.rug.nl/~vannoord/TextCat/

I believe Karl Wettin submitted a Lucene patch for a language guesser (http://issues.apache.org/jira/browse/LUCENE-826), but it is marked as Won't Fix.

Nutch has a language identification plugin as well (you can find it via the search links below); it probably isn't too hard to extract the source from it for your needs.

Also see http://www.lucidimagination.com/search/?q=multilingual+detection and http://www.lucidimagination.com/search/?q=language+detection for more help.

If you're open to purchasing, several companies offer solutions, but I don't know that their quality is any better than what you can get through open source; generally speaking, the problem is solved with a high degree of accuracy through n-gram analysis.
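For the curious, the core n-gram trick is simple enough to sketch in a page of Java. The class, profile size, and training setup below are illustrative only, not any particular library's API:

import java.util.*;

// Rough sketch of TextCat-style n-gram language identification: build a
// ranked character n-gram profile per language from sample text, then
// classify a document by the "out-of-place" distance between its profile
// and each language profile.
public class NGramLanguageGuesser {

    private static final int MAX_N = 3;          // use 1- to 3-grams
    private static final int PROFILE_SIZE = 300; // keep the top 300 grams

    private final Map<String, List<String>> profiles =
        new HashMap<String, List<String>>();

    // Build a ranked profile for one language from a few KB of sample text.
    public void train(String language, String sampleText) {
        profiles.put(language, profile(sampleText));
    }

    // Return the trained language whose profile is closest, or null.
    public String guess(String text) {
        List<String> docProfile = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = outOfPlace(docProfile, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Count every 1..MAX_N character gram and rank by frequency.
    private static List<String> profile(String text) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        String s = text.toLowerCase();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                String gram = s.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
        }
        List<String> ranked = new ArrayList<String>(counts.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a);
            }
        });
        return ranked.subList(0, Math.min(PROFILE_SIZE, ranked.size()));
    }

    // Sum of rank differences; grams absent from the reference profile
    // pay a fixed penalty.
    private static int outOfPlace(List<String> doc, List<String> ref) {
        int distance = 0;
        for (int i = 0; i < doc.size(); i++) {
            int refRank = ref.indexOf(doc.get(i));
            distance += (refRank < 0) ? PROFILE_SIZE : Math.abs(refRank - i);
        }
        return distance;
    }
}

Train it on a few KB of sample text per language, then call guess() on each incoming document before choosing the analyzer.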

-Grant

On Feb 17, 2009, at 11:57 AM, revathy arun wrote:

Hi Otis,

But this is not freeware, right?




On 2/17/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

Hi,

No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak to its accuracy or speed (but I know the code has been around for years). Another LangID implementation is at the URL below my name.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: revathy arun <revas...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:

Paul Libbrecht wrote:

Clearly, then, something that matches words in a dictionary and decides on the language based on the language of the majority could do a decent job to decide the analyzer.

Does such a tool exist?
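(A toy version of that dictionary-majority idea might look like the sketch below; the word lists are placeholders, and a real tool would load full dictionaries.)

import java.util.*;

// Toy sketch of dictionary-based language guessing: look each token up
// in per-language word sets and pick the language with the most matches.
public class DictionaryLanguageGuesser {

    private final Map<String, Set<String>> dictionaries =
        new HashMap<String, Set<String>>();

    public DictionaryLanguageGuesser() {
        // Placeholder word lists; load real dictionaries in practice.
        dictionaries.put("en", new HashSet<String>(Arrays.asList("the", "and", "of", "to")));
        dictionaries.put("de", new HashSet<String>(Arrays.asList("der", "und", "die", "das")));
        dictionaries.put("fr", new HashSet<String>(Arrays.asList("le", "et", "la", "les")));
    }

    // Majority vote over all tokens; returns null if nothing matched.
    public String guess(String text) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        // Split on anything that isn't a letter (Unicode-aware).
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            for (Map.Entry<String, Set<String>> dict : dictionaries.entrySet()) {
                if (dict.getValue().contains(token)) {
                    Integer v = votes.get(dict.getKey());
                    votes.put(dict.getKey(), v == null ? 1 : v + 1);
                }
            }
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestVotes = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}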


I once played around with http://ngramj.sourceforge.net/ for language guessing. It did a good job. It doesn't use dictionaries for language identification but a statistical approach using n-grams. I don't have any precise numbers, but out of about 10000 documents in different languages (most in English, German and French, a few in other European languages like Polish), only some 10 were not identified correctly.

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
