There are a number of freeware options here; just do some searching on your favorite Internet search engine.

TextCat is one of the more popular ones, as I seem to recall:
http://odur.let.rug.nl/~vannoord/TextCat/

I believe Karl Wettin submitted a Lucene patch for a language guesser (http://issues.apache.org/jira/browse/LUCENE-826), but it is marked as Won't Fix.

Nutch has a language identification plugin as well (you can find it via the search links below); it probably isn't too hard to extract the source from it for your needs.

Also see http://www.lucidimagination.com/search/?q=multilingual+detection and http://www.lucidimagination.com/search/?q=language+detection for more help.

If you're open to purchasing, several companies offer solutions, but I don't know that their quality is any better than what you can get through open source; generally speaking, the problem is solved with a high degree of accuracy through n-gram analysis.
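For the curious, the core n-gram trick is simple enough to sketch in a page of Java. The class, profile size, and training setup below are illustrative only, not any particular library's API:

import java.util.*;

// Rough sketch of TextCat-style n-gram language identification: build a
// ranked character n-gram profile per language from sample text, then
// classify a document by the "out-of-place" distance between its profile
// and each language profile.
public class NGramLanguageGuesser {

    private static final int MAX_N = 3;          // use 1- to 3-grams
    private static final int PROFILE_SIZE = 300; // keep the top 300 grams

    private final Map<String, List<String>> profiles =
        new HashMap<String, List<String>>();

    // Build a ranked profile for one language from a few KB of sample text.
    public void train(String language, String sampleText) {
        profiles.put(language, profile(sampleText));
    }

    // Return the trained language whose profile is closest, or null.
    public String guess(String text) {
        List<String> docProfile = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = outOfPlace(docProfile, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Count every 1..MAX_N character gram and rank by frequency.
    private static List<String> profile(String text) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        String s = text.toLowerCase();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                String gram = s.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
        }
        List<String> ranked = new ArrayList<String>(counts.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a);
            }
        });
        return ranked.subList(0, Math.min(PROFILE_SIZE, ranked.size()));
    }

    // Sum of rank differences; grams absent from the reference profile
    // pay a fixed penalty.
    private static int outOfPlace(List<String> doc, List<String> ref) {
        int distance = 0;
        for (int i = 0; i < doc.size(); i++) {
            int refRank = ref.indexOf(doc.get(i));
            distance += (refRank < 0) ? PROFILE_SIZE : Math.abs(refRank - i);
        }
        return distance;
    }
}

Train it on a few KB of sample text per language, then call guess() on each incoming document before choosing the analyzer.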

-Grant

On Feb 17, 2009, at 11:57 AM, revathy arun wrote:

Hi Otis,

But this is not freeware, right?




On 2/17/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

Hi,

No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak to its accuracy or speed (but I know the code has been around for years). Another LangID implementation is at the URL below my name.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: revathy arun <revas...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:

Paul Libbrecht wrote:

Clearly, then, something that matches words in a dictionary and decides on the language based on the language of the majority could do a decent job to decide the analyzer.

Does such a tool exist?
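(A toy version of that dictionary-majority idea might look like the sketch below; the word lists are placeholders, and a real tool would load full dictionaries.)

import java.util.*;

// Toy sketch of dictionary-based language guessing: look each token up
// in per-language word sets and pick the language with the most matches.
public class DictionaryLanguageGuesser {

    private final Map<String, Set<String>> dictionaries =
        new HashMap<String, Set<String>>();

    public DictionaryLanguageGuesser() {
        // Placeholder word lists; load real dictionaries in practice.
        dictionaries.put("en", new HashSet<String>(Arrays.asList("the", "and", "of", "to")));
        dictionaries.put("de", new HashSet<String>(Arrays.asList("der", "und", "die", "das")));
        dictionaries.put("fr", new HashSet<String>(Arrays.asList("le", "et", "la", "les")));
    }

    // Majority vote over all tokens; returns null if nothing matched.
    public String guess(String text) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        // Split on anything that isn't a letter (Unicode-aware).
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            for (Map.Entry<String, Set<String>> dict : dictionaries.entrySet()) {
                if (dict.getValue().contains(token)) {
                    Integer v = votes.get(dict.getKey());
                    votes.put(dict.getKey(), v == null ? 1 : v + 1);
                }
            }
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestVotes = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}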


I once played around with http://ngramj.sourceforge.net/ for language guessing. It did a good job. It doesn't use dictionaries for language identification but a statistical approach using n-grams. I don't have any precise numbers, but out of about 10000 documents in different languages (most in English, German and French, a few in other European languages like Polish), only some 10 were not identified correctly.

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
