There are a number of freeware options here; a quick search on your
favorite Internet search engine will turn up several.
TextCat is one of the more popular ones, as I seem to recall:
http://odur.let.rug.nl/~vannoord/TextCat/
I believe Karl Wettin submitted a Lucene patch for a language guesser: http://issues.apache.org/jira/browse/LUCENE-826
but it is marked as Won't Fix.
Nutch has a Language Identification plugin as well (the document in
the link below), and it probably isn't too hard to extract the source
from it for your needs.
Also see http://www.lucidimagination.com/search/?q=multilingual+detection
and http://www.lucidimagination.com/search/?q=language+detection for help.
If you are looking to buy, several companies offer solutions, but I
don't know that their quality is any better than what you can get
through open source. Generally speaking, the problem is solved with a
high degree of accuracy through n-gram analysis.
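
To make "n-gram analysis" a bit more concrete, here is a minimal Python
sketch of one common flavor of it: build a ranked profile of character
n-grams per language, then pick the language whose profile is closest
to the document's by an "out-of-place" rank distance. This is only an
illustration under my own assumptions, not the code of TextCat, ngramj,
or the Nutch plugin. The function names, the tiny training sentences,
and the parameters n_max and top_k are all made up for the example; a
real guesser would train on far more text per language.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (length 1..n_max),
    most frequent first. Words are padded with '_' to mark boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty (the length of that profile)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )

def guess_language(text, profiles):
    """Return the language whose profile has the smallest distance to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical, deliberately tiny training samples (illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "sleeps in the sun while the fox runs away",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "schläft der hund in der sonne während der fuchs wegläuft",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis "
          "le chien dort au soleil pendant que le renard s'enfuit",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the fox", PROFILES))
    print(guess_language("der hund und der fuchs", PROFILES))
```

With profiles this small the guesser is only reliable on text that
resembles the training sentences; the accuracy people report for real
tools comes from much larger per-language profiles.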
-Grant
On Feb 17, 2009, at 11:57 AM, revathy arun wrote:
Hi Otis,
But this is not freeware, right?
On 2/17/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
Hi,
No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak
for its accuracy or speed (but I know the code has been around for
years). Another LangID implementation is at the URL below my name.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
________________________________
From: revathy arun <revas...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage
Does Apache Tika help find the language of the given document?
On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:
Paul Libbrecht wrote:
Clearly, then, something that matches words in a dictionary and
decides on the language based on the language of the majority could
do a decent job to decide the analyzer.
Does such a tool exist?
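
For reference, the dictionary-and-majority idea described above could
be sketched roughly as follows in Python. This is an assumption-laden
toy, not an existing tool: the word lists are a handful of hypothetical
stop words per language, and guess_by_dictionary is a name invented for
this example. Each token that appears in a language's word list counts
as one vote, and the language with the most votes wins.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists (illustration only); a real dictionary
# approach would need much larger lists per language.
WORDLISTS = {
    "en": {"the", "and", "is", "of", "to", "not"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "de"},
}

def guess_by_dictionary(text, wordlists):
    """Return the language with the most dictionary hits, or None if no token matches."""
    tokens = re.findall(r"\w+", text.lower())
    votes = Counter()
    for token in tokens:
        for lang, words in wordlists.items():
            if token in words:
                votes[lang] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    print(guess_by_dictionary("Der Hund und die Katze", WORDLISTS))
    print(guess_by_dictionary("The cat and the dog", WORDLISTS))
```

The winning language code could then be used to select the analyzer,
for example via a plain mapping from language code to analyzer name.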
I once played around with http://ngramj.sourceforge.net/ for language
guessing. It did a good job. It doesn't use dictionaries for language
identification but a statistical approach using n-grams.
I don't have any precise numbers, but out of about 10,000 documents in
different languages (most in English, German and French, a few in
other European languages like Polish), only about 10 were not
identified correctly.
Till
--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search