Re: Indexing documents in multiple languages

Otis Gospodnetic Wed, 28 Jan 2009 13:35:09 -0800

Alejandro,

What you really want to do is identify the language of each email, store that 
in the index, and apply the appropriate analyzer for that language.  At query 
time you also want to know the language of the query (either by detecting it or 
asking the user or ...), so the query goes through the same analyzer.
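
To make that concrete, here is a rough sketch in Python rather than actual
Solr configuration. Everything in it is an assumption for illustration: the
stopword lists stand in for a real language detector, and detect_language,
build_document, query_field and the body_<lang> field names are hypothetical.
The idea is only that each language gets its own field (and so its own
analyzer), and that the query is routed to the matching field.

```python
import re

# Tiny stopword lists used as a stand-in for a real language detector
# (hypothetical; a production setup would use an n-gram based detector).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "es": {"el", "la", "los", "las", "de", "que", "y", "en", "es", "para"},
}


def detect_language(text, default="general"):
    """Return the language whose stopwords occur most often, or `default`."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    # No evidence at all, or a tie between languages: fall back to a
    # language-neutral field that would be analyzed without stemming.
    if top == 0 or list(scores.values()).count(top) > 1:
        return default
    return best


def build_document(doc_id, body):
    """Store the detected language and put the body in a per-language field."""
    lang = detect_language(body)
    return {"id": doc_id, "language": lang, "body_" + lang: body}


def query_field(query, user_language=None):
    """Pick the field to search: trust the user's choice, else detect."""
    lang = user_language or detect_language(query)
    return "body_" + lang
```

In a schema, each body_<lang> field would be given a field type with that
language's stemmer, and body_general a type with no stemming, which also
covers the emails whose language cannot be detected reliably.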


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Alejandro Valdez <alejandro.val...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 27, 2009 3:05:40 PM
> Subject: Indexing documents in multiple languages
> 
> Hi, I plan to use Solr to index a large number of documents extracted
> from email bodies. These documents could be in different languages,
> and a single document could be in more than one language. In the same
> way, the query string could contain words in different languages.
> 
> I read that a common approach to index multilingual documents is to
> use some algorithm (n-gram) to determine the document language, then use a
> stemmer and finally index the document in a different index for each
> language.
> 
> As the language of the document and of the query string can't be detected
> in a reliable way, I think it makes no sense to use a stemmer on them,
> because a stemmer is tied to a specific language.
> 
> My plan is to index all the documents in the same index, without any
> stemming process (the users will have to search for the exact words that
> they are looking for).
> 
> But I'm not sure whether this approach will make the index too big or too
> slow, or whether there is a better way to index this kind of document.
> 
> Any suggestions would be much appreciated.
