Hi, I plan to use Solr to index a large number of documents extracted
from email bodies. These documents could be in different languages,
and a single document could contain more than one language. Likewise,
the query string could contain words in different languages.

I read that a common approach to indexing multilingual documents is to
use some algorithm (e.g. n-gram based) to determine the document
language, then apply a stemmer for that language, and finally index
the document in a separate index for each language.
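
To make this concrete, here is a rough sketch (plain Python, not Solr)
of the detection and routing steps as I understand them. The sample
texts, the function names and the "docs_<lang>" index names are all
made up for illustration, and the stemming step is left out:

```python
from collections import Counter

# Tiny, hypothetical training samples. A real detector would need far
# more text per language to be of any use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "the text that we want to index",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y este "
          "es el texto que queremos indexar",
}


def trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built from the samples above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile shares the most trigrams."""
    grams = trigrams(text)
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)


def route(text):
    """Pick a (hypothetical) per-language index name for a document."""
    # A language-specific stemmer would be applied here before indexing.
    return "docs_" + guess_language(text)
```

Note that guess_language() always returns exactly one label, so a
document that mixes two languages is still forced into a single
per-language index.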

As neither the document language nor the query language can be
detected reliably, I don't think it makes sense to apply a stemmer,
because a stemmer is tied to a specific language.

My plan is to index all the documents in the same index, without any
stemming process (the users will have to search for the exact words that
they are looking for).
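
A minimal sketch of what I mean, again in plain Python rather than
Solr: a single inverted index with exact-word matching. The class and
function names are hypothetical, and lowercasing the tokens is my own
assumption (it is the only normalization applied):

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercased words; no stemming is applied."""
    return re.findall(r"\w+", text.lower())


class ExactIndex:
    """One index for all languages, mapping each word to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = ExactIndex()
index.add(1, "The houses are big. Las casas son grandes.")
index.add(2, "A small house")
```

With this, a search for "house" finds only document 2, because
"houses" in document 1 is a different exact word. That is the
trade-off I would be accepting by skipping stemming.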

But I'm not sure whether this approach will make the index too big or
too slow, or whether there is a better way to index this kind of
document.

Any suggestions would be much appreciated.
