Hi, I plan to use Solr to index a large number of documents extracted from email bodies. These documents can be in different languages, and a single document can contain more than one language. Likewise, a query string can contain words from different languages.
I read that a common approach to indexing multilingual documents is to detect the document language with some algorithm (for example, one based on n-grams), apply a stemmer for that language, and finally index the document in a separate index per language. Since neither the language of a document nor that of a query string can be detected reliably here (both may mix languages), I think it makes no sense to use a stemmer, because a stemmer is tied to a specific language. My plan is to index all the documents in the same index, without any stemming (users will have to search for the exact words they are looking for). But I'm not sure whether this approach will make the index too big or too slow, or whether there is a better way to index this kind of document. Any suggestions would be much appreciated.
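For reference, this is roughly the kind of n-gram language guessing I mean. It is only a minimal Python sketch: the tiny training samples, the function names, and the cosine-similarity scoring are my own assumptions for illustration, not what Solr or any particular library does.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy training samples (assumption): a real detector needs far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "message that was sent with the report",
    "es": "el zorro salta sobre el perro perezoso y este es el mensaje "
          "que fue enviado con el informe",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def guess_language(text):
    """Return (best_language, scores) by comparing the text to each profile."""
    grams = char_ngrams(text)
    scores = {lang: similarity(grams, profile) for lang, profile in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best, scores

if __name__ == "__main__":
    for text in ("this is the report that was sent",
                 "este es el informe que fue enviado",
                 "the report fue enviado"):  # mixed-language input
        print(text, "->", guess_language(text))
```

The last input mixes both languages, which is exactly my case: the detector still has to return a single label, so whichever stemmer gets picked would be wrong for part of the text.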