I know BasisTech.com has a plugin for Elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something similar may exist there as well.
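On the second question, here is a rough sketch of the prefix idea from the original message (prepending the ISO 639-1 code to the field value and dispatching on it at analysis time). It is plain Java with no Lucene dependency, only to show the control flow. The class name, the `|` separator, and the toy suffix-stripping "stemmers" are all assumptions made up for illustration; in a real setup the same dispatch would presumably sit inside a custom Lucene tokenizer or filter and delegate to real stemmers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Sketch only: routes a "xx|text" field value to a per-language stemmer. */
final class LanguagePrefixStemming {

    // Hypothetical separator between the ISO 639-1 code and the actual text.
    static final char SEPARATOR = '|';

    // Toy stand-ins for real stemmers; they only strip one suffix and are
    // NOT linguistically correct. Replace with real per-language stemmers.
    static final Map<String, UnaryOperator<String>> STEMMERS = new HashMap<>();
    static {
        STEMMERS.put("en", t -> t.length() > 3 && t.endsWith("s")
                ? t.substring(0, t.length() - 1) : t);
        STEMMERS.put("de", t -> t.length() > 4 && t.endsWith("en")
                ? t.substring(0, t.length() - 2) : t);
    }

    /** Splits off the language prefix (if present), then lowercases and stems each token. */
    static List<String> analyze(String fieldValue) {
        String lang = "";
        String text = fieldValue;
        // Treat the value as prefixed only if it starts with two letters plus the separator.
        if (fieldValue.length() >= 3
                && Character.isLetter(fieldValue.charAt(0))
                && Character.isLetter(fieldValue.charAt(1))
                && fieldValue.charAt(2) == SEPARATOR) {
            lang = fieldValue.substring(0, 2).toLowerCase(Locale.ROOT);
            text = fieldValue.substring(3);
        }
        // Unknown or missing language: fall back to no stemming at all.
        UnaryOperator<String> stemmer =
                STEMMERS.getOrDefault(lang, UnaryOperator.identity());

        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(stemmer.apply(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("en|Running dogs"));
        System.out.println(analyze("de|Katzen laufen"));
        System.out.println(analyze("no prefix here"));
    }
}
```

The obvious weakness is the one you already point out: the language travels inside the field value, so the same prefix has to be added at query time, and stored values need the prefix stripped before display. Unprefixed text that happens to start with two letters and the separator would also be misread as carrying a language code.
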
Cheers,
-Chris.

----------------------------------------
From: "Eugene" <beyondcomp...@gmail.com>
Sent: Wednesday, July 30, 2014 1:48 PM
To: solr-user@lucene.apache.org
Subject: Implementing custom analyzer for multi-language stemming

Hello, fellow Solr and Lucene users and developers!

In our project we receive text from users in different languages. We detect the language automatically and use the Google Translate APIs a lot (so having an arbitrary number of languages in our system doesn't concern us). However, we need to be able to search using stemming. Listing nearly a hundred fields in our search query (several fields for each language, each with a language-specific stemmer) is not an option. So we need a single index that holds stemmed tokens for different languages.

I have two questions:

1. Are there already (third-party) custom multi-language stemming analyzers? (I doubt we are the first to run into this issue.)

2. If I'm going to implement such an analyzer myself, could you please suggest a good way to 'pass' the detected language value into it? Detecting the language in the analyzer itself is not an option, because:
a) we already detect it elsewhere
b) we do it based on the combined values of many fields ('name', 'topic', 'description', etc.), while the current field can be too short for reliable detection
c) sometimes we just want to specify the language explicitly.

The obvious hack would be to prepend the ISO 639-1 code to the field value. But I'd like to believe that Solr allows for a cleaner solution. I can think of either:
a) a custom query parameter (but I guess it would require modifying request handlers, etc., which is highly undesirable)
b) getting the value from another field (we obviously have a 'language' field, and we do not have mixed-language records).

If this is possible, could you please describe the mechanism for doing it or point to relevant code examples?

Thank you very much and have a good day!