You would have to carefully analyze the source code and tables of these two stemmers to determine if one might incorrectly stem words in the other language. Technically, that could be fine for indexing, but it might give users some unexpected results for queries. There might also be cases where the second stemmer would stem a term that was already stemmed by the first stemmer.

You could mitigate the latter issue with the duplicate-token technique, which indexes the unstemmed original alongside the stemmed form. For a single stemmer this is generally:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
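
For context, here is a rough sketch of how that chain might sit inside a complete field type in schema.xml. The names text_en_stem and body are made up for illustration, and the comments reflect my understanding of what each filter contributes, so check the result on the Analysis screen before relying on it.

```xml
<!-- Hypothetical field type name; pick whatever fits your schema. -->
<fieldType name="text_en_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every token twice: one copy flagged as a keyword, one not. -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Stemmers skip keyword-flagged tokens, so only the unflagged copy is stemmed. -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- Drops the second copy when stemming left it identical to the original. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above. -->
<field name="body" type="text_en_stem" indexed="true" stored="true"/>
```

With a chain like this, a word such as "running" should end up indexed as both "running" and "run" at the same position, while a word the stemmer leaves unchanged is indexed only once.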

For two (or more) languages:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

This would index the original term along with whatever the stemmers produce from it: a term stemmed by both languages, by either language, or by neither, as the case may be.

-- Jack Krupansky

-----Original Message-----
From: Croci Francesco Luigi (ID SWS)
Sent: Friday, March 14, 2014 8:17 AM
To: solr-user@lucene.apache.org
Subject: analyzer with multiple stem-filters for more languages

Is it possible to define an analyzer with more than one stem filter, for several languages?

Something like this:

<analyzer type="index">
               ...
<filter class="solr.PorterStemFilterFactory"/>  <!-- default for English -->
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
</analyzer>

Greetings
Francesco
