Re: analyzer with multiple stem-filters for more languages

Jack Krupansky Fri, 14 Mar 2014 17:01:22 -0700

You would have to carefully analyze the source code and tables of these two stemmers to determine whether one might incorrectly stem words in the other language. Technically, that could be fine for indexing, but it might give users some unexpected results for queries. There might also be cases where the second stemmer would stem a term that was already stemmed by the first stemmer.

You could avoid the latter issue by using the duplicate token technique. For a single stemmer this is generally:


<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

For two (or more) languages:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

This would produce the stemmed term for both languages, or either language, or neither, as the case may be.
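
For context, here is a rough sketch of how that filter chain might sit inside a complete field type in schema.xml. The names "text_en_de" and "content" are made up for illustration, and positionIncrementGap="100" is just the value commonly seen in the example schemas. As I understand it, KeywordRepeatFilterFactory emits each token twice, once marked as a keyword (which the stemmers leave alone) and once unmarked, so the unmarked copy still passes through both stemmers in order. It is worth checking the actual output for a few English and German words on the Analysis page of the Solr admin UI.

```xml
<!-- Hypothetical field type name; adjust to your own schema. -->
<fieldType name="text_en_de" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer element with no type attribute applies to both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits each token twice: one copy marked as keyword, one unmarked. -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <!-- Drops the second copy when stemming left it identical to the first. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above. -->
<field name="content" type="text_en_de" indexed="true" stored="true"/>
```

Using the same analyzer at index and query time keeps the stemmed forms consistent on both sides, which should reduce the unexpected query results mentioned above.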


-- Jack Krupansky

-----Original Message-----
From: Croci Francesco Luigi (ID SWS)
Sent: Friday, March 14, 2014 8:17 AM
To: solr-user@lucene.apache.org
Subject: analyzer with multiple stem-filters for more languages

Is it possible to define an analyzer with more than one stem filter for multiple languages?


Something like this:

<analyzer type="index">
               ...
<filter class="solr.PorterStemFilterFactory"/>  <!-- default for English -->
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
</analyzer>

Greetings

Francesco
