I've just started experimenting with the solr.KeywordMarkerFilterFactory in
Solr 3.1, and I'm seeing some strange behavior. It seems that every word
subsequent to a protected word is also treated as being protected.
For testing purposes, I have put the word "spelling" in my protwords.txt. If I
do a test for "spelling bees" in the analyze tool, the stemmer produces
"spelling bees" - nothing is stemmed. But if I do a test for "bees spelling",
I get "bee spelling", the expected result with "bees" stemmed but "spelling"
left unstemmed. I have tried longer examples, and in every case all of
the words before "spelling" get stemmed, but none of the words after
"spelling" do. With verbose mode turned on in the analysis tool,
I can see that the "keyword" attribute set by
solr.KeywordMarkerFilterFactory matches the stemming behavior, so
I think the solr.KeywordMarkerFilterFactory component is to blame, and not
anything later in the analysis chain.
Any idea what might be going wrong? Is this a known issue?
Here is my field type definition, in case it makes a difference:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
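In case it helps anyone trying to reproduce this, a stripped-down field type
along the following lines might be enough to isolate the filter. This is only
an untested sketch: the name "text_kwtest" is made up for illustration, and
solr.WhitespaceTokenizerFactory is swapped in for the ICU tokenizer purely to
take the ICU components out of the picture, so the behavior may well differ
from what I described above. It assumes protwords.txt contains the single
line "spelling".

```xml
<!-- Hypothetical minimal field type: tokenizer, keyword marker, stemmer only -->
<fieldType name="text_kwtest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: plain whitespace tokenizer instead of solr.ICUTokenizerFactory -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Marks tokens listed in protwords.txt so the stemmer should skip them -->
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Running "spelling bees" and "bees spelling" through the analysis tool against
this type, and then again with the ICU tokenizer put back, should at least
show whether the problem depends on the rest of the chain.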
thanks,
Demian