Bug in solr.KeywordMarkerFilterFactory?

Demian Katz Wed, 20 Apr 2011 11:05:03 -0700

I've just started experimenting with the solr.KeywordMarkerFilterFactory in 
Solr 3.1, and I'm seeing some strange behavior.  It seems that every word 
subsequent to a protected word is also treated as being protected.


For testing purposes, I have put the word "spelling" in my protwords.txt.  If I 
do a test for "spelling bees" in the analyze tool, the stemmer produces 
"spelling bees" - nothing is stemmed.  But if I do a test for "bees spelling", 
I get "bee spelling", the expected result with "bees" stemmed but "spelling" 
left unstemmed.  I have tried longer examples, and in every case all of 
the words before "spelling" get stemmed, but none of the words after 
"spelling" do.  With the verbose mode of the analysis tool turned on, 
I can see that the "keyword" attribute set by 
solr.KeywordMarkerFilterFactory matches the stemming behavior, so 
I think solr.KeywordMarkerFilterFactory is to blame, and not 
anything later in the analysis chain.

Any idea what might be going wrong?  Is this a known issue?

Here is my field type definition, in case it makes a difference:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
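
One way to narrow this down might be a stripped-down field type that keeps 
only a tokenizer, the keyword marker, and the stemmer.  The following is an 
untested sketch, not a confirmed fix: the name "text_kwtest" is hypothetical, 
and solr.WhitespaceTokenizerFactory is swapped in for the ICU tokenizer purely 
to take the ICU components out of the picture.

```xml
<!-- Hypothetical diagnostic field type; "text_kwtest" is an assumed name. -->
<fieldType name="text_kwtest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: a plain whitespace tokenizer replaces solr.ICUTokenizerFactory. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same protected-word list as the full field type above. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

If "spelling bees" comes out as "spelling bee" with this chain, the keyword 
marker itself is probably fine and one of the removed components would be the 
next thing to look at; if "bees" is still left unstemmed, the problem would 
seem to lie in the marker, the stemmer, or the analysis tool.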

thanks,
Demian
