RE: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Demian Katz Wed, 07 Apr 2010 07:32:36 -0700

Hello.  It has been a few weeks, and I haven't gotten any responses.  Perhaps 
my question is too complicated -- maybe a better approach is to try to gain 
enough knowledge to answer it myself.  My gut feeling is still that it's 
something to do with the way term positions are getting handled by the 
WordDelimiterFilterFactory, but I don't have a good understanding of how term 
positions are calculated or factored into searching.  Can anyone recommend some 
good reading to familiarize myself with these concepts in better detail?

thanks,
Demian

From: Demian Katz
Sent: Tuesday, March 16, 2010 9:47 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?

This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in 
eighteenth-century Spain."  Depending on how I search for this, I get successes 
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)    // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")  // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)    // hyphen, no phrases
title:("Love customs in eighteenth century Spain")  // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)  // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)  // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
                version="icu4j" composed="false" remove_diacritics="true"
                remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory"
                version="icu4j" composed="false" remove_diacritics="true"
                remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldType>

I did notice that the "text" field type in VuFind's schema has 
"catenateWords" and "catenateNumbers" turned on in both the index and query 
analyzer chains.  It is my understanding that these options should be disabled 
for the query chain and only enabled for the index chain.  However, this may be 
a red herring -- I have already tried changing this setting, but it didn't 
change the success/failure pattern described above.  I have also played with 
the preserveOriginal setting without apparent effect.
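
For reference, here is a sketch of the query-side filter with catenation 
disabled, as I understand the usual recommendation.  Only the two catenate 
values differ from the definition above; the index-side filter is assumed to 
stay unchanged:

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>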

From playing with the Field Analysis tool, I notice that there is a gap in the 
term position sequence after analysis...  but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian
