Hello. It has been a few weeks, and I haven't gotten any responses. Perhaps my question is too complicated -- maybe a better approach is to try to gain enough knowledge to answer it myself. My gut feeling is still that it's something to do with the way term positions are getting handled by the WordDelimiterFilterFactory, but I don't have a good understanding of how term positions are calculated or factored into searching. Can anyone recommend some good reading that covers these concepts in more detail?
thanks,
Demian

From: Demian Katz
Sent: Tuesday, March 16, 2010 9:47 AM
To: solr-user@lucene.apache.org
Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?

This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what.

The problem is that I have a record with the title "Love customs in eighteenth-century Spain." Depending on how I search for this, I get successes or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration tool, just to eliminate any VuFind-related factors from the equation while debugging.

Queries that work:

    title:(Love customs in eighteenth century Spain)     // no hyphen, no phrases
    title:("Love customs in eighteenth-century Spain")   // phrase search on whole title, with hyphen

Queries that fail:

    title:(Love customs in eighteenth-century Spain)     // hyphen, no phrases
    title:("Love customs in eighteenth century Spain")   // phrase search on whole title, without hyphen
    title:(Love customs in "eighteenth-century" Spain)   // hyphenated word as phrase
    title:(Love customs in "eighteenth century" Spain)   // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldType>

I did notice that the "text" field type in VuFind's schema has "catenateWords" and "catenateNumbers" turned on in both the index and query analyzer chains. It is my understanding that these options should be disabled for the query chain and only enabled for the index chain. However, this may be a red herring -- I have already tried changing this setting, but it didn't change the success/failure pattern described above. I have also played with the preserveOriginal setting without apparent effect.

From playing with the Field Analysis tool, I notice that there is a gap in the term position sequence after analysis... but I'm not sure if this is significant.
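
For reference, the query-side change I mentioned (turning the catenate options off) would look roughly like the line below. This is only a sketch: it assumes that just the catenateWords and catenateNumbers attributes change in the query analyzer, and that every other attribute and filter stays exactly as in the field type definition above.

```xml
<!-- Sketch only: query-chain WordDelimiterFilterFactory with catenation disabled.
     Assumption: all other attributes are unchanged from the definition above. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="1"/>
```

As noted, changing this setting did not alter which of the queries above succeed or fail.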
Has anybody else run into this sort of problem? Any ideas on a fix?

thanks,
Demian