Well, for a quick trial using trunk, I had to remove the UnicodeNormalizationFilterFactory; is that yours?
But with that removed, I get the results you do, ASSUMING that you've set your default operator to AND in schema.xml...

Believe it or not, it all changes and all your queries return a hit if you do one of two things (I did this in both index and query when testing 'cause I'm lazy):

1> move the inclusion of the StopFilterFactory after WordDelimiterFilterFactory
or
2> for StopFilterFactory, set enablePositionIncrements="false"

I think either of these might work in your situation...

On doing some more investigation, it appears that if a hyphenated word is immediately after a stopword AND the above is true (StopFilterFactory included before WordDelimiterFilterFactory and enablePositionIncrements="true"), then the search fails.

I indexed this title:
Love-customs in eighteenth-century Spain for nineteenth-century

Searching in solr/admin/form.jsp for:
title:(nineteenth-century)
fails. But if I remove the "for" from the title, the above query works. Searching for title:(love-customs) always works.

Finally (and it's *really* time to go to sleep now), just setting enablePositionIncrements="false" in the "index" portion of the schema also causes things to work.

Developer folks: I didn't see anything in a quick look in the SOLR or Lucene JIRAs. Should I refine this a bit (really, sleepy time is near) and add a JIRA?

Best
Erick

On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:

> Hello. It has been a few weeks, and I haven't gotten any responses.
> Perhaps my question is too complicated -- maybe a better approach is to try
> to gain enough knowledge to answer it myself. My gut feeling is still that
> it's something to do with the way term positions are getting handled by the
> WordDelimiterFilterFactory, but I don't have a good understanding of how
> term positions are calculated or factored into searching. Can anyone
> recommend some good reading to familiarize myself with these concepts in
> better detail?
>
> thanks,
> Demian
>
> From: Demian Katz
> Sent: Tuesday, March 16, 2010 9:47 AM
> To: solr-user@lucene.apache.org
> Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
>
> This is my first post on this list -- apologies if this has been discussed
> before; I didn't come upon anything exactly equivalent in searching the
> archives via Google.
>
> I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> that searches for hyphenated terms are failing in strange ways. I strongly
> suspect it has something to do with the solr.WordDelimiterFilterFactory
> filter, but I'm not exactly sure what.
>
> The problem is that I have a record with the title "Love customs in
> eighteenth-century Spain." Depending on how I search for this, I get
> successes or failures in a seemingly unpredictable pattern.
>
> Demonstration queries below were tested using the direct Solr
> administration tool, just to eliminate any VuFind-related factors from the
> equation while debugging.
>
> Queries that work:
>   title:(Love customs in eighteenth century Spain)
>     // no hyphen, no phrases
>   title:("Love customs in eighteenth-century Spain")
>     // phrase search on whole title, with hyphen
>
> Queries that fail:
>   title:(Love customs in eighteenth-century Spain)
>     // hyphen, no phrases
>   title:("Love customs in eighteenth century Spain")
>     // phrase search on whole title, without hyphen
>   title:(Love customs in "eighteenth-century" Spain)
>     // hyphenated word as phrase
>   title:(Love customs in "eighteenth century" Spain)
>     // hyphenated word as phrase, hyphen removed
>
> Here is VuFind's text field type definition:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="schema.UnicodeNormalizationFilterFactory"
>             version="icu4j" composed="false" remove_diacritics="true"
>             remove_modifiers="true" fold="true"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="schema.UnicodeNormalizationFilterFactory"
>             version="icu4j" composed="false" remove_diacritics="true"
>             remove_modifiers="true" fold="true"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I did notice that the "text" field type in VuFind's schema has
> "catenateWords" and "catenateNumbers" turned on in both the index and query
> analyzer chains. It is my understanding that these options should be
> disabled for the query chain and only enabled for the index chain. However,
> this may be a red herring -- I have already tried changing this setting, but
> it didn't change the success/failure pattern described above. I have also
> played with the preserveOriginal setting without apparent effect.
>
> From playing with the Field Analysis tool, I notice that there is a gap in
> the term position sequence after analysis... but I'm not sure if this is
> significant.
>
> Has anybody else run into this sort of problem? Any ideas on a fix?
>
> thanks,
> Demian
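
For reference, the first workaround from the reply at the top of this thread (StopFilterFactory moved after WordDelimiterFilterFactory) might look roughly like this for the index analyzer. This is only a sketch: every filter and attribute is copied from the schema quoted above, only the order of the two filters changes, and it has not been run against Solr 1.4 or trunk here. The schema.UnicodeNormalizationFilterFactory line is a class from the original schema, not a stock Solr filter, and is kept only for completeness.

```xml
<!-- Sketch of workaround 1: same filters as the quoted index analyzer,
     with StopFilterFactory moved to run after WordDelimiterFilterFactory. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Hyphenated tokens are split first ... -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <!-- ... and stopwords are removed afterwards. -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <!-- Custom class from the original schema, not part of stock Solr. -->
  <filter class="schema.UnicodeNormalizationFilterFactory"
          version="icu4j" composed="false" remove_diacritics="true"
          remove_modifiers="true" fold="true"/>
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
</analyzer>
```

In the test described in the reply, the same reordering was applied to the query analyzer as well. For the second workaround, the filter order stays as originally posted and only the StopFilterFactory attribute changes to enablePositionIncrements="false"; the reply reports that making that change in the index analyzer alone was enough for the queries to match.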