Re: solr.WordDelimiterFilterFactory problem with hyphenated terms?

Erick Erickson Wed, 07 Apr 2010 19:04:39 -0700

Well, for a quick trial using trunk, I had to remove the
UnicodeNormalizationFilterFactory; is that yours?


But with that removed, I get the results you do, ASSUMING that you've set
your default operator to AND in schema.xml...
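
For reference, a Solr 1.4-era schema.xml normally sets the default operator
through the solrQueryParser element. A minimal sketch, assuming nothing else in
the schema is changed:

```xml
<!-- schema.xml: every query term is required unless OR is given explicitly -->
<solrQueryParser defaultOperator="AND"/>
```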

Believe it or not, it all changes and all your queries return a hit if you
do one of two things (I did this in both index and query when testing 'cause
I'm lazy):
1> move the StopFilterFactory so that it comes after WordDelimiterFilterFactory
or
2> for StopFilterFactory, set enablePositionIncrements="false"

I think either of these might work in your situation.
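
As an illustration of the first option, here is a sketch of the index analyzer
from the original post with only the filter order changed. The attribute values
are copied from that post. The query analyzer would presumably need the same
reordering (with its SynonymFilterFactory left first), since the test above
changed both chains:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Option 1: WordDelimiterFilterFactory now runs before the stop filter -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- remaining filters from the original chain follow here, unchanged -->
</analyzer>
```

Any change to the index analyzer only takes effect for documents indexed after
the change, so existing records need to be reindexed before retesting.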

On doing some more investigation, it appears that if a hyphenated word comes
immediately after a stopword AND the above is true (StopFilterFactory placed
before WordDelimiterFilterFactory and enablePositionIncrements="true"), then the
search fails. I indexed this title:

Love-customs in eighteenth-century Spain for nineteenth-century

Searching in solr/admin/form.jsp for:
title:(nineteenth-century)

fails. But if I remove the "for" from the title, the above query works.
Searching for
title:(love-customs)
always works.
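
To reproduce this, a test document along the following lines could be posted to
Solr's XML update handler. This is only a sketch: the "id" field name and its
value are assumptions about the schema, and it assumes "title" uses the "text"
field type quoted below.

```xml
<add>
  <doc>
    <!-- hypothetical unique key; adjust to whatever the schema actually uses -->
    <field name="id">hyphen-test-1</field>
    <!-- "for" is the stopword immediately before the hyphenated term -->
    <field name="title">Love-customs in eighteenth-century Spain for nineteenth-century</field>
  </doc>
</add>
```

A commit is needed afterwards before the document becomes searchable.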

Finally, (and it's *really* time to go to sleep now), just setting
enablePositionIncrements="false" in the "index" portion of the schema also
causes things to work.
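
A sketch of that index-only variant, again using the attribute values from the
original post and leaving the query analyzer as it was:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- stop words are dropped without leaving a gap in the position sequence -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="false"/>
  <!-- WordDelimiterFilterFactory and the remaining filters follow, unchanged -->
</analyzer>
```

One trade-off to keep in mind: with position increments disabled, the words on
either side of a removed stopword are indexed as adjacent, which can change how
phrase queries match.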

Developer folks:
I didn't see anything in a quick look in SOLR or Lucene JIRAs. Should I
refine this a bit (really, sleepy time is near) and open a JIRA issue?

Best
Erick

On Wed, Apr 7, 2010 at 10:29 AM, Demian Katz <demian.k...@villanova.edu> wrote:

> Hello.  It has been a few weeks, and I haven't gotten any responses.
>  Perhaps my question is too complicated -- maybe a better approach is to try
> to gain enough knowledge to answer it myself.  My gut feeling is still that
> it's something to do with the way term positions are getting handled by the
> WordDelimiterFilterFactory, but I don't have a good understanding of how
> term positions are calculated or factored into searching.  Can anyone
> recommend some good reading to familiarize myself with these concepts in
> better detail?
>
> thanks,
> Demian
>
> From: Demian Katz
> Sent: Tuesday, March 16, 2010 9:47 AM
> To: solr-user@lucene.apache.org
> Subject: solr.WordDelimiterFilterFactory problem with hyphenated terms?
>
> This is my first post on this list -- apologies if this has been discussed
> before; I didn't come upon anything exactly equivalent in searching the
> archives via Google.
>
> I'm using Solr 1.4 as part of the VuFind application, and I just noticed
> that searches for hyphenated terms are failing in strange ways.  I strongly
> suspect it has something to do with the solr.WordDelimiterFilterFactory
> filter, but I'm not exactly sure what.
>
> The problem is that I have a record with the title "Love customs in
> eighteenth-century Spain."  Depending on how I search for this, I get
> successes or failures in a seemingly unpredictable pattern.
>
> Demonstration queries below were tested using the direct Solr
> administration tool, just to eliminate any VuFind-related factors from the
> equation while debugging.
>
> Queries that work:
> title:(Love customs in eighteenth century Spain)
>     // no hyphen, no phrases
> title:("Love customs in eighteenth-century Spain")
>     // phrase search on whole title, with hyphen
>
> Queries that fail:
> title:(Love customs in eighteenth-century Spain)
>     // hyphen, no phrases
> title:("Love customs in eighteenth century Spain")
>     // phrase search on whole title, without hyphen
> title:(Love customs in "eighteenth-century" Spain)
>     // hyphenated word as phrase
> title:(Love customs in "eighteenth century" Spain)
>     // hyphenated word as phrase, hyphen removed
>
> Here is VuFind's text field type definition:
>
>    <fieldType name="text" class="solr.TextField"
>        positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>            words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="1" generateNumberParts="1" catenateWords="1"
>            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
>            protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="schema.UnicodeNormalizationFilterFactory"
>            version="icu4j" composed="false" remove_diacritics="true"
>            remove_modifiers="true" fold="true"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>            ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>            words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="1" generateNumberParts="1" catenateWords="1"
>            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
>            protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        <filter class="schema.UnicodeNormalizationFilterFactory"
>            version="icu4j" composed="false" remove_diacritics="true"
>            remove_modifiers="true" fold="true"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> I did notice that the "text" field type in VuFind's schema has
> "catenateWords" and "catenateNumbers" turned on in both the index and query
> analyzer chains.  It is my understanding that these options should be
> disabled for the query chain and only enabled for the index chain.  However,
> this may be a red herring -- I have already tried changing this setting, but
> it didn't change the success/failure pattern described above.  I have also
> played with the preserveOriginal setting without apparent effect.
>
> From playing with the Field Analysis tool, I notice that there is a gap in
> the term position sequence after analysis...  but I'm not sure if this is
> significant.
>
> Has anybody else run into this sort of problem?  Any ideas on a fix?
>
> thanks,
> Demian
>
>
