Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

Shawn Heisey Wed, 08 Jul 2015 15:52:04 -0700

On 7/8/2015 4:01 PM, Jack Krupansky wrote:
> In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets
>
> https://issues.apache.org/jira/browse/LUCENE-5111
>
> Make sure the documents are queried and indexed with the same Lucene match
> version.


I have updated luceneMatchVersion to LUCENE_47 on the 4.9.1 version, and
I am now reindexing to see if that helps.

I discovered that I had some information backwards in my previous
messages -- it is *index* time analysis that differs.  Query time
analysis is the same across versions.  The reindex may very well fix
this problem, but luceneMatchVersion is a band-aid, and I think there is
a bug to be fixed.

I have no doubt that LUCENE-5111 fixed a real issue, but I think it also
caused some new problems.

When faced with text like "aaa-bbb", the original term (created by
preserveOriginal) ends up at relative position 1.  Before the fix, the
next terms were "aaa" at position 1 and "bbb" at position 2.  The
"aaabbb" term created by the catenation option also ended up at position
2.  This arrangement makes perfect sense to me.

After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end
up at position 2.  I can't see how it is logical to end up with these
positions.  It breaks phrase queries on my index because the query-time
analysis puts these two terms at positions 1 and 2.
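
To make the mismatch concrete, here is a minimal Python sketch of an
exact (slop 0) phrase match over the positions described above.  It is a
simplified model, not Lucene code: phrase_matches is a hypothetical
helper, and the position lists are copied from this message.  The
position of the catenated term after the fix is not stated above, so it
is left out of that list.

```python
def phrase_matches(indexed, query):
    """Exact (slop 0) phrase match on a simplified position model.

    indexed: (term, position) pairs produced by index-time analysis
    query:   (term, position) pairs produced by query-time analysis
    """
    # Map each indexed term to the set of positions where it occurs.
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)

    # Try to line up the first query term with each of its occurrences,
    # then require every query term at the same relative offset.
    first_term, first_pos = query[0]
    for start in positions.get(first_term, ()):
        shift = start - first_pos
        if all(pos + shift in positions.get(term, ()) for term, pos in query):
            return True
    return False


# Index-time positions for "aaa-bbb", as described in this message.
before_fix = [("aaa-bbb", 1), ("aaa", 1), ("bbb", 2), ("aaabbb", 2)]
after_fix = [("aaa-bbb", 1), ("aaa", 2), ("bbb", 2)]

# Query-time positions (no preserveOriginal, no catenation).
query = [("aaa", 1), ("bbb", 2)]

print(phrase_matches(before_fix, query))  # True
print(phrase_matches(after_fix, query))   # False
```

In this model the phrase needs "bbb" exactly one position after "aaa",
which holds for the old positions but not when both parts share
position 2.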

The WDF options I chose seemed logical to me when I made them (about
four years ago), but I admit that I don't remember the exact motivation
behind those choices.  You can find the entire fieldType definition in a
previous message on this thread.  The two analysis chains are the same
except for WDF options.  Should I use different options?

Index-time options:

        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />

Query-time options:

        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
        />
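
For a quick view of where the two option sets diverge, a small Python
sketch follows.  The dictionaries simply restate the attributes from the
two filter definitions above; the variable names are illustrative only.

```python
# WordDelimiterFilterFactory attributes, copied from the index-time
# and query-time filter definitions in this message.
index_opts = {
    "splitOnCaseChange": "1",
    "splitOnNumerics": "1",
    "stemEnglishPossessive": "1",
    "generateWordParts": "1",
    "generateNumberParts": "1",
    "catenateWords": "1",
    "catenateNumbers": "1",
    "catenateAll": "0",
    "preserveOriginal": "1",
}
query_opts = dict(
    index_opts,
    catenateWords="0",
    catenateNumbers="0",
    preserveOriginal="0",
)

# Keep only the options whose values differ: name -> (index, query).
differing = {
    name: (index_opts[name], query_opts[name])
    for name in index_opts
    if index_opts[name] != query_opts[name]
}

for name, (at_index, at_query) in differing.items():
    print(f"{name}: index={at_index} query={at_query}")
```

Only the catenation options and preserveOriginal differ; both are
enabled at index time and disabled at query time.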


Thanks,
Shawn
