On 7/8/2015 4:01 PM, Jack Krupansky wrote:
> In Lucene 4.8, LUCENE-5111: Fix WordDelimiterFilter offsets
>
> https://issues.apache.org/jira/browse/LUCENE-5111
>
> Make sure the documents are queried and indexed with the same Lucene match
> version.

I have updated the luceneMatchVersion on the 4.9.1 version to LUCENE_47 and am now reindexing to see if that helps.

I discovered that I had some information backwards in my previous messages: it is *index* time analysis that differs. Query time analysis is the same across versions.

The reindex may very well fix this problem, but luceneMatchVersion is a band-aid, and I think there is a bug to be fixed. I have no doubt that LUCENE-5111 fixed a real issue, but I think it also caused some new problems.

When faced with text like "aaa-bbb", the original term (created by preserveOriginal) ends up at relative position 1. Prior to this fix, the next terms were "aaa" at position 1 and "bbb" at position 2. The "aaabbb" term created by the catenation option also ended up at position 2. This arrangement makes perfect sense to me.

After the fix (with luceneMatchVersion at 4.9), both "aaa" and "bbb" end up at position 2. I can't see how these positions are logical. They break phrase queries on my index, because the query-time analysis puts these two terms at positions 1 and 2.

The WDF options I chose seemed logical to me when I made them (about four years ago), but I admit that I don't remember the exact motivation behind those choices. You can find the entire fieldType definition in a previous message on this thread. The two analysis chains are the same except for WDF options. Should I use different options?

Index-time options:

  <filter class="solr.WordDelimiterFilterFactory"
    splitOnCaseChange="1"
    splitOnNumerics="1"
    stemEnglishPossessive="1"
    generateWordParts="1"
    generateNumberParts="1"
    catenateWords="1"
    catenateNumbers="1"
    catenateAll="0"
    preserveOriginal="1"
  />

Query-time options:

  <filter class="solr.WordDelimiterFilterFactory"
    splitOnCaseChange="1"
    splitOnNumerics="1"
    stemEnglishPossessive="1"
    generateWordParts="1"
    generateNumberParts="1"
    catenateWords="0"
    catenateNumbers="0"
    catenateAll="0"
    preserveOriginal="0"
  />

Thanks,
Shawn
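
P.S. To make the position problem concrete, here is a minimal Python sketch. It is not Lucene code: the helper phrase_matches and the position dictionaries are hypothetical, built only from the positions described above, and it assumes an exact phrase query with no slop. The position of "aaabbb" after the fix is not stated in this message, so it is left out of the second dictionary.

```python
def phrase_matches(index_positions, query_terms):
    """Return True if the query terms occur in the index at the same
    relative offsets as in the query (exact phrase, no slop assumed).

    index_positions: dict mapping term -> set of index positions
    query_terms: list of (term, query_position) tuples
    """
    first_term, first_pos = query_terms[0]
    # Try every indexed occurrence of the first query term as the phrase start.
    for start in index_positions.get(first_term, ()):
        if all(
            start + (pos - first_pos) in index_positions.get(term, ())
            for term, pos in query_terms
        ):
            return True
    return False


# Index-time positions for "aaa-bbb" as described in the message.
before_fix = {"aaa-bbb": {1}, "aaa": {1}, "bbb": {2}, "aaabbb": {2}}
# After LUCENE-5111 (luceneMatchVersion 4.9); "aaabbb" omitted, position not stated.
after_fix = {"aaa-bbb": {1}, "aaa": {2}, "bbb": {2}}

# Query-time analysis puts "aaa" at position 1 and "bbb" at position 2.
query = [("aaa", 1), ("bbb", 2)]

print(phrase_matches(before_fix, query))  # True
print(phrase_matches(after_fix, query))   # False
```

With the old positions, "bbb" follows "aaa" by one position and the phrase matches. With the new positions both terms sit at the same position, so the one-position gap the query expects is never found.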