On 4/10/2014 11:53 AM, Shawn Heisey wrote:
> My analysis chain includes CJKBigramFilter on both the index and query. 
> I have outputUnigrams enabled on the index side, but it is disabled on
> the query side.  This has resulted in a problem with phrase queries. 
> This is a subset of my index analysis for the three terms you can see in
> the ICUNF step, separated by spaces:
> 
> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
> 
> Note that in the CJKBF step, the second unigram is output at position 2,
> pushing the english terms to 3 and 4.
> 
> When the customer does a phrase filter query (lucene query parser) for the
> first two terms on this specific field, it doesn't match, because the
> query analysis doesn't output the unigrams and therefore the positions
> don't match.
> 
> I would have expected both unigrams to be at position 1.  Is this a bug
> or expected behavior?

It's been a week with no reply.

First I worked around this problem by disabling outputUnigrams on the
index side, to match the query side.  At that point, the customer was
unable to search for a single character and find longer strings
containing that character.  I knew this would happen ... I did tell our
project manager, but I do not know whether it was communicated to the
customer.

Then I tried setting outputUnigrams to true on both index and query.
Just as I had anticipated, the customer was unhappy with getting results
where a "word" containing only one character of their multi-character
search string was present.

Re-stating the underlying problem and my question:

The outputUnigrams option sets one of the unigrams from each bigram to
the same position as the bigram, but then puts the other one at the next
position, breaking phrase queries.  This sounds like a bug.  Is it a
bug?  If not, I would REALLY like a config option to produce the
behavior that I expected.
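
The mismatch can be illustrated with a toy model of token positions. This
is a minimal sketch in Python, not Lucene code: the helper phrase_matches
is hypothetical, "A" and "B" stand in for the two CJK characters, "foo"
and "bar" stand in for the two english terms, and the "expected" layout
assumes the english terms would land at positions 2 and 3 if both
unigrams shared position 1. It only models an exact phrase match with no
slop.

```python
# Toy model of token positions; this is NOT Lucene code.
# Terms and the helper below are hypothetical placeholders.

def phrase_matches(index_tokens, query_tokens):
    """Return True if every query term occurs in the index at the
    same relative position as in the query (exact phrase, no slop)."""
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in positions.get(term, ())
               for term, pos in query_tokens):
            return True
    return False

# Index side with outputUnigrams=true, as described above: the bigram and
# the first unigram share position 1, the second unigram lands at 2,
# pushing the english terms to 3 and 4.
observed_index = [("AB", 1), ("A", 1), ("B", 2), ("foo", 3), ("bar", 4)]

# Assumed layout if both unigrams were at position 1.
expected_index = [("AB", 1), ("A", 1), ("B", 1), ("foo", 2), ("bar", 3)]

# Query side with outputUnigrams=false: only the bigram is emitted.
query = [("AB", 1), ("foo", 2)]

print(phrase_matches(observed_index, query))  # False
print(phrase_matches(expected_index, query))  # True
```

In the observed layout the bigram and the first english term are two
positions apart, while the query expects them to be adjacent, so the
phrase cannot match.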

Thanks,
Shawn
