On 4/10/2014 11:53 AM, Shawn Heisey wrote:
> My analysis chain includes CJKBigramFilter on both the index and query.
> I have outputUnigrams enabled on the index side, but it is disabled on
> the query side. This has resulted in a problem with phrase queries.
> This is a subset of my index analysis for the three terms you can see in
> the ICUNF step, separated by spaces:
>
> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
>
> Note that in the CJKBF step, the second unigram is output at position 2,
> pushing the English terms to 3 and 4.
>
> When the customer sends a phrase filter query (lucene query parser) for
> the first two terms on this specific field, it doesn't match, because the
> query analysis doesn't output the unigrams and therefore the positions
> don't match.
>
> I would have expected both unigrams to be at position 1. Is this a bug
> or expected behavior?
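
To make the position mismatch in the quoted message concrete, here is a minimal sketch in Python. It does not call Lucene or Solr. It only hand-models the positions described above, so the function names (analyze, phrase_matches) and the placeholder terms ("AB" standing in for a two-character CJK term, "foo" and "bar" for the English terms) are assumptions made up for illustration.

```python
def analyze(terms, output_unigrams):
    """Hand-modelled token positions, NOT real Lucene output.

    terms: list of (text, is_cjk) pairs; each character of a CJK term
    is treated as one ideograph. Returns a list of (token, position).
    """
    out = []
    pos = 0
    for text, is_cjk in terms:
        if not is_cjk or len(text) == 1:
            # Non-CJK terms (and lone CJK characters) pass through as-is.
            pos += 1
            out.append((text, pos))
        elif output_unigrams:
            # Each unigram takes a new position; the bigram that starts
            # at that character is stacked on the same position.
            for i, ch in enumerate(text):
                pos += 1
                out.append((ch, pos))
                if i + 1 < len(text):
                    out.append((text[i:i + 2], pos))
        else:
            # Bigrams only, one position per bigram.
            for i in range(len(text) - 1):
                pos += 1
                out.append((text[i:i + 2], pos))
    return out


def phrase_matches(index_tokens, query_tokens):
    """Exact phrase check (no slop): every query token must appear in the
    index at the same relative position."""
    indexed = set(index_tokens)
    first_term, first_pos = query_tokens[0]
    for term, pos in index_tokens:
        if term != first_term:
            continue
        shift = pos - first_pos
        if all((t, p + shift) in indexed for t, p in query_tokens):
            return True
    return False


doc = [("AB", True), ("foo", False), ("bar", False)]  # placeholder terms

index_side = analyze(doc, output_unigrams=True)
query_side = analyze(doc[:2], output_unigrams=False)  # phrase: first two terms

print(index_side)
print(query_side)
print(phrase_matches(index_side, query_side))
```

In this model the index side puts "foo" at position 3 (the second unigram occupies position 2), while the query side expects "foo" at position 2 right after the bigram, so the phrase check fails. With outputUnigrams disabled on both sides the positions line up again.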

It's been a week with no reply.

First I worked around this problem by disabling outputUnigrams on the index side, to match the query side. At that point, the customer was unable to search for a single character and find longer strings containing that character. I knew this would happen ... I did tell our project manager, but I do not know whether it was communicated to the customer.

Then I tried setting outputUnigrams to true on both index and query. Just as I had anticipated, the customer was unhappy with getting results where a "word" containing only one character of their multi-character search string was present.

Re-stating the underlying problem and my question: The outputUnigrams option sets one of the unigrams from each bigram to the same position as the bigram, but then puts the other one at the next position, breaking phrase queries. This sounds like a bug. Is it a bug? If not, I would REALLY like a config option to produce the behavior that I expected.

Thanks,
Shawn