Re: Another japanese analysis problem

Alexandre Rafalovitch Thu, 17 Apr 2014 23:06:07 -0700

Did you read through the CJK article series? Maybe there is something
in there? 
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html


Sorry, no help on actual Japanese.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 18, 2014 at 12:50 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 4/10/2014 11:53 AM, Shawn Heisey wrote:
>> My analysis chain includes CJKBigramFilter on both the index and query.
>> I have outputUnigrams enabled on the index side, but it is disabled on
>> the query side.  This has resulted in a problem with phrase queries.
>> This is a subset of my index analysis for the three terms you can see in
>> the ICUNF step, separated by spaces:
>>
>> https://www.dropbox.com/s/9q1x9pdbsjhzocg/bigram-position-problem.png
>>
>> Note that in the CJKBF step, the second unigram is output at position 2,
>> pushing the English terms to 3 and 4.
>>
>> When the customer runs a phrase filter query (Lucene query parser) for the
>> first two terms on this specific field, it doesn't match, because the
>> query analysis doesn't output the unigrams and therefore the positions
>> don't match.
>>
>> I would have expected both unigrams to be at position 1.  Is this a bug
>> or expected behavior?
>
> It's been a week with no reply.
>
> First I worked around this problem by disabling outputUnigrams on the
> index side, to match the query side.  At that point, the customer was
> unable to search for a single character and find longer strings
> containing that character.  I knew this would happen ... I did tell our
> project manager, but I do not know whether it was communicated to the
> customer.
>
> Then I tried setting outputUnigrams to true on both index and query.
> Just as I had anticipated, the customer was unhappy with getting results
> where a "word" containing only one character of their multi-character
> search string was present.
>
> Re-stating the underlying problem and my question:
>
> The outputUnigrams option sets one of the unigrams from each bigram to
> the same position as the bigram, but then puts the other one at the next
> position, breaking phrase queries.  This sounds like a bug.  Is it a
> bug?  If not, I would REALLY like a config option to produce the
> behavior that I expected.
>
> Thanks,
> Shawn
>
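
For anyone following along, the position mismatch described in the quoted
message can be modeled with a toy sketch. This is plain Python, not Lucene
code. The terms ("東京", "foo", "bar") are made up for illustration, and the
positions are copied from the description above rather than from the actual
analysis output, so treat it as an assumption-laden model of the symptom.

```python
def phrase_matches(index_tokens, query_tokens):
    """Toy exact-phrase check: True if every query term appears in the
    index at the same relative position as in the query."""
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        shift = start - first_pos
        if all(pos + shift in positions.get(term, ())
               for term, pos in query_tokens):
            return True
    return False


# Index side, outputUnigrams=true, as reported: the second unigram
# lands on position 2 and pushes the English terms to 3 and 4.
index_as_reported = [("東京", 1), ("東", 1), ("京", 2), ("foo", 3), ("bar", 4)]

# Index side as the poster expected: both unigrams share position 1.
index_as_expected = [("東京", 1), ("東", 1), ("京", 1), ("foo", 2), ("bar", 3)]

# Query side, outputUnigrams=false: only the bigram, then the next term.
query_bigrams_only = [("東京", 1), ("foo", 2)]

print(phrase_matches(index_as_reported, query_bigrams_only))
print(phrase_matches(index_as_expected, query_bigrams_only))
```

In this model the phrase fails against the reported positions because the
gap between the bigram and the next term is 2 in the index but 1 in the
query, and it succeeds once both unigrams sit at the bigram's position.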
