That does not address the question. With a bigram-only field, a
single-ideogram query will not find ideograms in the middle of phrases.

I have also found that phrase slop does not work with bigrams at all,
so I created a separate field type with unigrams. These CJK fields use
the StandardAnalyzer (SA): I made an analysis stack with just the SA,
which gives raw European text and single terms for CJK ideograms. This
worked well for exact phrase and phrase slop queries. You should use
both kinds of fields: the bigram search helps boost similar phrases.
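
To make the single-ideogram problem concrete, here is a toy sketch in
Python. It is not Lucene or Solr code: the function names are made up
for illustration, and the "index" is just a set of terms. It assumes
the bigram field emits overlapping two-character terms and a lone
character only when there is nothing to pair it with, as Tom describes
below.

```python
def unigrams(text):
    """One term per character (toy stand-in for a unigram field)."""
    return list(text)


def bigrams(text):
    """Overlapping two-character terms; a lone character is emitted as-is."""
    if len(text) < 2:
        return list(text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query, doc, tokenize):
    """True if every query term occurs among the document's terms."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))


doc = "革命歌"
print(bigrams(doc))                    # ['革命', '命歌']
print(matches("命", doc, bigrams))     # False: no bare '命' term was indexed
print(matches("命", doc, unigrams))    # True: the unigram field has '命'
print(matches("革命", doc, bigrams))   # True: bigram query matches as usual
```

The single-character query misses on the bigram terms and hits on the
unigram terms, which is why it seems worth keeping both fields.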

You should also try the SmartChineseAnalyzer and the new Japanese
analyzer suite. I've discovered that CJK search is a very tricky thing,
and different use cases call for different strategies.

On Fri, Apr 27, 2012 at 10:57 AM, Walter Underwood
<wun...@wunderwood.org> wrote:
> Bigrams across character types seem like a useful thing, especially for
> indexing adjective and verb endings.
>
> An n-gram approach is always going to generate a lot of junk along with the 
> gold. Tighten the rules and good stuff is missed, guaranteed. The only way to 
> sort it out is to use a tokenizer with some linguistic rules.
>
> wunder
>
> On Apr 27, 2012, at 10:43 AM, Burton-West, Tom wrote:
>
>> I have a few questions about the CJKBigram filter.
>>
>> About 10% of our queries that contain Han characters are single character 
>> queries.   It looks like the CJKBigram filter only outputs single characters 
>> when there are no adjacent bigrammable characters in the input.   This means 
>> we would have to create a separate field to index Han unigrams in order to 
>> address single character queries.  Is this correct?
>>
>> For Japanese, the default settings form bigrams across character types.  So 
>> for a string containing Hiragana and Han characters bigrams containing a 
>> mixture of Hiragana and Han characters are formed:
>> いろは革命歌   =>   “いろ” “ろは” “は革” “革命” “命歌”
>>
>> Is there a way to specify that you don’t want bigrams across character types?
>>
>> Tom
>>
>> Tom Burton-West
>> Digital Library Production Service
>> University of Michigan Library
>>
>> http://www.hathitrust.org/blogs/large-scale-search
>>

-- 
Lance Norskog
goks...@gmail.com