Bigrams across character types seem like a useful thing, especially for 
indexing adjective and verb endings.

An n-gram approach is always going to generate a lot of junk along with the 
gold. Tighten the rules and good stuff is missed, guaranteed. The only way to 
sort it out is to use a tokenizer with some linguistic rules.
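
To make the trade-off concrete, here is a rough Python sketch of the two 
behaviors. It is not the Lucene code: the names script_of and bigrams are made 
up for illustration, and the Unicode ranges are a simplified assumption that 
covers only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks.

```python
def script_of(ch):
    """Rough script classification by Unicode range (illustrative, incomplete)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def bigrams(text, cross_types=True):
    """Return overlapping character bigrams.

    With cross_types=False, characters of different scripts are never paired,
    and a run of length one is emitted as a single-character token.
    """
    if cross_types:
        runs = [text]
    else:
        # Split the text into maximal runs of characters sharing one script.
        runs = []
        for ch in text:
            if runs and script_of(runs[-1][-1]) == script_of(ch):
                runs[-1] += ch
            else:
                runs.append(ch)

    out = []
    for run in runs:
        if len(run) == 1:
            out.append(run)
        else:
            out.extend(run[i:i + 2] for i in range(len(run) - 1))
    return out


print(bigrams("いろは革命歌"))                     # bigrams across types
print(bigrams("いろは革命歌", cross_types=False))  # bigrams within types only
print(bigrams("食べる"))
print(bigrams("食べる", cross_types=False))
```

For the string in the quoted message, the first call gives the five bigrams 
Tom lists, and the second drops "は革". For a verb like 食べる, the cross-type 
version keeps "食べ", which ties the stem to its ending. The within-type 
version leaves only "食" and "べる", so the stem becomes a bare unigram.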

wunder

On Apr 27, 2012, at 10:43 AM, Burton-West, Tom wrote:

> I have a few questions about the CJKBigram filter.
> 
> About 10% of our queries that contain Han characters are single character 
> queries.   It looks like the CJKBigram filter only outputs single characters 
> when there are no adjacent bigrammable characters in the input.   This means 
> we would have to create a separate field to index Han unigrams in order to 
> address single character queries.  Is this correct?
> 
> For Japanese, the default settings form bigrams across character types.  So 
> for a string containing Hiragana and Han characters, bigrams containing a 
> mixture of Hiragana and Han characters are formed:
> いろは革命歌   =>    “いろ” “ろは” “は革” “革命” “命歌”
> 
> Is there a way to specify that you don’t want bigrams across character types?
> 
> Tom
> 
> Tom Burton-West
> Digital Library Production Service
> University of Michigan Library
> 
> http://www.hathitrust.org/blogs/large-scale-search
> 




