Bigrams across character types seem like a useful thing, especially for indexing adjective and verb endings.
An n-gram approach is always going to generate a lot of junk along with the gold. Tighten the rules and good stuff is missed, guaranteed. The only way to sort it out is to use a tokenizer with some linguistic rules.

wunder

On Apr 27, 2012, at 10:43 AM, Burton-West, Tom wrote:

> I have a few questions about the CJKBigram filter.
>
> About 10% of our queries that contain Han characters are single character
> queries. It looks like the CJKBigram filter only outputs single characters
> when there are no adjacent bigrammable characters in the input. This means
> we would have to create a separate field to index Han unigrams in order to
> address single character queries. Is this correct?
>
> For Japanese, the default settings form bigrams across character types. So
> for a string containing Hiragana and Han characters, bigrams containing a
> mixture of Hiragana and Han characters are formed:
> いろは革命歌 => "いろ" "ろは" "は革" "革命" "命歌"
>
> Is there a way to specify that you don't want bigrams across character types?
>
> Tom
>
> Tom Burton-West
> Digital Library Production Service
> University of Michigan Library
>
> http://www.hathitrust.org/blogs/large-scale-search
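The across-types behavior Tom describes can be illustrated with a small sketch. This is not the Lucene CJKBigramFilter implementation; it is a simplified model in which character classes are approximated by Unicode block ranges, and an invented `across_types` flag shows what suppressing mixed-script bigrams would look like:

```python
import unicodedata  # noqa: F401  (stdlib; handy for inspecting characters)

def char_type(ch):
    """Rough script class by Unicode block (an approximation, not
    Lucene's actual character-type logic)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"

def bigrams(text, across_types=True):
    """Emit adjacent character bigrams. With across_types=False,
    skip pairs whose two characters belong to different scripts."""
    out = []
    for a, b in zip(text, text[1:]):
        if across_types or char_type(a) == char_type(b):
            out.append(a + b)
    return out

# Default behavior matches the example in the mail:
# いろは革命歌 => いろ ろは は革 革命 命歌
print(bigrams("いろは革命歌"))
# Hypothetical across-types suppression drops は革 (hiragana+han):
print(bigrams("いろは革命歌", across_types=False))
```

The sketch makes the trade-off concrete: suppressing mixed-script bigrams removes tokens like は革 that straddle a Hiragana/Han boundary, at the cost of losing any match that spans that boundary.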