On 4/2/2014 2:19 PM, Tom Burton-West wrote:
Hi Shawn,

I may still be missing your point.  Below is an example where the
ICUTokenizer splits.
Now I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams, and the CJKBigramFilter will put them
back together into bigrams.

I thought that if you set han=true and hiragana=true,
you would get this kind of result, where the third bigram is composed of a
hiragana and a han character.

It looks like you are right. I did not notice that the bigram filter was putting the tokens back together, even though the tokenizer was splitting them apart. I might be worrying over nothing! Thank you for taking the time to point out the obvious.

I did notice something odd, though. Keep in mind that I have absolutely no idea what I am writing here, so I have no idea if this is valid at all:

For an input of 田中角栄, the bigram filter works as you described, and as I would expect. If I add a space at the point where the ICU tokenizer would have split the text anyway, the bigram filter output is very different. Best guess: it notices that the end/start offset values from the original input are not consecutive, and therefore doesn't combine the tokens across that gap. Like I said above, I may have nothing at all to worry about here.
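For reference, the analysis chain we've been discussing can be expressed in schema.xml roughly like this (a sketch; the fieldType name is made up, and I've left the other script flags at their defaults):

```xml
<!-- Hypothetical field type: ICUTokenizer emits CJK unigrams,
     then CJKBigramFilter recombines adjacent han/hiragana
     unigrams into bigrams. -->
<fieldType name="text_cjk_bigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"/>
  </analyzer>
</fieldType>
```

The analysis tab in the Solr admin UI is an easy way to feed input like 田中角栄 (with and without the space) through this chain and compare the term, position, and offset attributes at each stage.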

Thanks,
Shawn
