On 4/2/2014 2:19 PM, Tom Burton-West wrote:
> Hi Shawn,
> I may still be missing your point. Below is an example where the
> ICUTokenizer splits
> Now, I'm beginning to wonder if I really understand what those flags
> on the CJKBigramFilter do.
> The ICUTokenizer spits out unigrams and the CJKBigramFilter will put
> them back together into bigrams.
> I thought if you set han=true, hiragana=true, you would get this kind
> of result, where the third bigram is composed of a hiragana and a han
> character.
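For anyone following along, the analyzer chain under discussion is usually wired up in schema.xml something like this (a sketch only; it assumes Solr with the ICU analysis module on the classpath, and the fieldType name is a placeholder):

```xml
<!-- Hypothetical schema.xml fragment: ICU tokenization plus CJK bigrams.
     han/hiragana control which scripts get paired into bigrams. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```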
It looks like you are right. I did not notice that the bigram filter
was putting the tokens back together, even though the tokenizer was
splitting them apart. I might be worrying over nothing! Thank you for
taking the time to point out the obvious.
I did notice something odd, though. Keep in mind that I have absolutely
no idea what I am writing here, so I have no idea if this is valid at all:
For an input of 田中角栄, the bigram filter works as you described, and
as I would expect. If I add a space at the point where the ICU
tokenizer would have split the text anyway, the bigram filter's output
is very different. Best guess: it notices that the end/start offsets
from the original input are not consecutive, and therefore doesn't
combine those tokens.
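As a toy illustration of that guess (this is not Lucene's actual code; the function name and the (text, start, end) token shape are made up for the sketch), a filter that only pairs tokens whose offsets touch might look like:

```python
def bigram_if_adjacent(tokens):
    """Toy model of the guessed behavior: emit a bigram for each pair of
    offset-adjacent tokens, and emit a token as a unigram only when it
    pairs with neither neighbor. Tokens are (text, start, end) tuples."""
    out = []
    n = len(tokens)
    for i, (text, start, end) in enumerate(tokens):
        prev_adjacent = i > 0 and tokens[i - 1][2] == start
        next_adjacent = i < n - 1 and end == tokens[i + 1][1]
        if next_adjacent:
            # Offsets touch: combine with the following token.
            out.append(text + tokens[i + 1][0])
        elif not prev_adjacent:
            # No neighbor on either side: fall back to a unigram.
            out.append(text)
    return out

# "田中角栄" tokenized as unigrams with consecutive offsets:
print(bigram_if_adjacent([("田", 0, 1), ("中", 1, 2), ("角", 2, 3), ("栄", 3, 4)]))
# → ['田中', '中角', '角栄']

# "田中 角栄": the space leaves a gap between 中 (end 2) and 角 (start 3),
# so no bigram bridges it:
print(bigram_if_adjacent([("田", 0, 1), ("中", 1, 2), ("角", 3, 4), ("栄", 4, 5)]))
# → ['田中', '角栄']
```

With the space, the middle bigram disappears, which matches the difference I saw.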
Like I said above, I may have nothing at all to worry about here.
Thanks,
Shawn