On 4/2/2014 2:19 PM, Tom Burton-West wrote:
> Hi Shawn,
> I may still be missing your point. Below is an example where the
> ICUTokenizer splits
> Now, I'm beginning to wonder if I really understand what those flags
> on the CJKBigramFilter do.
> The ICUTokenizer spits out unigrams and the CJKBigramFilter will put
> them back together into bigrams.
> I thought if you set han=true, hiragana=true, you would get this kind
> of result, where the third bigram is composed of a hiragana and a han
> character.
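For anyone following along, the analyzer chain under discussion is usually wired up in schema.xml something like this (a sketch only; it assumes Solr with the ICU analysis module on the classpath, and the fieldType name is a placeholder):

```xml
<!-- Hypothetical schema.xml fragment: ICU tokenization plus CJK bigrams.
     han/hiragana control which scripts get paired into bigrams. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true"
            katakana="false" hangul="false"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```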
It looks like you are right. I did not notice that the bigram filter
was putting the tokens back together, even though the tokenizer was
splitting them apart. I might be worrying over nothing! Thank you for
taking the time to point out the obvious.
I did notice something odd, though. Keep in mind that I have absolutely
no idea what I am writing here, so I have no idea if this is valid at all:
For an input of 田中角栄, the bigram filter works as you described, and
as I would expect. If I add a space at the point where the ICU
tokenizer would have split the text anyway, the bigram filter's output
is very different. Best guess: it notices that the end/start offsets
from the original input are not consecutive, and therefore doesn't
combine those tokens.
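As a toy illustration of that guess (this is not Lucene's actual code; the function name and the (text, start, end) token shape are made up for the sketch), a filter that only pairs tokens whose offsets touch might look like:

```python
def bigram_if_adjacent(tokens):
    """Toy model of the guessed behavior: emit a bigram for each pair of
    offset-adjacent tokens, and emit a token as a unigram only when it
    pairs with neither neighbor. Tokens are (text, start, end) tuples."""
    out = []
    n = len(tokens)
    for i, (text, start, end) in enumerate(tokens):
        prev_adjacent = i > 0 and tokens[i - 1][2] == start
        next_adjacent = i < n - 1 and end == tokens[i + 1][1]
        if next_adjacent:
            # Offsets touch: combine with the following token.
            out.append(text + tokens[i + 1][0])
        elif not prev_adjacent:
            # No neighbor on either side: fall back to a unigram.
            out.append(text)
    return out

# "田中角栄" tokenized as unigrams with consecutive offsets:
print(bigram_if_adjacent([("田", 0, 1), ("中", 1, 2), ("角", 2, 3), ("栄", 3, 4)]))
# → ['田中', '中角', '角栄']

# "田中 角栄": the space leaves a gap between 中 (end 2) and 角 (start 3),
# so no bigram bridges it:
print(bigram_if_adjacent([("田", 0, 1), ("中", 1, 2), ("角", 3, 4), ("栄", 4, 5)]))
# → ['田中', '角栄']
```

With the space, the middle bigram disappears, which matches the difference I saw.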
Like I said above, I may have nothing at all to worry about here.
Thanks,
Shawn