Re: Analysis of Japanese characters

2014-04-07 Thread Shawn Heisey
On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
> Tom, You should be using JapaneseAnalyzer (kuromoji). Neither CJK nor ICU tokenizes at word boundaries.
Is JapaneseAnalyzer configurable with regard to what it does with non-Japanese text? If it's not, it won't work for me. We use a combination of

Re: Analysis of Japanese characters

2014-04-07 Thread T. Kuro Kurosaka
Tom, You should be using JapaneseAnalyzer (kuromoji). Neither CJK nor ICU tokenizes at word boundaries.
On 04/02/2014 10:33 AM, Tom Burton-West wrote:
> Hi Shawn, I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you pe

Re: Analysis of Japanese characters

2014-04-03 Thread Alexandre Rafalovitch
No specific answers, but have you read the detailed CJK article collection (http://discovery-grindstone.blogspot.ca/)? There is a lot of information there. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr profici

Re: Analysis of Japanese characters

2014-04-03 Thread Tom Burton-West
Hi Shawn,
>> For an input of 田中角栄 the bigram filter works like you described, and what I would expect. If I add a space at the point where the ICU tokenizer would have split them anyway, the bigram filter output is very different.
If I'm understanding what you are reporting, I suspect this is b

Re: Analysis of Japanese characters

2014-04-02 Thread Shawn Heisey
On 4/2/2014 2:19 PM, Tom Burton-West wrote:
> Hi Shawn, I may still be missing your point. Below is an example where the ICUTokenizer splits
> Now, I'm beginning to wonder if I really understand what those flags on the CJKBigramFilter do. The ICUTokenizer spits out unigrams and the CJKBigramFilter

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn,
I may still be missing your point. Below is an example where the ICUTokenizer splits
Now, I'm beginning to wonder if I really understand what those flags on the CJKBigramFilter do. The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them back together into bigrams. I t
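
The unigram-to-bigram step described in this message can be pictured with a small sketch. This is only an illustration of overlapping character bigrams, not Lucene's CJKBigramFilter code: the function name `cjk_bigrams` is hypothetical, every character is treated as CJK, and whitespace is assumed to break a run so that characters on either side of it are never paired.

```python
def cjk_bigrams(text):
    """Return overlapping character bigrams for each whitespace-separated run.

    Simplified model (an assumption, not the real filter): every character is
    treated as a CJK unigram, and a space ends the current run.
    """
    out = []
    for run in text.split():
        if len(run) == 1:
            # A lone character has no neighbor to pair with; keep it as-is.
            out.append(run)
        else:
            # Slide a two-character window across the run.
            out.extend(run[i:i + 2] for i in range(len(run) - 1))
    return out


print(cjk_bigrams("田中角栄"))   # ['田中', '中角', '角栄']
print(cjk_bigrams("田中 角栄"))  # ['田中', '角栄']
```

Under these assumptions the bigram 中角 that spans the two halves disappears once a space is inserted, which is one way the output for the same four characters can differ.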

Re: Analysis of Japanese characters

2014-04-02 Thread Shawn Heisey
On 4/2/2014 11:33 AM, Tom Burton-West wrote:
> Hi Shawn, I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem? Have you looked at the flags for the CJKBigramFilter? You can t

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn, I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem? Have you looked at the flags for the CJKBigramFilter? You can tell it to make bigrams of different Japanese ch