On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.
Is JapaneseAnalyzer configurable with regard to what it does with
non-Japanese text? If it's not, it won't work for me.
We use a combination of
On 04/02/2014 10:33 AM, Tom Burton-West wrote:
Hi Shawn,
I'm not sure I understand the problem and why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter.
Can you pe
No specific answers, but have you read the detailed CJK article
collection: http://discovery-grindstone.blogspot.ca/ . There is a lot
of information there.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr profici
Hi Shawn,
>> For an input of 田中角栄 the bigram filter works like you described, and what
>> I would expect. If I add a space at the point where the ICU tokenizer
>> would have split them anyway, the bigram filter output is very different.
If I'm understanding what you are reporting, I suspect this is b
On 4/2/2014 2:19 PM, Tom Burton-West wrote:
Hi Shawn,
I may still be missing your point. Below is an example where the
ICUTokenizer splits
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.
I t
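
To make the unigram-to-bigram step concrete, here is a minimal Python sketch.
It is a simplified model and not the actual Lucene code: the names unigrams,
split_runs and bigrams are made up for illustration, and it assumes that
bigrams are only formed between tokens whose character offsets are
contiguous, so a space in the input starts a new run.

```python
from typing import List, Tuple

# (text, start_offset, end_offset); a hypothetical stand-in for tokenizer output
Token = Tuple[str, int, int]


def unigrams(text: str) -> List[Token]:
    """Emit one token per non-space character, keeping its character offsets."""
    return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


def split_runs(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into runs whose offsets touch (no gap between them)."""
    runs: List[List[Token]] = []
    for tok in tokens:
        if runs and runs[-1][-1][2] == tok[1]:
            runs[-1].append(tok)
        else:
            runs.append([tok])
    return runs


def bigrams(tokens: List[Token]) -> List[str]:
    """Join adjacent unigrams into overlapping bigrams, run by run.

    A run of length 1 has no neighbour, so it is emitted as a lone unigram.
    """
    out: List[str] = []
    for run in split_runs(tokens):
        if len(run) == 1:
            out.append(run[0][0])
        else:
            out.extend(a[0] + b[0] for a, b in zip(run, run[1:]))
    return out


print(bigrams(unigrams("田中角栄")))
print(bigrams(unigrams("田中 角栄")))
```

Under this assumption the unbroken input yields three overlapping bigrams,
while the input with a space yields only one bigram per side, because no
bigram is formed across the gap.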
On 4/2/2014 11:33 AM, Tom Burton-West wrote:
Hi Shawn,
I'm not sure I understand the problem and why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter.
Can you perhaps give a few examples of the problem?
Have you looked at the flags for the CJKBigramFilter?
You can tell it to make bigrams of different Japanese ch
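
For reference, solr.CJKBigramFilterFactory has per-script switches (han,
hiragana, katakana, hangul) plus outputUnigrams. The Python sketch below is
only a rough model of what such flags could mean, not the filter's real
logic: script_of and bigram_by_script are hypothetical helpers, scripts are
guessed from Unicode character names, and characters outside the enabled
scripts are passed through one at a time (a real tokenizer may keep such
runs together as words).

```python
import unicodedata
from typing import List, Set


def script_of(ch: str) -> str:
    """Rough script guess based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HANGUL"):
        return "hangul"
    return "other"


def bigram_by_script(text: str, enabled: Set[str],
                     output_unigrams: bool = False) -> List[str]:
    """Bigram only runs of characters whose script is in `enabled`."""
    out: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # A lone character has no neighbour, so it stays a unigram.
        if len(run) == 1:
            out.append(run[0])
        for i in range(len(run) - 1):
            if output_unigrams:
                out.append(run[i])
            out.append(run[i] + run[i + 1])
        if output_unigrams and len(run) > 1:
            out.append(run[-1])
        run.clear()

    for ch in text:
        if script_of(ch) in enabled:
            run.append(ch)
        else:
            flush()
            if not ch.isspace():
                out.append(ch)  # pass-through, simplified
    flush()
    return out


print(bigram_by_script("東京タワー", {"han", "katakana"}))
print(bigram_by_script("東京タワー", {"han"}))
print(bigram_by_script("東京", {"han"}, output_unigrams=True))
```

In this model, turning a script off simply stops its characters from being
paired, and output_unigrams adds the single characters alongside the bigrams.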