On 4/2/2014 11:33 AM, Tom Burton-West wrote:
Hi Shawn,

I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramFilter? You can tell it to make bigrams of different Japanese character sets. For example, the config given in the JavaDocs tells it to make bigrams across 3 of the different Japanese character sets. (Is the issue related to Romaji?)

<filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />

http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html

Tom

On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey <[email protected]> wrote:

My company is setting up a system for a customer from Japan. We have an existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about problems they are encountering with search, we have determined that some of the problems are caused because ICUTokenizer splits on *any* character set change, including changes between different Japanese character sets.

Knowing the risk of this being an XY problem, here's my question: Can someone help me develop a rule file for the ICU Tokenizer that will *not* split when the character set changes from one of the Japanese character sets to another Japanese character set, but still split on other character set changes?

Because ICUTokenizer splits on every character set change, the pieces are already separate terms by the time they reach the bigram filter.
Simplifying to English, let's pretend that upper and lowercase letters are in different character sets. Original term is abCD. You expect that by the end of the analysis, you'll have ab bC CD. With the ICUTokenizer, you end up with just ab CD.
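To make the abCD example concrete, here is a small Python sketch of the two behaviours. It is only a toy model, not Lucene or ICU code: the function names are made up for illustration, and "character set" is assumed to mean simply uppercase versus lowercase, as in the analogy above.

```python
from itertools import groupby


def split_on_script_change(text):
    """Toy tokenizer: start a new token whenever the 'script' changes.

    Assumption for illustration only: uppercase and lowercase letters
    stand in for two different character sets.
    """
    return ["".join(run) for _, run in groupby(text, key=str.isupper)]


def bigrams(token):
    """Return overlapping two-character terms; short tokens pass through."""
    if len(token) < 2:
        return [token]
    return [token[i:i + 2] for i in range(len(token) - 1)]


def analyze_split_first(text):
    """Split on script changes first, then bigram each token separately."""
    terms = []
    for token in split_on_script_change(text):
        terms.extend(bigrams(token))
    return terms


print(analyze_split_first("abCD"))  # ['ab', 'CD']
print(bigrams("abCD"))              # ['ab', 'bC', 'CD']
```

The bC term can only appear if the bigram step sees abCD as a single token, which is why setting flags on the bigram filter alone does not help once the tokenizer has already split the text.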
The index side is more complex because of outputUnigrams. We are still deciding whether we want to keep that parameter set, but that's a separate issue, one that we know how to resolve without help.
Thanks, Shawn
