On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <s...@elyograg.org> wrote:
> Because the text in my index comes in many different languages with no
> ability to know the language ahead of time, I have a need to use
> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
> are implemented currently. They do extra things like handle email
> addresses, tokenize on non-alphanumeric characters, etc. I need them to not
> do these things. This is my current index analyzer chain:

the idea is that you customize it to whatever your app needs, by passing an ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java

the default implementation (DefaultICUTokenizerConfig) is pretty minimal, mostly the unicode default word break implementation, described here:
http://unicode.org/reports/tr29/

as you can see, you just need to provide a BreakIterator given the script code. you could implement this by hand in java code, or it could use a dictionary, or whatever. But the easiest and usually most performant option is just to use rules, especially since they are compiled to an efficient form for processing. the syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules

you compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java

and you can load the serialized form (statically, or in your factory, or whatever) with:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29

the reason the script code is provided is that if you are customizing, it's pretty easy to screw some languages over with rules that happen to work well for another set of languages. so this way you can provide different rules depending upon the writing system. for example, you could return special punctuation rules for western languages when it's the latin script, but still return the default impl for Tibetan or something you might be less familiar with (maybe you actually speak Tibetan, this was just an example).
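
to make the "BreakIterator given the script code" idea concrete, here is a rough sketch of the dispatch. note that it is NOT written against the Lucene/ICU4J classes: it uses only the JDK's java.text.BreakIterator and Character.UnicodeScript as stand-ins so that it runs on its own. the names BreakConfig, PerScriptConfig, WhitespaceBreakIterator and tokens are made up for illustration. the hand-written whitespace-only iterator plays the role of your custom rules for the latin script, and everything else falls through to the platform default:

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ScriptAwareBreaks {

    /** Hypothetical stand-in for the config contract: one BreakIterator per script. */
    interface BreakConfig {
        BreakIterator getBreakIterator(Character.UnicodeScript script);
    }

    /** Hand-written iterator: boundaries only where whitespace starts or stops. */
    static final class WhitespaceBreakIterator extends BreakIterator {
        private CharacterIterator text = new StringCharacterIterator("");
        private int pos;

        @Override public void setText(CharacterIterator t) { text = t; pos = t.getBeginIndex(); }
        @Override public CharacterIterator getText() { return text; }
        @Override public int first() { pos = text.getBeginIndex(); return pos; }
        @Override public int last() { pos = text.getEndIndex(); return pos; }
        @Override public int current() { return pos; }

        @Override public int next() {
            if (pos >= text.getEndIndex()) return DONE;
            boolean ws = isWs(pos);
            int i = pos + 1;
            while (i < text.getEndIndex() && isWs(i) == ws) i++;
            pos = i;
            return pos;
        }

        @Override public int previous() {
            if (pos <= text.getBeginIndex()) return DONE;
            boolean ws = isWs(pos - 1);
            int i = pos - 1;
            while (i > text.getBeginIndex() && isWs(i - 1) == ws) i--;
            pos = i;
            return pos;
        }

        @Override public int next(int n) {
            int r = current();
            for (; n > 0 && r != DONE; n--) r = next();
            for (; n < 0 && r != DONE; n++) r = previous();
            return r;
        }

        @Override public int following(int offset) {
            int r = first();
            while (r != DONE && r <= offset) r = next();
            return r;
        }

        private boolean isWs(int index) {
            return Character.isWhitespace(text.setIndex(index));
        }
    }

    /** Custom behavior for Latin only; every other script keeps the default. */
    static final class PerScriptConfig implements BreakConfig {
        @Override public BreakIterator getBreakIterator(Character.UnicodeScript script) {
            if (script == Character.UnicodeScript.LATIN) {
                return new WhitespaceBreakIterator();
            }
            return BreakIterator.getWordInstance(Locale.ROOT);
        }
    }

    /** Collects the segments that contain at least one letter or digit. */
    static List<String> tokens(BreakConfig config, Character.UnicodeScript script, String text) {
        BreakIterator it = config.getBreakIterator(script);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        BreakConfig config = new PerScriptConfig();
        // Latin goes through the custom iterator, so the address and the
        // hyphenated word stay whole: prints [foo@example.com, bar-baz]
        System.out.println(tokens(config, Character.UnicodeScript.LATIN, "foo@example.com bar-baz"));
    }
}
```

in the real thing the tokenizer works out the script of each run and hands you the code itself, and you would more likely return a RuleBasedBreakIterator loaded from your compiled rules than a hand-written subclass like this one.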