Re: International filters/tokenizers doing too much

Robert Muir Tue, 14 Jun 2011 16:36:13 -0700

On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <s...@elyograg.org> wrote:
> Because the text in my index comes in many different languages with no
> ability to know the language ahead of time, I have a need to use
> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
> are implemented currently.  They do extra things like handle email
> addresses, tokenize on non-alphanumeric characters, etc.  I need them to not
> do these things.  This is my current index analyzer chain:


the idea is that you customize it to whatever your app needs, by
passing ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java

the default implementation (DefaultICUTokenizerConfig) is pretty
minimal, mostly the unicode default word break implementation,
described here: http://unicode.org/reports/tr29/

as you can see, you just need to provide a BreakIterator for a given
script code. you could implement this by hand in java code, or it could
use a dictionary, or whatever.

But the easiest and usually most performant approach is just to use
rules, especially since they are compiled to an efficient form for
processing. the syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
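
A minimal sketch of what such a rule file might look like, assuming the
goal is to break only on whitespace (so email addresses and punctuated
terms stay in one piece). The variable name $NonSpace is made up for this
example. The {200} tag is ICU's "letter" word status; the assumption here
is that the tokenizer keeps segments with that status and drops the
untagged whitespace segments. This only makes sense for scripts that
separate words with spaces:

```
# hypothetical rules: one token per run of non-whitespace characters
$NonSpace = [^\p{White_Space}];

# tag each run with the "letter" word status (200)
$NonSpace+ {200};
```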

you compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
and you can load the serialized form (statically, or in your factory,
or whatever) with
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29

the reason the script code is provided is that, if you are
customizing, it's pretty easy to screw some languages over with
rules that happen to work well for another set of languages.
so this way you can provide different rules depending upon the writing system.

for example, you could return special punctuation rules for western
languages when it's the latin script, but still return the default impl
for Tibetan or something you might be less familiar with (maybe you
actually speak Tibetan, this was just an example).
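
The per-script dispatch described above can be sketched with JDK classes
only. This is not the Lucene or ICU API: the Segmenter interface, the
class name, and segmenterFor are hypothetical stand-ins, and
java.text.BreakIterator stands in for the ICU default. The point is just
the shape of the decision: custom behaviour for Latin, the default for
every other script.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these names exist in Lucene or ICU.
class ScriptDispatchSketch {

    /** Stand-in for "a break iterator for one script". */
    interface Segmenter {
        List<String> segment(String text);
    }

    // Custom rule for Latin text: break on whitespace only, so things like
    // "user@example.com" or "foo-bar" are kept as single tokens.
    static final Segmenter WHITESPACE_ONLY = text -> {
        List<String> out = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    };

    // Default for everything else: the JDK word BreakIterator, keeping only
    // segments that contain at least one letter or digit.
    static final Segmenter DEFAULT_WORDS = text -> {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    };

    /** Choose a segmenter by script: special-case Latin, default otherwise. */
    static Segmenter segmenterFor(UnicodeScript script) {
        return script == UnicodeScript.LATIN ? WHITESPACE_ONLY : DEFAULT_WORDS;
    }
}
```

With this, segmenterFor(UnicodeScript.LATIN) uses the whitespace-only
behaviour, while segmenterFor(UnicodeScript.TIBETAN) falls through to the
default, which is the same split suggested above for a real
ICUTokenizerConfig.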
