On 6/14/2011 5:34 PM, Robert Muir wrote:
On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <s...@elyograg.org> wrote:
Because the text in my index comes in many different languages, with no
way to know the language ahead of time, I need to use ICUTokenizer
and/or the CJK filters, but I have a problem with how they are currently
implemented. They do extra things like handling email addresses,
tokenizing on non-alphanumeric characters, etc., and I need them not to
do those things. This is my current index analyzer chain:
The idea is that you customize it to whatever your app needs by
passing an ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java
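Roughly, the wiring looks like this (just a sketch; it assumes the trunk
constructor ICUTokenizer(Reader, ICUTokenizerConfig), and the config class
is whatever you write yourself):

import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;

public class CustomICUTokenizers {
  // config is your own ICUTokenizerConfig subclass (sketched further down)
  public static Tokenizer create(Reader reader, ICUTokenizerConfig config) {
    return new ICUTokenizer(reader, config);
  }
}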
The default implementation (DefaultICUTokenizerConfig) is pretty
minimal: mostly the Unicode default word break behavior,
described here: http://unicode.org/reports/tr29/
As you can see, you just need to provide a BreakIterator for a given
script code. You could implement this by hand in Java code, it could use a
dictionary, or whatever.
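A do-nothing skeleton might make the contract clearer. This is only a
sketch: it assumes the method names currently in trunk,
getBreakIterator(int script) and getType(int script, int ruleStatus), and
that DefaultICUTokenizerConfig has a no-arg constructor.

import com.ibm.icu.text.BreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class SkeletonConfig extends DefaultICUTokenizerConfig {
  @Override
  public BreakIterator getBreakIterator(int script) {
    // return a BreakIterator appropriate for this script: hand-written
    // Java code, a dictionary-based one, compiled rules, whatever;
    // here we just fall back to the default word break behavior
    return super.getBreakIterator(script);
  }

  @Override
  public String getType(int script, int ruleStatus) {
    // map the rule status tags from your BreakIterator onto token types;
    // the default handles the standard word/number/ideographic statuses
    return super.getType(script, ruleStatus);
  }
}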
But the easiest and usually most performant approach is just to use rules,
especially since they are compiled to an efficient form for processing.
The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
You compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
And you can load the serialized form (statically, in your factory,
or wherever) with:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
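For example (a sketch; "Latin.brk" below is a made-up name for whatever
file RBBIRuleCompiler wrote out from your rules):

import java.io.IOException;
import java.io.InputStream;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class CompiledRules {
  // load a compiled rule file from the classpath, once, e.g. statically
  public static BreakIterator load(String resource) throws IOException {
    InputStream in = CompiledRules.class.getResourceAsStream(resource);
    try {
      return RuleBasedBreakIterator.getInstanceFromCompiledRules(in);
    } finally {
      in.close();
    }
  }
}

Then something like CompiledRules.load("Latin.brk") gives you a
BreakIterator you can hand out from your config.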
The reason the script code is provided is that when you are customizing,
it's pretty easy to screw some languages over with rules that happen to
work well for another set of languages. This way you can provide
different rules depending upon the writing system.
For example, you could return special punctuation rules for Western
languages when it's the Latin script, but still return the default
implementation for Tibetan or something you might be less familiar with
(maybe you actually speak Tibetan; this was just an example).
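In code, that per-script dispatch could look something like this (again
just a sketch: UScript.LATIN is ICU's constant for the Latin script code,
the custom Latin rules are whatever you built or loaded above, and the
method name assumes trunk):

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class MyTokenizerConfig extends DefaultICUTokenizerConfig {
  private final BreakIterator latinRules;

  // latinRules would be your custom punctuation-aware rules,
  // e.g. loaded from compiled RBBI rules as sketched earlier
  public MyTokenizerConfig(BreakIterator latinRules) {
    this.latinRules = latinRules;
  }

  @Override
  public BreakIterator getBreakIterator(int script) {
    if (script == UScript.LATIN) {
      // BreakIterators are stateful, so hand out a fresh copy each time
      return (BreakIterator) latinRules.clone();
    }
    // Tibetan and everything else keeps the default behavior
    return super.getBreakIterator(script);
  }
}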
My understanding starts to break down horribly with things like this. I
can make sense of very simple Java code, but I can't make sense of this,
and I don't know how to take these bits of information you've given me
and do something useful with them. I will take the information to our
programming team before I bug you about it again; they will probably
have some idea what to do. I'm hoping that I can just create an extra
.jar and not touch the existing Lucene/Solr code.
Beyond the ICU stuff, what kind of options do I have for dealing with
other character sets (CJK, Arabic, Cyrillic, etc.) in some sane manner
while not touching typical Latin punctuation? I notice that for CJK
there is only a Tokenizer and an Analyzer; what I really need is a token
filter that ONLY deals with the CJK characters. Is that going to be a
major undertaking that is best handled by an experienced Lucene
developer? Would such a thing be required for Arabic and Cyrillic, or
are they pretty well covered by whitespace and WDF (WordDelimiterFilter)?
Thanks,
Shawn