On 6/14/2011 5:34 PM, Robert Muir wrote:
On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey <s...@elyograg.org> wrote:
Because the text in my index comes in many different languages, with no
way to know the language ahead of time, I need to use ICUTokenizer
and/or the CJK filters, but I have a problem with how they are currently
implemented. They do extra things like handling email addresses,
tokenizing on non-alphanumeric characters, etc., and I need them not to
do those things. This is my current index analyzer chain:
The idea is that you customize it to whatever your app needs by
passing an ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java
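Roughly, the wiring looks like this (just a sketch; it assumes the trunk
constructor ICUTokenizer(Reader, ICUTokenizerConfig), and the config class
is whatever you write yourself):

import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;

public class CustomICUTokenizers {
  // config is your own ICUTokenizerConfig subclass (sketched further down)
  public static Tokenizer create(Reader reader, ICUTokenizerConfig config) {
    return new ICUTokenizer(reader, config);
  }
}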
The default implementation (DefaultICUTokenizerConfig) is pretty
minimal: mostly the Unicode default word break behavior,
described here: http://unicode.org/reports/tr29/
As you can see, you just need to provide a BreakIterator for a given
script code. You could implement this by hand in Java code, it could use a
dictionary, or whatever.
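A do-nothing skeleton might make the contract clearer. This is only a
sketch: it assumes the method names currently in trunk,
getBreakIterator(int script) and getType(int script, int ruleStatus), and
that DefaultICUTokenizerConfig has a no-arg constructor.

import com.ibm.icu.text.BreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class SkeletonConfig extends DefaultICUTokenizerConfig {
  @Override
  public BreakIterator getBreakIterator(int script) {
    // return a BreakIterator appropriate for this script: hand-written
    // Java code, a dictionary-based one, compiled rules, whatever;
    // here we just fall back to the default word break behavior
    return super.getBreakIterator(script);
  }

  @Override
  public String getType(int script, int ruleStatus) {
    // map the rule status tags from your BreakIterator onto token types;
    // the default handles the standard word/number/ideographic statuses
    return super.getType(script, ruleStatus);
  }
}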
But the easiest and usually most performant approach is just to use rules,
especially since they are compiled to an efficient form for processing.
The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules
You compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
And you can load the serialized form (statically, in your factory,
or wherever) with:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29
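For example (a sketch; "Latin.brk" below is a made-up name for whatever
file RBBIRuleCompiler wrote out from your rules):

import java.io.IOException;
import java.io.InputStream;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class CompiledRules {
  // load a compiled rule file from the classpath, once, e.g. statically
  public static BreakIterator load(String resource) throws IOException {
    InputStream in = CompiledRules.class.getResourceAsStream(resource);
    try {
      return RuleBasedBreakIterator.getInstanceFromCompiledRules(in);
    } finally {
      in.close();
    }
  }
}

Then something like CompiledRules.load("Latin.brk") gives you a
BreakIterator you can hand out from your config.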
The reason the script code is provided is that when you are customizing,
it's pretty easy to screw some languages over with rules that happen to
work well for another set of languages. This way you can provide
different rules depending upon the writing system.
For example, you could return special punctuation rules for Western
languages when it's the Latin script, but still return the default
implementation for Tibetan or something you might be less familiar with
(maybe you actually speak Tibetan; this was just an example).
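In code, that per-script dispatch could look something like this (again
just a sketch: UScript.LATIN is ICU's constant for the Latin script code,
the custom Latin rules are whatever you built or loaded above, and the
method name assumes trunk):

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class MyTokenizerConfig extends DefaultICUTokenizerConfig {
  private final BreakIterator latinRules;

  // latinRules would be your custom punctuation-aware rules,
  // e.g. loaded from compiled RBBI rules as sketched earlier
  public MyTokenizerConfig(BreakIterator latinRules) {
    this.latinRules = latinRules;
  }

  @Override
  public BreakIterator getBreakIterator(int script) {
    if (script == UScript.LATIN) {
      // BreakIterators are stateful, so hand out a fresh copy each time
      return (BreakIterator) latinRules.clone();
    }
    // Tibetan and everything else keeps the default behavior
    return super.getBreakIterator(script);
  }
}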
My understanding starts to break down horribly with things like this. I
can make sense of very simple Java code, but I can't make sense of this,
and I don't know how to take these bits of information you've given me
and do something useful with them. I will take the information to our
programming team before I bug you about it again; they will probably
have some idea what to do. I'm hoping that I can just create an extra
.jar and not touch the existing Lucene/Solr code.
Beyond the ICU stuff, what kind of options do I have for dealing with
other character sets (CJK, Arabic, Cyrillic, etc.) in some sane manner
while not touching typical Latin punctuation? I notice that for CJK
there is only a Tokenizer and an Analyzer; what I really need is a token
filter that ONLY deals with the CJK characters. Is that going to be a
major undertaking that is best handled by an experienced Lucene
developer? Would such a thing be required for Arabic and Cyrillic, or
are they pretty well covered by whitespace and WDF (WordDelimiterFilter)?
Thanks,
Shawn