Hi,

I've been experimenting with the ICUTokenizer from Lucene 4.0.0.
Using the code below, I was getting an ArrayIndexOutOfBoundsException
on the call to tokenizer.incrementToken().  Looking at the
ICUTokenizer source, I can see why this is occurring (usableLength
defaults to -1).

    ICUTokenizer tokenizer = new ICUTokenizer(myReader);
    CharTermAttribute termAtt =
        tokenizer.getAttribute(CharTermAttribute.class);

    while (tokenizer.incrementToken())
    {
        System.out.println(termAtt.toString());
    }

After poking around a little more, I found that I can just call
tokenizer.reset() (which initializes usableLength to 0) right after
constructing the object
(org.apache.lucene.analysis.icu.segmentation.TestICUTokenizer does a
similar step in its superclass).  Could someone explain why I need to
call tokenizer.reset() before using the tokenizer for the first
time?
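
To make the failure mode concrete, here is a small standalone sketch of
what I think is going on. It is not the Lucene source: ToyTokenizer, its
window buffer, the filled flag and the tokenize() helper are all made up
for illustration. The only details taken from ICUTokenizer are that
usableLength starts at -1 and that reset() sets it to 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; NOT the Lucene ICUTokenizer.
final class ToyTokenizer {
    private final String input;
    private final char[] window;
    private int usableLength = -1;   // invalid until reset() is called
    private int position = 0;
    private boolean filled = false;
    private String term = "";

    ToyTokenizer(String input) {
        this.input = input;
        this.window = new char[input.length()];
    }

    // Puts the tokenizer into a usable state (assumed contract).
    void reset() {
        usableLength = 0;
        position = 0;
        filled = false;
    }

    // Copies the input into the window, starting at usableLength.
    private void fill() {
        int start = usableLength;    // still -1 if reset() was skipped
        for (int i = 0; i < input.length(); i++) {
            window[start + i] = input.charAt(i);
        }
        usableLength = start + input.length();
        filled = true;
    }

    // Advances to the next whitespace-delimited token, if any.
    boolean incrementToken() {
        if (!filled) {
            fill();
        }
        while (position < usableLength && Character.isWhitespace(window[position])) {
            position++;
        }
        if (position >= usableLength) {
            return false;
        }
        int tokenStart = position;
        while (position < usableLength && !Character.isWhitespace(window[position])) {
            position++;
        }
        term = new String(window, tokenStart, position - tokenStart);
        return true;
    }

    String term() {
        return term;
    }

    // Helper: collect all tokens, optionally skipping the reset() call.
    static List<String> tokenize(String text, boolean callReset) {
        ToyTokenizer tokenizer = new ToyTokenizer(text);
        if (callReset) {
            tokenizer.reset();
        }
        List<String> tokens = new ArrayList<>();
        while (tokenizer.incrementToken()) {
            tokens.add(tokenizer.term());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello icu world", true));
        try {
            tokenize("hello icu world", false);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("without reset(): " + e);
        }
    }
}
```

In this toy version, skipping reset() makes the very first buffer write
land at index -1, which is the same kind of exception I see with the
real tokenizer.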

Thanks in advance,

Shane
