Thanks Robert, I'll take a look there.  Does it sound like I'm on the
right track with what I'm implementing?  In other words, is a
TokenFilter appropriate, or is there something else that would be a
better fit for what I've described?

On Thu, Feb 9, 2012 at 6:44 PM, Robert Muir <rcm...@gmail.com> wrote:
> If you are writing a custom tokenstream, I recommend using some of the
> resources in Lucene's test-framework.jar to test it.
> These find lots of bugs! (including thread-safety bugs)
>
> For a filter: I recommend using the assertions in
> BaseTokenStreamTestCase: assertTokenStreamContents, assertAnalyzesTo,
> and especially checkRandomData:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java
>
> When testing your filter, for even more checks, don't use Whitespace
> or Keyword Tokenizer; use MockTokenizer instead, as it has more checks:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/MockTokenizer.java
>
> For some examples, you can look at the tests in modules/analysis.
>
> And of course enable assertions (-ea) when testing!
>
> On Thu, Feb 9, 2012 at 6:30 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> I need to take user input and index it in a unique fashion:
>> essentially the value is some string (say "abcdefghijk") that needs to
>> be converted into a set of tokens (say 1 2 3 4).  I have currently
>> implemented a custom TokenFilter to do this.  Is this appropriate?  In
>> cases where I am indexing things slowly (i.e. 1 at a time) this works
>> fine, but when I send 10,000 things to Solr (all in one thread) I am
>> noticing exceptions where it seems that the generated instance
>> variable is being used by several threads.  Is my implementation
>> appropriate or is there another more appropriate way to do this?  Are
>> TokenFilters reused?  Would it be more appropriate to convert the
>> stream to 1 space-separated token and then run that through a
>> WhitespaceTokenizer?  Any guidance on this would be greatly
>> appreciated.
>>
>>        class CustomFilter extends TokenFilter {
>>                private final CharTermAttribute termAtt =
>>                        addAttribute(CharTermAttribute.class);
>>                private final PositionIncrementAttribute posAtt =
>>                        addAttribute(PositionIncrementAttribute.class);
>>                protected CustomFilter(TokenStream input) {
>>                        super(input);
>>                }
>>
>>                // tokens generated from the current input token
>>                private List<AttributeSource> generated;
>>                @Override
>>                public boolean incrementToken() throws IOException {
>>
>>
>>                        if(generated == null){
>>                                //setup generated
>>                                if(!input.incrementToken()){
>>                                        return false;
>>                                }
>>
>>                                //clearAttributes();
>>                                List<String> cells =
>>                                        StaticClass.generateTokens(termAtt.toString());
>>                                generated =
>>                                        new ArrayList<AttributeSource>(cells.size());
>>                                boolean first = true;
>>                                for(String cell : cells) {
>>                                        AttributeSource newTokenSource =
>>                                                this.cloneAttributes();
>>
>>                                        CharTermAttribute newTermAtt =
>>                                                newTokenSource.addAttribute(CharTermAttribute.class);
>>                                        newTermAtt.setEmpty();
>>                                        newTermAtt.append(cell);
>>                                        OffsetAttribute newOffsetAtt =
>>                                                newTokenSource.addAttribute(OffsetAttribute.class);
>>                                        PositionIncrementAttribute newPosIncAtt =
>>                                                newTokenSource.addAttribute(PositionIncrementAttribute.class);
>>                                        newOffsetAtt.setOffset(0,0);
>>                                        newPosIncAtt.setPositionIncrement(first ? 1 : 0);
>>                                        // queue the generated token
>>                                        generated.add(newTokenSource);
>>                                        first = false;
>>                                }
>>
>>                        }
>>                        if(!generated.isEmpty()){
>>                                copy(this, generated.remove(0));
>>                                return true;
>>                        }
>>
>>                        return false;
>>
>>                }
>>
>>                private void copy(AttributeSource target,
>>                                AttributeSource source) {
>>                        if (target != source)
>>                                source.copyTo(target);
>>                }
>>
>>                private LinkedList<AttributeSource> buffer;
>>                private LinkedList<AttributeSource> matched;
>>
>>                private boolean exhausted;
>>
>>                private AttributeSource nextTok() throws IOException {
>>                        if (buffer != null && !buffer.isEmpty()) {
>>                                return buffer.removeFirst();
>>                        } else {
>>                                if (!exhausted && input.incrementToken()) {
>>                                        return this;
>>                                } else {
>>                                        exhausted = true;
>>                                        return null;
>>                                }
>>                        }
>>                }
>>                @Override
>>                public void reset() throws IOException {
>>                        super.reset();
>>                        generated = null;
>>                }
>>        }
>
>
>
> --
> lucidimagination.com
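
One thing that stands out in the quoted filter: generated is only built
while it is null, and once the list has been drained incrementToken()
returns false, so input tokens after the first are not expanded until
reset() runs.  Below is a minimal plain-Java sketch of the
refill-when-empty pattern.  It is a model, not Lucene code:
ExpandingFilterSketch, Token, next() and drain() are hypothetical names,
the Iterator<String> stands in for the upstream TokenStream, and the
generator function stands in for StaticClass.generateTokens.  The buffer
is a per-instance field, which is the assumption that matters if filter
instances are reused.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of a one-to-many token expander; hypothetical, not a Lucene class. */
final class ExpandingFilterSketch {

    /** A term plus its position increment (1 = new position, 0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Stand-in for the upstream TokenStream.
    private final Iterator<String> input;
    // Stand-in for StaticClass.generateTokens from the original post.
    private final Function<String, List<String>> generator;
    // Per-instance buffer of tokens generated from the current input token.
    private final Deque<Token> pending = new ArrayDeque<>();

    ExpandingFilterSketch(Iterator<String> input, Function<String, List<String>> generator) {
        this.input = input;
        this.generator = generator;
    }

    /** Returns the next generated token, or null once the input is exhausted. */
    Token next() {
        // Refill whenever the buffer is empty, so every input token gets expanded.
        // The loop also skips input tokens that expand to nothing.
        while (pending.isEmpty()) {
            if (!input.hasNext()) {
                return null;
            }
            boolean first = true;
            for (String cell : generator.apply(input.next())) {
                // First generated token advances the position; the rest stack on it.
                pending.add(new Token(cell, first ? 1 : 0));
                first = false;
            }
        }
        return pending.removeFirst();
    }

    /** Convenience helper: runs the expander to completion and collects the output. */
    static List<Token> drain(Iterator<String> input, Function<String, List<String>> generator) {
        ExpandingFilterSketch filter = new ExpandingFilterSketch(input, generator);
        List<Token> out = new ArrayList<>();
        for (Token t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }
}
```

In a real TokenFilter the same loop would presumably live in
incrementToken(), with the buffer cleared in reset(); that mapping is an
assumption and is not exercised by this sketch.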
