Hello,

It seems that Tokenizer may violate the contract put forth by the
TokenStream.reset function. Specifically, TokenStream.reset states:

"*Resets this stream to a clean state. Stateful implementations must
implement this method so that they can be reused, just as if they had been
created fresh.*"

Tokenizer does not do this: a Tokenizer can only be reset once. On
subsequent resets, IllegalStateReader is swapped in as the Reader, and
incrementToken throws an IllegalStateException.

The complication arises because Tokenizer takes a Reader, and LUCENE-2387
was filed to intentionally unset the input (Reader) to prevent a memory
leak. However, unsetting it means we can never read from the Tokenizer a
second time (unless the Reader is set again), and thus it violates the
contract.
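To make the problem concrete, here is a minimal, self-contained sketch of the pattern I'm describing (the class and method names are hypothetical stand-ins, not Lucene's actual implementation): releasing the Reader reference swaps in a placeholder Reader that always throws, so the instance cannot be read a second time unless a new Reader is set.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical mini-Tokenizer illustrating the pattern: after the
// Reader is released, further reads throw IllegalStateException.
class MiniTokenizer {
    // Placeholder Reader installed once the real input is released.
    private static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] cbuf, int off, int len) {
            throw new IllegalStateException(
                "Reader was released; set a new Reader before reading again");
        }
        @Override public void close() {}
    };

    private Reader input = ILLEGAL_STATE_READER;

    void setReader(Reader r) { input = r; }

    // Reads one char per "token", purely for demonstration.
    int incrementToken() throws IOException { return input.read(); }

    // Frees the Reader reference (the memory-leak fix from LUCENE-2387's
    // perspective) by swapping in the always-throwing placeholder.
    void close() throws IOException {
        input.close();
        input = ILLEGAL_STATE_READER;
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        MiniTokenizer t = new MiniTokenizer();
        t.setReader(new StringReader("ab"));
        while (t.incrementToken() != -1) { /* consume tokens */ }
        t.close();
        try {
            t.incrementToken();  // second use without setReader()
            System.out.println("no exception");
        } catch (IllegalStateException e) {
            System.out.println("IllegalStateException on reuse");
        }
    }
}
```

Running this prints "IllegalStateException on reuse", which is the contract violation in miniature: the instance is not "as if it had been created fresh" after the first use.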

Should there be a way to reuse Tokenizers?

Thanks,
Dan
