Re: Differences between FilterFactory and TokenizerFactory?

Ahmet Arslan Tue, 05 Oct 2010 10:21:07 -0700

> There are EdgeNGramFilterFactory
> & EdgeNGramTokenizerFactory.
> 
> Likewise there are StandardFilterFactory &
> StandardTokenizerFactory.
> 
> LowerCaseFilterFactory & LowerCaseTokenizerFactory.
> 
> Seems like they always come in pairs. 
> 
> What are the differences between FilterFactory and
> TokenizerFactory? When should I use one as opposed to the
> other?


A Tokenizer breaks input text into words/tokens. Its input is a Reader. An 
Analyzer has exactly one tokenizer. For example, StandardTokenizer removes 
punctuation and recognizes e-mail addresses. 

TokenFilters operate on the output of the tokenizer. Their input is words/tokens.
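
To make the division of labour concrete, here is a minimal sketch of a chain 
with one tokenizer followed by filters. It is a toy model in plain Java, not 
Lucene's API: ToyTokenizer, whitespaceTokenize, lowerCaseFilter, stopFilter and 
analyze are all hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy model only: these types are hypothetical and are NOT Lucene's classes.
class ChainSketch {

    // The tokenizer is the only stage that reads raw characters from a Reader.
    interface ToyTokenizer {
        List<String> tokenize(Reader input);
    }

    // Splits on whitespace; stands in for a real tokenizer.
    static List<String> whitespaceTokenize(Reader input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // A filter never sees the Reader, only tokens produced upstream.
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // A second filter: drops the single stop word "the".
    static List<String> stopFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!t.equals("the")) {
                out.add(t);
            }
        }
        return out;
    }

    // Exactly one tokenizer, followed by any number of filters, in order.
    @SafeVarargs
    static List<String> analyze(String text, ToyTokenizer tokenizer,
                                UnaryOperator<List<String>>... filters) {
        List<String> tokens = tokenizer.tokenize(new StringReader(text));
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, fox]
        System.out.println(analyze("The Quick FOX",
                ChainSketch::whitespaceTokenize,
                ChainSketch::lowerCaseFilter,
                ChainSketch::stopFilter));
    }
}
```

Filter order matters in this sketch: stopFilter only matches "the" because 
lowerCaseFilter ran before it.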

LowerCaseTokenizerFactory can be expressed as a combination of LetterTokenizer + 
LowerCaseFilter.

EdgeNGramTokenizerFactory can be thought of as KeywordTokenizer + 
EdgeNGramFilterFactory.
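
A rough way to picture this equivalence, again as a plain-Java toy model 
rather than Lucene's implementation. The helper names and the explicit 
minGram/maxGram parameters are assumptions for illustration, and only 
front-edge n-grams are modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical helpers, NOT Lucene's classes.
class EdgeNGramSketch {

    // Keyword-style tokenizer: the entire input becomes a single token.
    static List<String> keywordTokenize(String text) {
        return List.of(text);
    }

    // Filter: for each incoming token, emit its leading prefixes
    // of length minGram..maxGram (capped at the token length).
    static List<String> edgeNGramFilter(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int limit = Math.min(maxGram, token.length());
            for (int n = minGram; n <= limit; n++) {
                out.add(token.substring(0, n));
            }
        }
        return out;
    }

    // Tokenizer: emits the same prefixes directly from the raw input.
    static List<String> edgeNGramTokenize(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(maxGram, text.length());
        for (int n = minGram; n <= limit; n++) {
            out.add(text.substring(0, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Both lines print [s, so, sol]
        System.out.println(edgeNGramFilter(keywordTokenize("solr"), 1, 3));
        System.out.println(edgeNGramTokenize("solr", 1, 3));
    }
}
```

The two only diverge when the filter follows a tokenizer that splits the 
input: the filter then produces prefixes for every word, while the tokenizer 
version only ever sees the start of the whole input. That is usually the 
deciding factor when choosing between them.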

For example, when you have a LetterTokenizer + LowerCaseFilter combination in 
your analyzer chain, you can replace it with LowerCaseTokenizerFactory for a 
performance gain.
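
One way to see where the gain could come from is that the combined tokenizer 
does the splitting and the lowercasing in a single pass over the characters. 
The sketch below is a plain-Java toy model with hypothetical helper names, not 
Lucene's code, and it makes no claim about how large the gain is in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: hypothetical helpers, NOT Lucene's classes.
class LowerCaseSketch {

    // Letter-style tokenizer: tokens are maximal runs of letters.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Separate filter: a second pass over every token.
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Fused version: splits on non-letters and lowercases in the same loop.
    static List<String> lowerCaseTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, World-2010!";
        // Both lines print [hello, world]
        System.out.println(lowerCaseFilter(letterTokenize(text)));
        System.out.println(lowerCaseTokenize(text));
    }
}
```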
