> There are EdgeNGramFilterFactory & EdgeNGramTokenizerFactory.
>
> Likewise there are StandardFilterFactory & StandardTokenizerFactory.
>
> LowerCaseFilterFactory & LowerCaseTokenizerFactory.
>
> Seems like they always come in pairs.
>
> What are the differences between FilterFactory and
> TokenizerFactory? When should I use one as opposed to the other?

A Tokenizer breaks the input text into words/tokens. Its input is a Reader, and an Analyzer contains exactly one tokenizer. For example, StandardTokenizer removes punctuation and recognizes e-mail addresses.

TokenFilters operate on the output of the tokenizer, so their input is a stream of words/tokens rather than raw text.

Some tokenizers are shortcuts for a tokenizer plus a filter. LowerCaseTokenizerFactory can be expressed as the combination LetterTokenizer + LowerCaseFilter. EdgeNGramTokenizerFactory can be thought of as KeywordTokenizer + EdgeNGramFilterFactory.

So, for example, when you have the LetterTokenizer + LowerCaseFilter combination in your analyzer chain, you can replace it with LowerCaseTokenizerFactory for a performance gain.
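
To make the distinction concrete, here is a rough toy model in plain Python. It is not Lucene/Solr code: the function names are made up for illustration and only mimic the behaviour described above (a tokenizer consumes raw text, a filter consumes tokens, and the combined tokenizers do both steps in one pass).

```python
import re
from typing import Iterable, Iterator

# Matches maximal runs of letters (word characters minus digits and underscore).
_LETTERS = re.compile(r"[^\W\d_]+")


def letter_tokenizer(text: str) -> Iterator[str]:
    """Toy tokenizer: takes raw text, emits runs of letters as tokens."""
    for match in _LETTERS.finditer(text):
        yield match.group()


def lower_case_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Toy token filter: takes tokens (not text), emits them lowercased."""
    for token in tokens:
        yield token.lower()


def lower_case_tokenizer(text: str) -> Iterator[str]:
    """Toy combined tokenizer: splits and lowercases in a single pass."""
    for match in _LETTERS.finditer(text):
        yield match.group().lower()


def keyword_tokenizer(text: str) -> Iterator[str]:
    """Toy tokenizer that emits the whole input as one token."""
    yield text


def edge_ngram_filter(tokens: Iterable[str], min_gram: int, max_gram: int) -> Iterator[str]:
    """Toy token filter: emits the leading n-grams of each token."""
    for token in tokens:
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            yield token[:size]


text = "Hello, World! 42"
two_step = list(lower_case_filter(letter_tokenizer(text)))
one_step = list(lower_case_tokenizer(text))
print(two_step, one_step)

print(list(edge_ngram_filter(keyword_tokenizer("Solr"), 1, 3)))
```

In this sketch the two-step chain and the combined tokenizer produce the same tokens; the combined version just avoids the extra pass over the token stream, which is the idea behind the performance gain mentioned above.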