Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space. 
Want to search for “vitamin a”? Better not remove stopwords.
2. Synonyms go before the stemmer, especially with the Porter stemmer, whose 
output isn’t English words.
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have 
better technology now.
4. Add RemoveDuplicatesTokenFilter as the last step, in case your synonyms stem 
to the same word. It is cheap insurance.

Also, I really recommend using the ICUNormalizer2CharFilterFactory with “nfkc” 
mode as the first step before the tokenizer. Otherwise, you’ll get bitten by 
some weird Unicode thing that takes days to debug. And if you are going to 
lower-case everything, let ICU do that for you with “nfkc_cf” mode.
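To see what nfkc_cf buys you, here is a quick sketch (using Python’s stdlib 
unicodedata plus str.casefold to approximate ICU’s nfkc_cf, not Solr itself) 
showing fullwidth characters and ligatures collapsing to plain lowercase ASCII:

```python
import unicodedata

def nfkc_cf(text: str) -> str:
    # Approximates ICU "nfkc_cf": NFKC compatibility normalization
    # followed by Unicode case folding.
    return unicodedata.normalize("NFKC", text).casefold()

# Fullwidth letters and the ideographic space (common in CJK copy/paste)
# collapse to plain ASCII, so this matches a query for "vitamin a".
print(nfkc_cf("Ｖｉｔａｍｉｎ　Ａ"))  # vitamin a

# The "fi" ligature expands, so "ﬁle" matches "file".
print(nfkc_cf("ﬁle"))  # file
```

Without the char filter, none of those tokens would match their ASCII 
equivalents, which is exactly the kind of bug that takes days to find.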

So that gives:

ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesTokenFilterFactory
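Dropped into a schema, that chain would look roughly like this (fieldType and 
synonyms file names are placeholders; check the attributes against your Solr 
version). Note that FlattenGraphFilter belongs in the index-time analyzer only:

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- nfkc_cf is the default normalization form -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <!-- flatten the synonym graph for indexing; index-time only -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```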

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala <jayade...@ftltechsys.com> 
> wrote:
> 
> Hi all,
> 
> Is this the best (performance-wise as well as efficacy) order of applying
> analyzers/filters? We have an eCom site where the many products are listed,
> and users may type in search words and get relevant results.
> 
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase  (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
> 
> Any possible gotchas?
> 
> Regards,
> Jayadevan
