Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space. Want to search for “vitamin a”? Better not remove stopwords.
2. Synonyms go before the stemmer, especially the Porter stemmer, whose output isn’t English words.
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have better technology now.
4. Add RemoveDuplicatesFilterFactory as the last step, just in case two of your synonyms stem to the same word. It is cheap insurance.
Also, I really recommend using the ICUNormalizer2CharFilterFactory with “nfkc” mode as the first step, before the tokenizer. Otherwise, you’ll get bitten by some weird Unicode thing that takes days to debug. And if you are going to lower-case everything, let ICU do that for you with “nfkc_cf” mode.

So that gives:

ICUNormalizer2CharFilterFactory name="nfkc_cf" (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesFilterFactory

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala <jayade...@ftltechsys.com> wrote:
>
> Hi all,
>
> Is this the best (performance-wise as well as efficacy) order of applying
> analyzers/filters? We have an eCom site where the many products are listed,
> and users may type in search words and get relevant results.
>
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
>
> Any possible gotchas?
>
> Regards,
> Jayadevan
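P.S. The chain above could be written as a schema.xml fieldType roughly like this sketch. The fieldType name and synonyms file name here are illustrative, and it assumes the ICU analysis jars (the analysis-extras module) are on the classpath. Note that FlattenGraphFilterFactory belongs in the index-time analyzer only; the query-time analyzer keeps the synonym graph intact.

```xml
<!-- Sketch of the recommended chain; type and file names are illustrative. -->
<fieldType name="text_en_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Unicode normalization plus case folding, before the tokenizer -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Synonyms before stemming; flatten the graph at index time only -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <!-- Cheap insurance against synonyms that stem to the same word -->
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesFilterFactory"/>
  </analyzer>
</fieldType>
```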