> There are a lot of company names that people are uncertain as to the
> correct spelling. A few of examples are:
> 1. best buy, bestbuy
> 2. walmart, wal mart, wal-mart
> 3. Holiday Inn, HolidayInn
>
> What Tokenizer Factory and/or TokenFilterFactory should I use so that
> somebody typing "wal mart"(quotes not included) will find "wal mart"
> and "walmart"(again, quotes not included)

I faced a similar requirement before. I solved it by hardcoding those names in synonyms_index.txt and using SynonymFilterFactory at index time.

synonyms_index.txt will contain:

    best buy, bestbuy
    walmart, wal mart
    Holiday Inn, HolidayInn

The analyzers for the field type:

    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>

With expand="true", a document containing any one form on a line is indexed under all the forms on that line, so the query side does not need a synonym filter.

Since the Solr wiki [1] advises using index-time synonyms when dealing with multi-word synonyms, I am using index-time synonym expansion only.

[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

When working with StandardAnalyzer, wal-mart is broken into two tokens: wal and mart. So you don't need to write the hyphenated forms of the words in synonyms_index.txt.

If all of your examples were similar to HolidayInn, you could use solr.WordDelimiterFilterFactory (without writing all these company names to a file), but you can't handle "wal mart" and "walmart" with it.

Hope this helps.
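
P.S. For the HolidayInn-style cases only, here is a rough, untested sketch of what a WordDelimiterFilterFactory setup might look like. The field type name text_company is made up for illustration, and the choice of WhitespaceTokenizerFactory and the attribute values are my assumptions, loosely modeled on the stock example schema; please check them against your Solr version.

```xml
<!-- Hypothetical field type; the name "text_company" is an assumption. -->
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Split "HolidayInn" on the case change into "Holiday" and "Inn",
         and also index the joined form via catenateWords. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1" />
    <!-- Lowercase only after splitting, otherwise the case change is lost. -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- Same splitting at query time, but without the joined form. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Note the filter order: LowerCaseFilterFactory has to come after WordDelimiterFilterFactory here, because lowercasing first would remove the capital letter the split depends on. As said above, this does nothing for "wal mart" vs "walmart", since there is no case change or delimiter to split on; those still need the synonym file.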