> There are a lot of company names whose correct spelling
> people are uncertain about. A few
> examples are:
> 1. best buy, bestbuy
> 2. walmart, wal mart, wal-mart
> 3. Holiday Inn, HolidayInn
> 
> What Tokenizer Factory and/or TokenFilterFactory should I
> use so that somebody typing "wal mart"(quotes not included)
> will find "wal mart" and "walmart"(again, quotes not
> included)

I faced a similar requirement before. I solved it by hardcoding those names in 
synonyms_index.txt and using SynonymFilterFactory at index time.

synonyms_index.txt will contain:

best buy, bestbuy
walmart, wal mart
Holiday Inn, HolidayInn

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt"
          ignoreCase="true" expand="true" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

Since the Solr wiki [1] advises using index-time synonyms when dealing with 
multi-word synonyms, I am using index-time synonym expansion only.

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

When working with StandardAnalyzer, wal-mart is broken into two tokens: wal and 
mart. So you don't need to write the hyphenated forms of the words in 
synonyms_index.txt.


If all of your examples were similar to HolidayInn, you could use 
solr.WordDelimiterFilterFactory (without writing all these company names to a 
file), but you can't handle "wal mart" and "walmart" with it.
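
For the HolidayInn-style cases only, a rough, untested sketch of the 
WordDelimiterFilterFactory alternative could look like the following. The 
attribute values are my assumptions, not something I have running. I also 
assumed WhitespaceTokenizerFactory and placed the filter before 
LowerCaseFilterFactory, since it can only split on case changes while the 
original case is still there.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Split on case changes (HolidayInn -> Holiday, Inn) and also
       index the joined form so both spellings are searchable. -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Same splitting at query time, without the joined form. -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="0" splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
```

A query for walmart has no case change or delimiter to split on, so it stays 
a single token and still will not match a document containing "wal mart".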

Hope this helps.


      
