Ahmet,

Thanks a lot. Your suggestion was really helpful. I had tried using synonyms before and for some reason it didn't work, but this time around it did.

On 09/11/2009 02:55 AM, AHMET ARSLAN wrote:
There are a lot of company names whose
correct spelling people are uncertain about. A few
examples are:
1. best buy, bestbuy
2. walmart, wal mart, wal-mart
3. Holiday Inn, HolidayInn

What TokenizerFactory and/or TokenFilterFactory should I
use so that somebody typing "wal mart" (quotes not included)
will find "wal mart" and "walmart" (again, quotes not
included)?
I faced a similar requirement before. I solved it by hardcoding those names in 
synonyms_index.txt and using SynonymFilterFactory at index time.

synonyms_index.txt will contain:

best buy, bestbuy
walmart, wal mart
Holiday Inn, HolidayInn

<analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
</analyzer>
<analyzer type="query">
   <tokenizer class="solr.StandardTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

Since the Solr wiki [1] advises using index-time synonyms when dealing with 
multi-word synonyms, I am using index-time synonym expansion only.

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

When working with StandardAnalyzer, wal-mart is broken into two tokens: wal and 
mart. So you don't need to write the hyphenated forms of the words in synonyms_index.txt.


If all of your examples were similar to HolidayInn, you could use solr.WordDelimiterFilterFactory 
(without writing all these company names to a file), but you can't handle "wal mart" and 
"walmart" with it.

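As a rough sketch of that WordDelimiterFilterFactory alternative, for the HolidayInn-style 
cases only (the attribute values below are my assumptions and are untested against your schema):

<analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory" />
   <!-- runs before lowercasing so the case change in HolidayInn is still visible -->
   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1" />
   <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
   <tokenizer class="solr.StandardTokenizerFactory" />
   <!-- split only at query time; the joined form is already in the index -->
   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" splitOnCaseChange="1" />
   <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

With this, HolidayInn should be indexed as holiday, inn and holidayinn, so a query for either 
spelling should match. It does nothing for "walmart", which has no case change or delimiter to split on.
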
Hope this helps.


