Ahmet,
Thanks a lot. Your suggestion was really helpful. I had tried synonyms
before and for some reason they didn't work, but this time they did.
On 09/11/2009 02:55 AM, AHMET ARSLAN wrote:
There are a lot of company names whose correct spelling
people are unsure of. A few examples are:
1. best buy, bestbuy
2. walmart, wal mart, wal-mart
3. Holiday Inn, HolidayInn
What TokenizerFactory and/or TokenFilterFactory should I
use so that somebody typing "wal mart" (quotes not included)
will find both "wal mart" and "walmart" (again, quotes not
included)?
I faced a similar requirement before. I solved it by hardcoding those names in
synonyms_index.txt and using SynonymFilterFactory at index time.
synonyms_index.txt will contain:
best buy, bestbuy
walmart, wal mart
Holiday Inn, HolidayInn
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
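For context, here is a rough sketch of where the index and query analyzers would sit in schema.xml. The names text_company and company_name are hypothetical, picked only for illustration, and positionIncrementGap="100" is just the value commonly seen in the example schema, not something stated in this thread.

```xml
<!-- Hypothetical field type; "text_company" is an illustrative name. -->
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <!-- the <analyzer type="index"> and <analyzer type="query"> elements go here -->
</fieldType>

<!-- Hypothetical field that uses the type above. -->
<field name="company_name" type="text_company" indexed="true" stored="true" />
```

Because the synonyms are applied at index time, existing documents would need to be reindexed after changing synonyms_index.txt.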
Since the Solr wiki [1] advises using index-time synonyms when dealing with
multi-word synonyms, I am using index-time synonym expansion only.
[1]
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
With StandardTokenizer (which StandardAnalyzer uses), wal-mart is broken into two
tokens: wal and mart. So you don't need to list the hyphenated forms in synonyms_index.txt.
If all of your examples were like HolidayInn, you could use solr.WordDelimiterFilterFactory
(without writing all these company names to a file), but it can't handle "wal mart" and
"walmart".
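For the HolidayInn kind of name, a minimal sketch of the WordDelimiterFilterFactory setup might look like the following. The tokenizer choice and the attribute values are assumptions, not something tested in this thread: the whitespace tokenizer is used so that the filter sees each word whole, and the filter comes before lowercasing so the case change is still visible.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Splits "HolidayInn" at the case change into "Holiday" and "Inn";
       catenateWords="1" also indexes the joined form "HolidayInn". -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <!-- Same splitting at query time, without adding the joined form. -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="0" splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>
```

With this, a document containing "HolidayInn" should match the queries holiday inn, HolidayInn and holidayinn. A document containing "Holiday Inn" still would not match holidayinn, which is the same limitation as with "wal mart" and "walmart".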
Hope this helps.