Thank you for taking the time to help. With the word delimiter filter in
my index analyzer set up for only one pass, "wolf-biederman" results in
wolf, biederman, wolfbiederman, and wolf-biederman. With two passes, the
last of those is no longer present. One pass changes "gremlin's" to
gremlin and gremlin's; two passes result in gremlin and gremlins.
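For reference, by "one pass" I mean a single WordDelimiterFilterFactory
in the index chain, along these lines (attribute values are illustrative,
not necessarily my exact settings):

    <!-- illustrative single-pass setup: generateWordParts produces
         wolf/biederman, catenateWords produces wolfbiederman, and
         preserveOriginal keeps wolf-biederman itself -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            splitOnCaseChange="1"
            preserveOriginal="1"/>

"Two passes" means a second instance of the same filter later in the
chain, which is what re-splits the tokens the first pass left intact.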
I was trying to use PatternReplaceCharFilterFactory to strip leading
and trailing punctuation, but it didn't work. It seems that charFilters
are applied before the tokenizer, so the pattern runs against the raw
field value rather than individual tokens, which won't produce the
results I want, and the pattern I'd come up with was eating everything
and producing no output. I later realized that even if I solved those
problems, the approach would not work with radically different character
sets like Arabic and Cyrillic. Is there a regular (token) filter that
could strip leading/trailing punctuation?
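What I was experimenting with was roughly along these lines (a rough
reconstruction, not the exact pattern I used):

    <!-- applied to the raw character stream before tokenization, so the
         anchors match the boundaries of the whole field value rather
         than the edges of each word -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^\p{Punct}+|\p{Punct}+$"
                replacement=""/>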
As for stemming, we have no effective way to separate the languages.
Most of the content is English, but we also have Spanish, Arabic,
Russian, German, French, and possibly a few others. For that reason,
I'm not using stemming. I've been thinking that I might want to use an
English stemmer anyway to improve results on most of the content, but I
haven't done any testing yet.
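If I do test it, I expect the change would just be appending one filter
to the end of the index and query chains, something like this
(untested):

    <!-- untested: English-only stemming applied to all content,
         including the non-English documents -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>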
Thanks,
Shawn
On 8/29/2010 12:28 PM, Erick Erickson wrote:
Look at the tokenizer/filter chain that makes up your analyzers, and see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for other tokenizer/analyzer/filter options.
You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...
Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers....