Thank you for taking the time to help. With the word delimiter filter set up in my index analyzer for only one pass, "wolf-biederman" results in wolf, biederman, wolfbiederman, and wolf-biederman. With two passes, that last term is no longer present. One pass turns "gremlin's" into gremlin and gremlin's; two passes produces gremlin and gremlins.
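
To illustrate what I mean by one pass, the filter definition looks something like this (the attribute values here are only an illustration, not necessarily my exact settings):

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>

By "two passes" I mean repeating that <filter> line a second time in the chain (roughly speaking, without preserveOriginal on the second pass), so the preserved hyphenated and possessive originals get split again.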

I was trying to use PatternReplaceCharFilterFactory to strip leading and trailing punctuation, but it didn't work. It seems that charFilters are applied before the tokenizer even runs, which won't produce the results I want, and the pattern I'd come up with was eating everything and producing no results. I also realized that even if I solved those problems, the approach would not cope with radically different character sets like Arabic and Cyrillic. Is there a regular (post-tokenizer) filter that could strip leading/trailing punctuation?
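
For reference, this is the ordering I'm talking about: the charFilter runs on the raw input before the tokenizer ever sees it, while regular filters run on the individual tokens afterwards. A sketch of the kind of thing I was attempting (the regexes here are illustrative, not my exact patterns), with a token-level PatternReplaceFilterFactory shown where a "regular" filter would sit if that's a viable approach:

    <analyzer type="index">
      <!-- charFilter: applied to the raw text BEFORE tokenizing -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="\p{Punct}+"
                  replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- token filters: applied to each token AFTER tokenizing -->
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="^\p{Punct}+|\p{Punct}+$"
              replacement=""
              replace="all"/>
    </analyzer>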

As for stemming, we have no effective way to separate the languages. Most of the content is English, but we also have Spanish, Arabic, Russian, German, French, and possibly a few others. For that reason, I'm not using stemming. I've been thinking that I might want to use an English stemmer anyway to improve results on most of the content, but I haven't done any testing yet.
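
If I do end up testing an English stemmer, I imagine it would just be one more filter at the end of the chain, along these lines (again, only a sketch; the Snowball English stemmer is simply the one I'd likely try first):

    <filter class="solr.SnowballPorterFilterFactory" language="English"/>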

Thanks,
Shawn


On 8/29/2010 12:28 PM, Erick Erickson wrote:
Look at the tokenizer/filter chain that makes up your analyzers, and see:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

for other tokenizer/analyzer/filter options.

You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...

Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers....
