Thank you for taking the time to help. With the word delimiter filter in
my index analyzer set up for only one pass, "wolf-biederman" results in
wolf, biederman, wolfbiederman, and wolf-biederman. With two passes, the
last of those is no longer present. One pass changes "gremlin's" to
gremlin and gremlin's; two passes result in gremlin and gremlins.
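For reference, by "one pass" I mean a single WordDelimiterFilterFactory
in the index chain, along these lines (attribute values are illustrative,
not necessarily my exact settings):

    <!-- illustrative single-pass setup: generateWordParts produces
         wolf/biederman, catenateWords produces wolfbiederman, and
         preserveOriginal keeps wolf-biederman itself -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            splitOnCaseChange="1"
            preserveOriginal="1"/>

"Two passes" means a second instance of the same filter later in the
chain, which is what re-splits the tokens the first pass left intact.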
I was trying to use PatternReplaceCharFilterFactory to strip leading
and trailing punctuation, but it didn't work. It seems that charFilters
are applied before the tokenizer, so the pattern runs against the raw
field value rather than individual tokens, which won't produce the
results I want, and the pattern I'd come up with was eating everything
and producing no output. I later realized that even if I solved those
problems, the approach would not work with radically different character
sets like Arabic and Cyrillic. Is there a regular (token) filter that
could strip leading/trailing punctuation?
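What I was experimenting with was roughly along these lines (a rough
reconstruction, not the exact pattern I used):

    <!-- applied to the raw character stream before tokenization, so the
         anchors match the boundaries of the whole field value rather
         than the edges of each word -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^\p{Punct}+|\p{Punct}+$"
                replacement=""/>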
As for stemming, we have no effective way to separate the languages.
Most of the content is English, but we also have Spanish, Arabic,
Russian, German, French, and possibly a few others. For that reason,
I'm not using stemming. I've been thinking that I might want to use an
English stemmer anyway to improve results on most of the content, but I
haven't done any testing yet.
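If I do test it, I expect the change would just be appending one filter
to the end of the index and query chains, something like this
(untested):

    <!-- untested: English-only stemming applied to all content,
         including the non-English documents -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>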
Thanks,
Shawn
On 8/29/2010 12:28 PM, Erick Erickson wrote:
Look at the tokenizer/filter chain that makes up your analyzers, and see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for other tokenizer/analyzer/filter options.
You're on the right track looking at the various choices provided, and
I suspect you'll find what you need...
Be a little cautious about preserving things. Your users will often be more
confused than helped if you require hyphens for a match. Ditto with
possessives, plurals, etc. You might want to look at stemmers....