Hmmmm.... A couple of things.

1> Have you looked at alternate stemmers? The Porter stemmer is rather aggressive. Perhaps a less aggressive stemmer would suit your internal users.

2> Try a few things, but if you can't solve it reasonably quickly, go back to your internal customer and explain the costs of fixing this. Really. You're jumping through hoops because results "did not please my internal customer". Can they quantify their objections? Or is this just looking at the results for random searches and guessing at relevance? If the latter, you really, really, really need to get them to quantify their objections, and I bet you'll find that they can't. And you'll forever be trying to tweak results to please how they feel about it today. Which will be different from how they felt about *the exact same results* yesterday. You can go around this loop forever.
We've (programmers in general) historically done a rather poor job of laying out the *costs* of fixing things to suit a customer and allowing the various stake-holders to make rational decisions. We say "Sure, that can be done" and leave out "but it will take a month during which we won't be able to do X, Y, or Z, and it requires more hardware". There, rant done....

3> I suppose you could think about writing your own filter that adds both the original token and the stemmed token. Something like the SynonymFilter, but instead of alternate versions of the word, you'd have the stemmed version and the original version at the same position. Or maybe you have the stemmed version and then the original version with a special ending character (say $) appended. Then you'd have to write a query-time analysis chain (or a query parser?) that somehow knew enough to use either the stemmed word or the original word (plus $) in the query. But I admit I haven't thought this through at all. There'd have to be some parameter you passed through with the query that controlled whether the regular stemming process happened or not... And I don't know offhand how that'd work.

Or reverse that. Append $ to all the stemmed variants.

But really, before going there (which I admit would be more fun than arguing with your customer), try one of the less aggressive stemmers. Or see if your other stake-holders would be better served by not stemming at all. Or....

Best
Erick

On Fri, Oct 14, 2011 at 3:22 AM, Victor <scanner...@yahoo.co.uk> wrote:
> Hi Erick,
>
> I work for a very big library and we store huge amounts of data. Indexing
> some of our collections can take days and the index files can get very big.
> We are a non-profit organisation, so we want to provide maximum service to
> our customers but at the same time we are bound to a fixed budget and want
> to keep costs as low as possible (including disk space).
> Our customers vary
> from academic people that want to do very precise searches to common users
> who want to search in a more general way. The library now wants to implement
> some form of stemming, but we have had one demo in the past with a stemmer
> that returned results that did not please my internal customer (another
> department).
>
> So my wish list looks like this:
>
> 1) Implement stemming
> 2) Give the end user the possibility to turn stemming on or off for their
> searches
> 3) Have maximum control over the stemmer without the need to reindex if we
> change something there
> 4) Prevent the need for more storage (to keep the operations people happy)
>
> So far I have been able to satisfy 1, 2 and 3. I am using a synonyms list at
> query time to apply my stemming. The synonym list I build as follows:
>
> a) load a library (a text file with 1 word per line)
> b) remove stop words from the list
> c) link words that have the same stem
>
> Bullet c) is a little bit more sophisticated, because I do not link words
> that are already part of a pre-defined synonym list that contains
> exceptions.
>
> All this I do to keep maximum control over the behaviour of the stemmer.
> Since this is a demo and it will be used to convince other people in my
> organisation that stemming could be worth implementing, I need to be able to
> adjust its behaviour quickly.
>
> So far processing speed has not been an issue, but disk storage has.
> Generally, at index time we remove as few tokens as possible and our objects
> are complete books, newspapers (from 1618 until 1995), etc. So you can
> imagine that our indexes get very, very big.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multiple-search-analyzers-on-the-same-field-type-possible-tp3417898p3420946.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
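
To make the idea in point 3> a bit more concrete, here is a rough sketch in plain Python. It is not Lucene or Solr code; it only simulates the token stream. Everything named here (toy_stem, index_tokens, query_token, search, and the use_stemming flag) is made up for illustration, and toy_stem is a crude stand-in for a real stemmer.

```python
MARKER = "$"  # assumed marker character for the unstemmed form


def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_tokens(text, stem=toy_stem):
    """Emit two tokens per word at the same position:
    the stemmed form, and the original form with MARKER appended."""
    tokens = []
    for position, word in enumerate(text.lower().split()):
        tokens.append((position, stem(word)))
        tokens.append((position, word + MARKER))
    return tokens


def query_token(word, use_stemming, stem=toy_stem):
    """Pick the form to look up, based on a per-query switch."""
    word = word.lower()
    return stem(word) if use_stemming else word + MARKER


def search(index, word, use_stemming):
    """Return the positions whose tokens match the chosen query form."""
    target = query_token(word, use_stemming)
    return sorted({pos for pos, tok in index if tok == target})


index = index_tokens("Searching old searches")
print(search(index, "search", use_stemming=True))     # [0, 2]
print(search(index, "searches", use_stemming=False))  # [2]
print(search(index, "search", use_stemming=False))    # []
```

Note that every word is indexed twice in this scheme, so the on/off switch is paid for in index size, which works against item 4 on the wish list.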
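
For comparison, steps a) to c) in the quoted message might look roughly like the following. This is only a guess at the shape of that process, not the actual script: build_stem_groups and toy_stem are hypothetical names, the stemmer is a toy, and the output is assumed to be comma-separated groups of equivalent words (one group per line), since the message does not say what the synonym file looks like.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_groups(words, stop_words, exceptions, stem=toy_stem):
    """Group words by stem, skipping stop words and any word already
    covered by the pre-defined exception list."""
    groups = defaultdict(set)
    for raw in words:
        word = raw.strip().lower()
        if not word or word in stop_words or word in exceptions:
            continue
        groups[stem(word)].add(word)
    # Only stems shared by two or more words produce a synonym line.
    return [
        ", ".join(sorted(members))
        for _, members in sorted(groups.items())
        if len(members) > 1
    ]


words = ["search", "searches", "searching", "the", "book", "books", "news", "new"]
lines = build_stem_groups(words, stop_words={"the"}, exceptions={"news"})
print(lines)  # ['book, books', 'search, searches, searching']
```

Because the grouping happens outside the index, swapping toy_stem for a different stemmer or editing the exception set only means regenerating the list, which matches the "no reindex" goal in item 3.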