Hmmmm....
A couple of things.
1> Have you looked at alternate stemmers? The Porter stemmer is rather
aggressive. Perhaps a less aggressive stemmer would suit your
internal users.
2> Try a few things, but if you can't solve it reasonably quickly,
go back to your internal customer and explain the costs of
fixing this. Really. You're jumping through hoops because
results "did not please my internal customer". Can they
quantify their objections? Or is this just looking at the
results for random searches and guessing at relevance?
If the latter, you really, really, really need to get them to
quantify their objections and I bet you'll find that they can't.
And you'll forever be trying to tweak results to match
how they feel about them today. Which will be different from
how they felt about *the exact same results* yesterday.
You can go around this loop forever.
We've (programmers in general) historically done a rather poor
job of laying out the *costs* of fixing things to
suit a customer and allowing the various stakeholders
to make rational decisions. We say "Sure, that can be done"
and leave out "but it will take a month during which we won't
be able to do X, Y, or Z, and it will require more hardware".
There, rant done....
3> I suppose you could think about writing your own filter that
adds both the original token and the stemmed token.
Something like the SynonymFilter but instead of alternate
versions of the word, you'd have the stemmed version
and the original version at the same position. Or maybe
you have the stemmed version and then the original
version with a special ending character (say $) appended.
Then you'd have to write a query-time
analysis chain (or a query parser?) that
knew enough to use the stemmed or original word (plus $)
in the query. But I admit I haven't thought this through
at all. There'd have to be some parameter you passed
through with the query that controlled whether the
regular stemming process happened or not... And I
don't know offhand how that'd work.
Or reverse that. Append $ to all the stemmed variants.
But really, before going there (which I admit would be more
fun than arguing with your customer), try one of the less
aggressive stemmers. Or see if your other stakeholders
would be better served by not stemming at all. Or....
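
For what it's worth, here's a rough sketch of the idea in 3>, just to
make it concrete. It's plain Python, not Lucene/Solr code, and
everything in it is an assumption for illustration: toy_stem is a
made-up stand-in for a real stemmer, and the stem flag plays the role
of the per-query parameter I mentioned. The original form gets the $
marker, the stemmed form doesn't.

```python
import re

MARK = "$"  # marker appended to the original (unstemmed) form


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_tokens(text):
    """Index-time analysis: two terms at every position."""
    out = []
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        out.append((pos, toy_stem(word)))  # stemmed form
        out.append((pos, word + MARK))     # original form, marked with $
    return out


def query_terms(query, stem=True):
    """Query-time analysis: pick the stemmed or the marked original form."""
    words = re.findall(r"\w+", query.lower())
    return [toy_stem(w) if stem else w + MARK for w in words]


def search(docs, query, stem=True):
    """Return ids of docs containing every query term (toy AND search)."""
    wanted = set(query_terms(query, stem))
    hits = []
    for doc_id, text in docs.items():
        terms = {term for _, term in index_tokens(text)}
        if wanted <= terms:
            hits.append(doc_id)
    return sorted(hits)


docs = {1: "Printing presses", 2: "He prints books", 3: "A print shop"}
stemmed_hits = search(docs, "print", stem=True)   # matches all forms
exact_hits = search(docs, "print", stem=False)    # matches "print" only
```

Note that this puts two terms at every position, so the index grows,
which cuts against the storage item on your wish list.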
Best
Erick
On Fri, Oct 14, 2011 at 3:22 AM, Victor <[email protected]> wrote:
> Hi Erick,
>
> I work for a very big library and we store huge amounts of data. Indexing
> some of our collections can take days and the index files can get very big.
> We are a non-profit organisation, so we want to provide maximum service to
> our customers but at the same time we are bound to a fixed budget and want
> to keep costs as low as possible (including disk space). Our customers vary
> from academic people that want to do very precise searches to common users
> who want to search in a more general way. The library now wants to implement
> some form of stemming, but we have had one demo in the past with a stemmer
> that returned results that did not please my internal customer (another
> department).
>
> So my wish list looks like this:
>
> 1) Implement stemming
> 2) Give the end user the possibility to turn stemming on or off for their
> searches
> 3) Have maximum control over the stemmer without the need to reindex if we
> change something there
> 4) Prevent the need for more storage (to keep the operations people happy)
>
> So far I have been able to satisfy 1, 2 and 3. I am using a synonyms list at
> query time to apply my stemming. The synonym list I build as follows:
>
> a) load a library (a text file with 1 word per line)
> b) remove stop words from the list
> c) link words that have the same stem
>
> Bullet c) is a little bit more sophisticated, because I do not link words
> that are already part of a pre-defined synonym list that contains
> exceptions.
>
> All this I do to keep maximum control over the behaviour of the stemmer.
> Since this is a demo and it will be used to convince other people in my
> organisation that stemming could be worth implementing, I need to be able to
> adjust its behaviour quickly.
>
> So far processing speed has not been an issue, but disk storage has.
> Generally, at index time we remove as few tokens as possible and our objects
> are complete books, newspapers (from 1618 until 1995), etc. So you can
> imagine that our indexes get very, very big.
>