Hi Erick,

I work for a very big library and we store huge amounts of data. Indexing
some of our collections can take days and the index files can get very big.
We are a non-profit organisation, so we want to provide maximum service to
our customers but at the same time we are bound to a fixed budget and want
to keep costs as low as possible (including disk space). Our customers range
from academics who want to run very precise searches to general users
who want to search more broadly. The library now wants to implement
some form of stemming, but we have had one demo in the past with a stemmer
that returned results that did not please my internal customer (another
department).

So my wish list looks like this:

1) Implement stemming
2) Give the end user the possibility to turn stemming on or off for their
searches
3) Have maximum control over the stemmer without the need to reindex if we
change something there
4) Prevent the need for more storage (to keep the operations people happy)

So far I have been able to satisfy 1, 2 and 3. I am using a synonym list at
query time to apply my stemming. I build the synonym list as follows:

a) load a word list (a text file with one word per line)
b) remove stop words from the list
c) link words that have the same stem

Step c) is slightly more involved, because I do not link words that
already appear in a predefined synonym list that contains
exceptions.
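
To make steps a) to c) concrete, here is a minimal sketch in Python. It is
not my actual code: toy_stem is a hypothetical placeholder that only strips a
few English suffixes, and build_synonym_lines and its parameter names are
made up for illustration. The output uses the comma-separated form of a Solr
synonyms file, where every word on a line is treated as equivalent.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical placeholder stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s", "ed"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_synonym_lines(words, stop_words, exceptions, stem=toy_stem):
    """Group words by stem and return one comma-separated line per group.

    words      -- iterable of words, e.g. the lines of the word list (step a)
    stop_words -- set of words to drop entirely (step b)
    exceptions -- set of words already covered by the predefined synonym
                  list; these are never linked by stem (step c)
    """
    groups = defaultdict(list)
    for raw in words:
        word = raw.strip().lower()
        if not word or word in stop_words or word in exceptions:
            continue
        key = stem(word)
        if word not in groups[key]:
            groups[key].append(word)
    # A stem with a single word needs no synonym line.
    return [
        ", ".join(sorted(group))
        for _, group in sorted(groups.items())
        if len(group) > 1
    ]


if __name__ == "__main__":
    sample = ["search", "searches", "searching", "searched",
              "the", "news", "book", "books"]
    for line in build_synonym_lines(sample, {"the"}, {"news"}):
        print(line)
```

Because the grouping happens outside the index, swapping toy_stem for another
stemmer or editing the exceptions only regenerates the synonym file; nothing
has to be reindexed.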

All this I do to keep maximum control over the behaviour of the stemmer.
Since this is a demo and it will be used to convince other people in my
organisation that stemming could be worth implementing, I need to be able to
adjust its behaviour quickly.

So far processing speed has not been an issue, but disk storage has.
Generally, at index time we remove as few tokens as possible, and our objects
are complete books, newspapers (from 1618 until 1995), etc. So you can
imagine that our indexes get very, very big.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-search-analyzers-on-the-same-field-type-possible-tp3417898p3420946.html
Sent from the Solr - User mailing list archive at Nabble.com.
