Hi Erick, I work for a very big library and we store huge amounts of data. Indexing some of our collections can take days and the index files can get very big. We are a non-profit organisation, so we want to provide maximum service to our customers, but at the same time we are bound to a fixed budget and want to keep costs (including disk space) as low as possible. Our customers vary from academics who want to do very precise searches to general users who want to search in a broader way. The library now wants to implement some form of stemming, but in the past we had one demo with a stemmer whose results did not please my internal customer (another department).
So my wish list looks like this:
1) Implement stemming
2) Give the end user the possibility to turn stemming on or off for their searches
3) Have maximum control over the stemmer, without the need to reindex if we change something there
4) Avoid the need for more storage (to keep the operations people happy)

So far I have been able to satisfy 1, 2 and 3. I am using a synonym list at query time to apply my stemming. I build the synonym list as follows:
a) load a word library (a text file with 1 word per line)
b) remove stop words from the list
c) link words that have the same stem

Bullet c) is a little bit more sophisticated, because I do not link words that are already part of a pre-defined synonym list that contains exceptions. I do all this to keep maximum control over the behaviour of the stemmer. Since this is a demo that will be used to convince other people in my organisation that stemming could be worth implementing, I need to be able to adjust its behaviour quickly.

So far processing speed has not been an issue, but disk storage has. Generally, at index time we remove as few tokens as possible, and our objects are complete books, newspapers (from 1618 until 1995), etc. So you can imagine that our indexes get very, very big.
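To make steps a) to c) concrete, here is a rough sketch of how such a synonym list could be generated. It is not my actual code: `toy_stem` is a hypothetical stand-in for a real stemmer, the function and parameter names are made up for illustration, and I am assuming the output is written as comma-separated groups of equivalent words, one group per line, as in a Solr synonyms file.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_synonyms(words, stop_words, exceptions, stem=toy_stem):
    """Group words by stem and return one comma-separated line per group.

    words:      the word library (step a)
    stop_words: words removed from the list (step b)
    exceptions: words already covered by the pre-defined synonym list,
                which must not be linked by stem (refinement of step c)
    """
    groups = defaultdict(set)
    for word in words:
        word = word.strip().lower()
        if not word or word in stop_words or word in exceptions:
            continue
        groups[stem(word)].add(word)

    # A stem with a single word needs no synonym entry.
    return [
        ", ".join(sorted(members))
        for _, members in sorted(groups.items())
        if len(members) > 1
    ]


if __name__ == "__main__":
    library = ["walk", "walks", "walked", "walking", "the", "news", "new", "book", "books"]
    for line in build_stem_synonyms(library, stop_words={"the"}, exceptions={"news"}):
        print(line)
```

With this sample input, "news" stays out of the "new" group because it is listed as an exception, so changing the stemmer's behaviour only means regenerating the file, not reindexing.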