Hi,

You might also want to check out the new Lucene-Hunspell stemmer at
http://code.google.com/p/lucene-hunspell/

It uses OpenOffice dictionaries with known stems in combination with a large set of language-specific rules. It handles your example, but it is an early release, so test it thoroughly before deploying in production :)
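To make the "known stems plus rules" idea concrete, here is a minimal Python sketch. It is not the Lucene-Hunspell code: the word list and suffix rules are made-up toy data, and `stem` is a hypothetical helper. The point is only that a candidate stem is accepted when it appears in the dictionary, so "president" is never cut down to the stem it would share with "preside".

```python
# Toy illustration of dictionary-validated stemming.
# KNOWN_STEMS and SUFFIX_RULES are hypothetical sample data, not taken
# from any real OpenOffice/Hunspell dictionary.
KNOWN_STEMS = {"president", "preside", "search", "document"}

# (suffix to strip, text to append in its place)
SUFFIX_RULES = [
    ("s", ""),
    ("es", ""),
    ("ed", ""),
    ("ed", "e"),
    ("ing", ""),
    ("ing", "e"),
]


def stem(word):
    """Return a dictionary stem for word, or the word itself if none is found."""
    word = word.lower()
    # A word that is already a known stem is left alone.
    if word in KNOWN_STEMS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the candidate only if the dictionary knows it.
            if candidate in KNOWN_STEMS:
                return candidate
    # No rule produced a known stem, so do not guess.
    return word


if __name__ == "__main__":
    for w in ["president", "presidents", "presided", "presiding"]:
        print(w, "->", stem(w))
```

With this toy data, "presidents" reduces to "president" and "presided" to "preside", and the two stems stay distinct, so a query for "preside" no longer matches a document that only contains "president".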
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 28. juni 2010, at 17.43, Joe Calderon wrote:

> the general consensus among people who run into the problem you have
> is to use a plurals only stemmer, a synonyms file or a combination of
> both (for irregular nouns etc)
>
> if you search the archives you can find info on a plurals stemmer
>
> On Mon, Jun 28, 2010 at 6:49 AM, <dar...@ontrenet.com> wrote:
>> Thanks for the tip. Yeah, I think the stemming confounds search results as
>> it stands (porter stemmer).
>>
>> I was also thinking of using my dictionary of 500,000 words with their
>> complete morphologies and conjugations and create a synonyms.txt to
>> provide english accurate morphology.
>>
>> Is this a good idea?
>>
>> Darren
>>
>>> Hi Darren,
>>>
>>> You might want to look at the KStemmer
>>> (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
>>> instead of the standard PorterStemmer. It essentially has a 'dictionary'
>>> of exception words where stemming stops if found, so in your case
>>> president won't be stemmed any further than president (but presidents will
>>> be stemmed to president). You will have to integrate it into solr
>>> yourself, but that's straightforward.
>>>
>>> HTH
>>> Brendan
>>>
>>>
>>> On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:
>>>
>>>> Hi,
>>>> It seems to me that because the stemming does not produce
>>>> grammatically correct stems in many of the cases,
>>>> search anomalies can occur like the one I am seeing where I have a
>>>> document with "president" in it and it is returned
>>>> when I search for "preside", a different word entirely.
>>>>
>>>> Is this correct or acceptable behavior? Previous discussions here on
>>>> stemming, I was told its ok as long as all the words reduce
>>>> to the same stem, but when different words reduce to the same stem it
>>>> seems to affect search results in a "bad way".
>>>>
>>>> Darren
>>>
>>>
>>
>>
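On the synonyms.txt idea quoted above: a rough Python sketch of how a morphology dictionary could be turned into explicit mappings in the Solr synonyms format (`form1, form2 => lemma`). The `morphology` data and the `synonym_lines` helper are hypothetical; the real 500,000-word dictionary would need its own loader, and the resulting file would still have to be wired into the analyzer chain in place of the Porter stemmer.

```python
# Sketch: build Solr-style explicit synonym mappings from a
# lemma -> inflected-forms dictionary. The data below is made-up sample
# input, not the dictionary discussed in the thread.
def synonym_lines(morphology):
    """Yield one 'form, form => lemma' line per lemma that has other forms."""
    for lemma in sorted(morphology):
        # Drop duplicates and the lemma itself; it needs no mapping.
        forms = sorted(set(morphology[lemma]) - {lemma})
        if forms:
            yield f"{', '.join(forms)} => {lemma}"


if __name__ == "__main__":
    morphology = {
        "president": ["presidents"],
        "preside": ["presides", "presided", "presiding"],
    }
    for line in synonym_lines(morphology):
        print(line)
```

For the sample data this prints one line mapping "presided, presides, presiding" to "preside" and one mapping "presidents" to "president", which keeps the two words apart while still matching their inflected forms.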