A good starting place might be the list of stemming errors for the original 
Porter stemmer in the article that describes k-stem (the Krovetz stemmer):

Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings 
of the 16th annual international ACM SIGIR conference on Research and 
development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, 
United States: ACM. doi:10.1145/160688.160718

I don't know whether the current Porter stemmer is different. I do see that 
the Snowball page lists both a porter and a porter2 stemmer, and this 
explanation is linked from the porter2 stemmer page: 
http://snowball.tartarus.org/algorithms/english/stemmer.html
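
If the published list turns out to be too short, one rough way to build your 
own is to run a word list through the stemmer and keep every stem that more 
than one word maps to. The sketch below is only an illustration: 
find_stem_collisions and toy_stem are hypothetical names, and toy_stem is a 
crude stand-in that strips a handful of suffixes, not the Porter algorithm. 
In practice you would pass in a real stemmer function instead (for example 
the stem method of NLTK's PorterStemmer, assuming NLTK is installed).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def find_stem_collisions(
    words: Iterable[str], stem: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group words by stem and keep only stems shared by 2+ distinct words."""
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[stem(lowered)].add(lowered)
    return {s: sorted(ws) for s, ws in groups.items() if len(ws) > 1}


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer (NOT the Porter algorithm).

    Strips the first matching suffix, provided at least 3 characters remain.
    """
    for suffix in ("ations", "ation", "ated", "al", "ic", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    sample = ["iron", "ironic", "anime", "animal", "animated",
              "animation", "animations", "search"]
    for stem, members in sorted(find_stem_collisions(sample, toy_stem).items()):
        print(stem, "<-", ", ".join(members))
```

The output is only a list of candidates: many collisions are legitimate 
(animation/animations), so someone still has to pick out the unrelated pairs 
by hand.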


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, July 30, 2010 4:42 PM
To: solr-user@lucene.apache.org
Subject: Good list of English words that get "butchered" by Porter Stemmer

Hello,

I'm looking for a list of English words that, when stemmed by the Porter 
stemmer, end up with the same stem as some similar but unrelated words. Below 
are some examples:

# this gets stemmed to "iron", so if you search for "ironic", you'll get "iron" 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt, so that 
those words are protected from stemming.

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
