A good starting place might be the list of stemming errors for the original Porter stemmer in this article that describes k-stem:
Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, United States: ACM. doi:10.1145/160688.160718

I don't know if the current porter stemmer is different. I do see that on the snowball page there is a porter and a porter2 stemmer and this explanation is linked from the porter2 stemmer page:
http://snowball.tartarus.org/algorithms/english/stemmer.html

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Friday, July 30, 2010 4:42 PM
To: solr-user@lucene.apache.org
Subject: Good list of English words that get "butchered" by Porter Stemmer

Hello,

I'm looking for a list of English words that, when stemmed by Porter stemmer, end up in the same stem as some similar, but unrelated words. Below are some examples:

# this gets stemmed to "iron", so if you search for "ironic", you'll get "iron" matches
ironic

# same stem as animal
anime
animated
animation
animations

I imagine such a list could be added to the example protwords.txt

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/