Thanks, KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping the unstemmed word as a token, though. I know this would increase the index size somewhat, but what other downsides would there be? It just seems less destructive, since I would always store both the unstemmed and the stemmed version. If the unstemmed version is not stored, there is no way to go back without reindexing. If I wanted to implement this, I'm assuming a custom tokenizer would be the most appropriate place? Does something like this already exist?
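
To make the question concrete, here is a rough Python sketch of the two behaviours I have in mind: skipping the stemmer for a protected word list, and emitting the unstemmed word alongside its stem at the same position. This is only an illustration, not Solr code. `toy_stem`, `analyze`, and `keep_original` are made-up names, and `toy_stem` is a crude suffix stripper standing in for Porter or KStem.

```python
from typing import List, Set, Tuple


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(
    text: str, protected: Set[str], keep_original: bool = False
) -> List[Tuple[int, str]]:
    """Return (position, token) pairs for a whitespace-tokenized text.

    Words in `protected` are never stemmed. With `keep_original=True`,
    the unstemmed word is emitted at the same position as its stem.
    """
    tokens: List[Tuple[int, str]] = []
    for position, raw in enumerate(text.split()):
        word = raw.lower()
        if word in protected:
            # Protected words bypass the stemmer entirely.
            tokens.append((position, word))
            continue
        stem = toy_stem(word)
        if keep_original and stem != word:
            # Emit the original first, then the stem, at one position.
            tokens.append((position, word))
        tokens.append((position, stem))
    return tokens


# Protected list: "ahmed" is left alone, "searched" is stemmed.
print(analyze("Ahmed searched", {"ahmed"}))
# Dual tokens: both the original and the stem are kept.
print(analyze("Ahmed searched", set(), keep_original=True))
```

With the second call, both "ahmed" and "ahm" end up at position 0, which is the dual-token case I was asking about. The cost is roughly one extra token per stemmed word.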

On Thu, Mar 8, 2012 at 8:36 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> I was previously using the PorterStemmer to do stemming and ran into
>> an issue where it was overly aggressive with some words or
>> abbreviations which I needed to stop. I have recently switched to
>> KStem and I believe the issue is less, but I was wondering still if
>> there was a way to set a number of stop words for which you didn't
>> want stemming to occur or if there was a way to tell the Stemmer to
>> store the unstemmed version as well. So for instance if a query came
>> in for "Ahmed", the PorterStemmer would turn that into Ahm, while in
>> this case Ahmed is a name and I want to search that unstemmed. If
>> there was a stop word list I could attempt to compile a list of words
>> I didn't want stem or if there was a way to say also say create a
>> token for the unstemmed word so what went into the index for Ahmed
>> would be "ahmed" "ahm" so we'd cover both cases. What are the draw
>> backs of providing both.
>
> StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for
> these kind of purposes.
> http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming