Re: Stemmer Question

Jamie Johnson Thu, 08 Mar 2012 07:41:15 -0800

Thanks, KeywordMarkerFilterFactory seems to be what I was looking
for.  I'm still wondering about keeping the unstemmed word as a token
as well, though.  I know this would increase the index size slightly,
but what would the other drawbacks be?  It seems less destructive,
since the index would always hold both the unstemmed and the stemmed
version.  Without the unstemmed version there is no way to go back
without reindexing.  If I wanted to implement this, I'm assuming a
custom tokenizer would be most appropriate?  Does something like this
already exist?
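
To make the idea concrete, here is a minimal sketch in plain Python
(not Solr or Lucene code) of what I have in mind: each word is emitted
unstemmed, and the stem is added as a second token at the same
position unless the word is on a protected list.  The names Token,
toy_stem and keep_original_and_stem are made up for illustration, and
toy_stem is only a crude stand-in for a real stemmer such as Porter or
KStem.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    text: str
    position_increment: int  # 0 means "same position as the previous token"


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keep_original_and_stem(
    words: Iterable[str],
    stem: Callable[[str], str] = toy_stem,
    protected: frozenset = frozenset(),
) -> Iterator[Token]:
    """Yield each lowercased word, plus its stem at the same position."""
    for word in words:
        lowered = word.lower()
        # The unstemmed form is always kept.
        yield Token(lowered, 1)
        # Protected words (names, abbreviations) are never stemmed.
        if lowered in protected:
            continue
        stemmed = stem(lowered)
        # Skip the second token when stemming changed nothing.
        if stemmed != lowered:
            yield Token(stemmed, 0)


if __name__ == "__main__":
    for token in keep_original_and_stem(["Ahmed", "searching"]):
        print(token)
```

With this toy stemmer, "Ahmed" yields both "ahmed" and "ahm" at one
position, so a query for either form would match.  Passing
protected=frozenset({"ahmed"}) suppresses the "ahm" token, which is
roughly the effect I understand the protected-words approach to have.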


On Thu, Mar 8, 2012 at 8:36 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> I was previously using the PorterStemmer to do stemming and ran into
>> an issue where it was overly aggressive with some words or
>> abbreviations which I needed to stop.  I have recently switched to
>> KStem and I believe the issue is less, but I was wondering still if
>> there was a way to set a number of stop words for which you didn't
>> want stemming to occur or if there was a way to tell the Stemmer to
>> store the unstemmed version as well.  So for instance if a query came
>> in for "Ahmed", the PorterStemmer would turn that into Ahm, while in
>> this case Ahmed is a name and I want to search that unstemmed.  If
>> there was a stop word list I could attempt to compile a list of words
>> I didn't want stemmed, or if there was a way to also create a token
>> for the unstemmed word, so what went into the index for Ahmed would
>> be "ahmed" "ahm", we'd cover both cases.  What are the drawbacks of
>> providing both?
>
> StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for
> this kind of purpose.
> http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming
