RE: stemming filter analyzers, any favorites?

Robert Petersen Wed, 20 Apr 2011 14:02:32 -0700

I have been doing that.  For the Bags example, the trailing 's' is not being 
removed by KStemmer, so if I index the word bags and search on bag I get no 
matches.  Why wouldn't the trailing 's' get stemmed off?  KStemmer is 
dictionary based, so is bags perhaps not in its dictionary?  Shouldn't that 
trailing 's' always be dropped?  That seems like it would be better; we don't 
want to make synonyms for basic use cases like this.  I fear I will have to 
return to the Porter stemmer.  My main question: are there other, better 
stemmers?
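
To make the "always drop the trailing 's'" idea concrete, here is a toy Python 
sketch.  It is not KStem's or Porter's actual logic: naive_strip_s, 
dictionary_strip_s and the tiny DICTIONARY set are hypothetical names made up 
only to show the trade-off between unconditional stripping and a dictionary 
check.

```python
# Toy illustration only: neither function reflects a real stemmer's rules.
DICTIONARY = {"bag", "bus", "glass"}  # hypothetical word list


def naive_strip_s(term: str) -> str:
    """Drop one trailing 's' unconditionally."""
    if term.endswith("s") and len(term) > 1:
        return term[:-1]
    return term


def dictionary_strip_s(term: str) -> str:
    """Drop a trailing 's' only when the result is a known word."""
    if term in DICTIONARY:
        # Already a known word (e.g. "bus"), so leave it alone.
        return term
    candidate = naive_strip_s(term)
    return candidate if candidate in DICTIONARY else term


if __name__ == "__main__":
    for word in ("bags", "bus", "glass", "totes"):
        print(word, naive_strip_s(word), dictionary_strip_s(word))
```

In this toy, unconditional stripping handles bags but mangles bus and glass, 
while the dictionary version is safe on those and leaves totes untouched 
because tote is missing from the list.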


Off-topic secondary question: sometimes I am puzzled by the output of the 
analysis page.  It looks like there should be a match, but an actual search 
does not return the results I'd expect.

For example, when the WordDelimiterFilterFactory splits a term into several 
terms before the K-stemmer is applied, the matching term sometimes ends up in 
position 2 of the final analysis output.  If the searcher enters just that 
partial term alone, it is in position 1 of the query analysis, and the search 
returns no match.  Am I reading this correctly?  Is that expected, or should 
it match and I am misreading my analysis output?
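
Here is a small Python sketch of how I understand position matching, in case 
it helps someone confirm or correct me.  It is a toy model, not Lucene code: 
the token lists and all three functions are hypothetical, and it assumes that 
a single-term query ignores position entirely while a phrase query only 
compares positions relative to each other.

```python
# Toy model of position-aware matching; not Lucene's implementation.
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (term text, position)


def build_index(tokens: List[Token]) -> Dict[str, List[int]]:
    """Map each term to the list of positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for text, pos in tokens:
        index.setdefault(text, []).append(pos)
    return index


def term_match(index: Dict[str, List[int]], term: str) -> bool:
    """Single-term query: position is ignored (assumption of this model)."""
    return term in index


def phrase_match(index: Dict[str, List[int]], query: List[Token]) -> bool:
    """Phrase query: every query term must occur at the same relative offset."""
    if not query:
        return False
    first_text, first_pos = query[0]
    for start in index.get(first_text, []):
        offset = start - first_pos
        if all((pos + offset) in index.get(text, []) for text, pos in query):
            return True
    return False


# Hypothetical index-time tokens for one split term; positions are made up.
indexed = build_index([("tote", 1), ("bag", 2), ("totebag", 2)])

if __name__ == "__main__":
    print(term_match(indexed, "bag"))
    print(phrase_match(indexed, [("tote", 1), ("bag", 2)]))
    print(phrase_match(indexed, [("bag", 1), ("tote", 2)]))
```

In this model a lone bag at query position 1 still matches bag indexed at 
position 2; only a multi-token query treated as a phrase is sensitive to 
position.  Whether Solr behaves the same way in my setup is exactly what I am 
asking.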

Thanks!

Robi

PS  I have a category named Bags and am catching flak for it not coming up in 
a search for bag.  Hah.
PPS  The term is not in protwords.txt.


com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position   1
term text       bags
term type       word
source start,end        0,4
payload         


-----Original Message-----
From: Erick Erickson [mailto:[email protected]] 
Sent: Wednesday, April 20, 2011 10:55 AM
To: [email protected]
Subject: Re: stemming filter analyzers, any favorites?

You can get a better sense of exactly what transformations occur, and
when, if you look at the analysis page (be sure to check the "verbose"
checkbox).

I'm surprised that "bags" doesn't match "bag", what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen <[email protected]> wrote:
> Stemming filter analyzers... anyone have any favorites for particular
> search domains?  Just wondering what people are using.  I'm using Lucid
> K Stemmer and having issues.  It seems to miss a lot of common stems.
> We went to that because of excessively loose matches on the
> solr.PorterStemFilterFactory.
>
>
> I understand K Stemmer is a dictionary-based stemmer.  Seems to me like
> it is missing a lot of common stem reductions, e.g. Bags does not match
> Bag in our searches.
>
> Here is my analyzer stack:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="1"
>             catenateNumbers="1"
>             catenateAll="1"
>             preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- The LucidKStemmer currently requires a lowercase filter somewhere
>          before it. -->
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="1"
>             catenateNumbers="1"
>             catenateAll="1"
>             preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- The LucidKStemmer currently requires a lowercase filter somewhere
>          before it. -->
>     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
>             protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
