Unstemmed searching

Robert Haschart Fri, 27 Feb 2015 13:02:55 -0800

Several months ago Tom Burton-West asked:

   The Solr wiki says "A repeated question is "how can I have the
   original term contribute more to the score than the stemmed
   version"? In Solr 4.3, the KeywordRepeatFilterFactory has been
   added to assist this functionality."


   https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming

   (Full section reproduced below.)
   I can see how in the example from the wiki reproduced below that both
   the stemmed and original term get indexed, but I don't see how the
   original term gets more weight than the stemmed term.  Wouldn't this
   require a filter that gives terms with the keyword attribute more
   weight?

   What am I missing?

   Tom

I've read the follow-ups to that message, and have used the KeywordRepeatFilterFactory in the analyzer chain for both index and query as follows:


<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

And although this may be giving some amount of boost to the unstemmed form, our users are still asking for the ability to turn stemming off altogether. I know that this can be done by copying every field to an unstemmed version of that field, but with the KeywordRepeatFilter already in play, it seems there should be _something_ that can be done to disable stemming dynamically at query time without needing to copy all the fields and re-index everything.
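For reference, a minimal sketch of the field-copy approach I would like to avoid. The names "text_unstemmed", "title", and "title_unstemmed" are hypothetical and only illustrate the shape of the schema change; the analyzer simply repeats the chain above without the keyword-repeat and stemming filters.

```xml
<!-- Hypothetical field type: same chain as above, minus KeywordRepeat and Snowball. -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No stemming filter here, so tokens are indexed and queried as written. -->
  </analyzer>
</fieldType>

<!-- Hypothetical field names: one unstemmed copy per stemmed field. -->
<field name="title_unstemmed" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="title" dest="title_unstemmed"/>
```

An "unstemmed" search would then query title_unstemmed instead of title. The cost is one extra indexed field per searchable field, plus a full re-index.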

So disabling stemming at query time is the "X"; the possible "Y"s I've thought of that might accomplish it are:


1) Allow a "Dummy" Snowball stemmer at query time

 * Create org.tartarus.snowball.ext.DummyStemmer which does no stemming
   at all.
 * Add a checkbox to the interface to allow the user to select
   "unstemmed" searching
 * Devise a way for a parameter specified with the query to be passed
   through to the <filter class="solr.SnowballPorterFilterFactory" />
   as the language to use
 * Use either "English" or "Dummy" to perform either stemmed searching
   or unstemmed searching.

2) Consult the keyword attribute, perhaps in a function query
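To make idea 1 concrete, here is a rough sketch of the static form of that configuration. The "language" attribute on the Snowball filter exists today; the "Dummy" value is hypothetical and would only resolve if a no-op org.tartarus.snowball.ext.DummyStemmer class were written and placed on the classpath. As far as I can tell the attribute is read when the schema is loaded, so choosing between the two per request is exactly the missing piece.

```xml
<!-- Stemmed searching: "English" resolves to an existing Snowball stemmer. -->
<filter class="solr.SnowballPorterFilterFactory" language="English"/>

<!-- Unstemmed searching (hypothetical): requires a no-op DummyStemmer that does not exist yet. -->
<filter class="solr.SnowballPorterFilterFactory" language="Dummy"/>
```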

Any thoughts on either of these ideas, or on different approaches to solving the problem?


Thanks in advance,

Robert Haschart
