RE: Unstemmed searching

Markus Jelsma Fri, 27 Feb 2015 13:17:06 -0800

Hello Robert. Unstemmed terms have a slightly higher IDF, so they gain more 
weight, but stemmed tokens usually have a slightly higher TF, so the 
differences are marginal at best, especially with the standard 
TFIDFSimilarity. However, by setting a payload on stemmed terms, you can 
recognize them at search time and give them a lower score. You need a custom 
similarity when dealing with payloads anyway, and because the weight is 
applied there at search time, you can tune it without reindexing.
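
To make the TF/IDF trade-off and the payload idea concrete, here is a rough 
toy model in plain Java. It is only a sketch: it does not use any Lucene or 
Solr API, the class and method names (StemWeightSketch, TermStats, 
stemmedFactor) are made up for illustration, the term statistics are invented, 
and the formula only approximates the shape of classic TF-IDF scoring 
(sqrt(freq) for TF, 1 + ln(numDocs / (docFreq + 1)) for IDF, IDF squared in 
the product). The "stemmed" flag stands in for what a payload would tell a 
custom similarity at search time.

```java
// Toy model only: mimics the general shape of classic TF-IDF term scoring.
// It does not call any Lucene or Solr API; all names and numbers are hypothetical.
final class StemWeightSketch {

    /** Statistics for one query term against one document. */
    static final class TermStats {
        final int freqInDoc;   // occurrences of the term in the document
        final int docFreq;     // number of documents containing the term
        final boolean stemmed; // stands in for a payload marking stemmed tokens

        TermStats(int freqInDoc, int docFreq, boolean stemmed) {
            this.freqInDoc = freqInDoc;
            this.docFreq = docFreq;
            this.stemmed = stemmed;
        }
    }

    /** Term-frequency component: grows with the square root of the frequency. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms get a larger value. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Per-term score. Stemmed tokens are multiplied by stemmedFactor,
     * which is supplied at search time and therefore tunable without reindexing.
     */
    static double score(TermStats t, int numDocs, double stemmedFactor) {
        double idf = idf(t.docFreq, numDocs);
        double base = tf(t.freqInDoc) * idf * idf;
        return t.stemmed ? base * stemmedFactor : base;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Unstemmed form: rarer (higher IDF) but fewer hits in the document.
        TermStats original = new TermStats(1, 50, false);
        // Stemmed form: more common (lower IDF) but more hits in the document.
        TermStats stem = new TermStats(3, 120, true);

        for (double factor : new double[] {1.0, 0.5}) {
            System.out.printf("factor=%.1f original=%.2f stemmed=%.2f%n",
                    factor,
                    score(original, numDocs, factor),
                    score(stem, numDocs, factor));
        }
    }
}
```

With a factor of 1.0 the two forms score about the same in this model, which 
is the "marginal at best" case. Lowering the factor pushes the stemmed match 
below the unstemmed one, and since the factor is only read at search time it 
can be changed per deployment or per request.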


Markus

 
 
-----Original message-----
> From:Robert Haschart <rh...@virginia.edu>
> Sent: Friday 27th February 2015 22:01
> To: solr-user@lucene.apache.org
> Subject: Unstemmed searching
> 
> Several months ago Tom Burton-West asked:
> 
>     The Solr wiki says   "A repeated question is "how can I have the
>     original term contribute
>     more to the score than the stemmed version"? In Solr 4.3, the
>     KeywordRepeatFilterFactory has been added to assist this
>     functionality. "
> 
>     https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
> 
>     (Full section reproduced below.)
>     I can see how in the example from the wiki reproduced below that both
>     the stemmed and original term get indexed, but I don't see how the
>     original term gets more weight than the stemmed term.  Wouldn't this
>     require a filter that gives terms with the keyword attribute more
>     weight?
> 
>     What am I missing?
> 
>     Tom
> 
> 
> I've read the follow-ups to that message, and have used the 
> KeywordRepeatFilterFactory in the analyzer chain for both index and 
> query as follows:
> 
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.ICUFoldingFilterFactory"  />
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" enablePositionIncrements="true" />
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
> catenateAll="0"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.KeywordRepeatFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> 
> And although this may be giving some amount of boost to the unstemmed 
> form, our users are still asking for the ability to specify that 
> stemming is turned off altogether.
> I know that this can be done by copying every field to an unstemmed 
> version of that field, but it seems that with the KeywordRepeatFilter 
> already in play, that there should be _something_ that can be done to 
> disable stemming dynamically at query time without needing to copy all 
> the fields and re-index everything.
> 
> So that is "X", and the possible "Y"s I've thought of that might 
> accomplish this are:
> 
> 1) Allow "Dummy" Snowball filter at query time
> 
>   * Create org.tartarus.snowball.ext.DummyStemmer which does no stemming
>     at all.
>   * Add a checkbox to the interface to allow the user to select
>     "unstemmed" searching
>   * Devise a way for a parameter specified with the query to be passed
>     through to the <filter class="solr.SnowballPorterFilterFactory" />
>     as the language to use
>   * Use either "English" or "Dummy" to perform either stemmed searching
>     or unstemmed searching.
> 
> 2) Consult the keyword attribute perhaps in a function query
> 
> Any thoughts on either of these ideas, or different approaches to solve 
> the problem?
> 
> thanks in advance
> 
> Robert Haschart
> 
>
