Several months ago, Tom Burton-West asked:
The Solr wiki says "A repeated question is "how can I have the original
term contribute more to the score than the stemmed version"? In Solr 4.3,
the KeywordRepeatFilterFactory has been added to assist this
functionality."
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
(Full section reproduced below.)
I can see how in the example from the wiki reproduced below that both
the stemmed and original term get indexed, but I don't see how the
original term gets more weight than the stemmed term. Wouldn't this
require a filter that gives terms with the keyword attribute more
weight?
What am I missing?
Tom
I've read the follow-ups to that message, and have used the
KeywordRepeatFilterFactory in the analyzer chain for both index and
query as follows:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
And although this may be giving some amount of boost to the unstemmed
form, our users are still asking for the ability to specify that
stemming is turned off altogether.
I know that this can be done by copying every field to an unstemmed
version of that field, but it seems that, with the KeywordRepeatFilter
already in play, there should be _something_ that can be done to
disable stemming dynamically at query time without needing to copy all
the fields and re-index everything.
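For reference, a minimal sketch of that copy-field approach, assuming a
stemmed source field named "title" already exists. The names
"text_unstemmed" and "title_unstemmed" are hypothetical; the analyzer is
the chain shown above with the KeywordRepeat, Snowball, and
RemoveDuplicates filters dropped.

```xml
<!-- Hypothetical field type: same chain as above, minus the stemming filters -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical unstemmed copy of an existing stemmed field named "title" -->
<field name="title_unstemmed" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="title" dest="title_unstemmed"/>
```

With something like this in place, an "unstemmed" search would query
title_unstemmed instead of title. The cost is the one described above:
one extra field per searchable field, plus a full re-index.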
So that is the "X". The possible "Y"s I've thought of that might
accomplish it are:
1) Allow a "Dummy" Snowball filter at query time
   * Create org.tartarus.snowball.ext.DummyStemmer, which does no
     stemming at all.
   * Add a checkbox to the interface to allow the user to select
     "unstemmed" searching.
   * Devise a way for a parameter specified with the query to be passed
     through to the <filter class="solr.SnowballPorterFilterFactory" />
     as the language to use.
   * Use either "English" or "Dummy" to perform either stemmed or
     unstemmed searching.
2) Consult the keyword attribute, perhaps in a function query
Any thoughts on either of these ideas, or different approaches to solve
the problem?
Thanks in advance,
Robert Haschart