Re: Unstemmed searching

Jan Høydahl Fri, 27 Feb 2015 16:05:10 -0800

Passing query params down into analysis chain has been discussed before but I 
think it is a bit controversial/complex.
How about a more high-level approach that lets you change the query analyzer, e.g. 
[f.<field>.]q.analyzer=<analyzer|fieldType>
Then query parsers would use the specified analyzer for a field instead of the 
schema-defined one.


About your Dummy language: it would avoid stemming the query, but it would not 
avoid false matches against stemmed words that accidentally match the query 
word. Example: "books" gets stemmed as "books,book". You search for q=book a 
ticket&lang=dummy, and still get a match on the "books" document.
Or is there a way to affect whether a token matches or not based on its payload?
A common workaround is to use a customized stemmer which prefixes all 
stemmed terms with a special unicode character, so you can totally avoid them 
if you need to.
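
As a rough illustration of that workaround, here is a toy Python sketch. It is 
not Lucene code: the marker character, toy_stem and the helper names are all 
made up for the example, and a real implementation would be a custom 
TokenFilter in the analysis chain.

```python
STEM_MARK = "\u00a7"  # hypothetical marker; assumed never to occur in real terms

def toy_stem(token):
    # Stand-in for a real stemmer: only strips a trailing "s".
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def index_terms(text):
    # Index side: keep each original token and add its stem with the marker.
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.add(STEM_MARK + toy_stem(token))
    return terms

def query_terms(text, stemmed):
    # Query side: stemmed search looks only at marked terms,
    # unstemmed search looks only at unmarked (original) terms.
    tokens = text.lower().split()
    if stemmed:
        return {STEM_MARK + toy_stem(t) for t in tokens}
    return set(tokens)

def matches(doc_text, query_text, stemmed):
    # A document matches when every query term is present in its index terms.
    return query_terms(query_text, stemmed) <= index_terms(doc_text)
```

With this scheme an unstemmed query for "book" no longer matches the "books" 
document, because the only "book" term in that document carries the marker, 
while a stemmed query still matches it.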

We discussed the option of deboosting certain token types (stems, synonyms etc) 
in https://issues.apache.org/jira/browse/LUCENE-3130 but that issue never 
resulted in anything.
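
To make the deboosting idea concrete (it is close to the payload approach 
Markus describes below), here is a toy Python sketch. It is only a model of 
the scoring idea, not the Lucene payload or Similarity API: the "kind" tag 
stands in for a payload, and toy_stem, index_doc and score are hypothetical 
names invented for the example.

```python
def toy_stem(token):
    # Stand-in for a real stemmer: only strips a trailing "s".
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

def index_doc(text):
    # Return {term: kind}; kind plays the role of a payload on the term.
    postings = {}
    for token in text.lower().split():
        postings.setdefault(toy_stem(token), "stem")  # stem, unless already seen
        postings[token] = "original"                  # original form always wins
    return postings

def score(postings, query, stem_weight=0.5):
    # Exact hits on original terms count fully; any match that goes through
    # a stem is scaled by stem_weight, which is chosen at query time.
    total = 0.0
    for token in query.lower().split():
        if postings.get(token) == "original":
            total += 1.0
        elif token in postings or toy_stem(token) in postings:
            total += stem_weight
    return total
```

Because stem_weight is applied at search time, it can be tuned, or set to 0 
for effectively unstemmed scoring, without reindexing.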

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 27. feb. 2015 kl. 22.13 skrev Markus Jelsma <markus.jel...@openindex.io>:
> 
> Hello Robert. Unstemmed terms have slightly higher IDF so they gain more 
> weight, but stemmed tokens usually have slightly higher TF, so differences 
> are marginal at best, especially when using standard TFIDFSimilarity. 
> However, by setting a payload for stemmed terms, you can recognize them at 
> search time and give them a lower score. You need a custom similarity when 
> dealing with payloads so it is possible to tune the weight without reindexing.
> 
> Markus
> 
> 
> 
> -----Original message-----
>> From:Robert Haschart <rh...@virginia.edu>
>> Sent: Friday 27th February 2015 22:01
>> To: solr-user@lucene.apache.org
>> Subject: Unstemmed searching
>> 
>> Several months ago Tom Burton-West asked:
>> 
>>    The Solr wiki says   "A repeated question is "how can I have the
>>    original term contribute
>>    more to the score than the stemmed version"? In Solr 4.3, the
>>    KeywordRepeatFilterFactory has been added to assist this
>>    functionality. "
>> 
>>    https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
>> 
>>    (Full section reproduced below.)
>>    I can see how in the example from the wiki reproduced below that both
>>    the stemmed and original term get indexed, but I don't see how the
>>    original term gets more weight than the stemmed term.  Wouldn't this
>>    require a filter that gives terms with the keyword attribute more
>>    weight?
>> 
>>    What am I missing?
>> 
>>    Tom
>> 
>> 
>> I've read the follow-ups to that message, and have used the 
>> KeywordRepeatFilterFactory in the analyzer chain for both index and 
>> query as follows:
>> 
>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> <filter class="solr.ICUFoldingFilterFactory"  />
>> <filter class="solr.StopFilterFactory" ignoreCase="true" 
>> words="stopwords.txt" enablePositionIncrements="true" />
>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
>> generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
>> catenateAll="0"/>
>> <filter class="solr.LowerCaseFilterFactory"/>
>> <filter class="solr.KeywordRepeatFilterFactory"/>
>> <filter class="solr.SnowballPorterFilterFactory"/>
>> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> 
>> And although this may be giving some amount of boost to the unstemmed 
>> form, our users are still asking for the ability to specify that 
>> stemming is turned off altogether.
>> I know that this can be done by copying every field to an unstemmed 
>> version of that field, but it seems that with the KeywordRepeatFilter 
>> already in play, that there should be _something_ that can be done to 
>> disable stemming dynamically at query time without needing to copy all 
>> the fields and re-index everything.
>> 
>> So that is "X"  and possible "Y"'s that might accomplish this that I've 
>> thought of are:
>> 
>> 1) Allow "Dummy" Snowball filter at query time
>> 
>>  * Create org.tartarus.snowball.ext.DummyStemmer which does no stemming
>>    at all.
>>  * Add a checkbox to the interface to allow the user to select
>>    "unstemmed" searching
>>  * Devise a way for a parameter specified with the query to be passed
>>    through to the <filter class="solr.SnowballPorterFilterFactory" />
>>    as the language to use
>>  * Use either "English" or "Dummy" to perform either stemmed searching
>>    or unstemmed searching.
>> 
>> 2) Consult the keyword attribute perhaps in a function query
>> 
>> Any thoughts on either of these ideas, or different approaches to solve 
>> the problem?
>> 
>> thanks in advance
>> 
>> Robert Haschart
>> 
>>
