Hello Steve,

This is an example of a query-time analyzer that has the problem:

      <charFilter class="solr.MappingCharFilterFactory" 
mapping="lang/mapping_nl.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" 
generateWordParts="1" generateNumberParts="1" catenateWords="1" 
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordRepeatFilterFactory"/>
      <filter class="solr.StemmerOverrideFilterFactory" 
dictionary="lang/stemmer_nl.txt" ignoreCase="false"/>
      <filter class="solr.SnowballPorterFilterFactory" 
protected="lang/protwords_nl.txt" language="Kp"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" 
synonyms="lang/synonyms_nl.txt" ignoreCase="false" expand="true"/>

The synonym file contains stemmed terms:  traject,verbind

A search for the plural term 'trajecten' becomes 
+DisjunctionMaxQuery(((title_nl:trajecten Synonym(title_nl:traject 
title_nl:verbind))))

With mm=2 this means that a search for 'trajecten' will only match documents 
that contain that plural form; singulars are not matched, because of mm.
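
To make the arithmetic concrete, here is a minimal Python sketch of how an mm 
spec could be evaluated against the number of optional clauses. It is a 
simplified illustration loosely modeled on my understanding of Solr's mm 
handling, not the actual Solr code. The function name min_should_match, the 
set-based clause representation, and the assumption that the title "een 
traject" is indexed as the terms een and traject are all hypothetical. It uses 
the default mm spec quoted further down in this thread.

```python
def min_should_match(optional_clauses: int, spec: str) -> int:
    """Return how many optional clauses must match for a given mm spec.

    Simplified sketch: supports plain integers ("2", "-1"), percentages
    ("90%", "-25%") and conditional specs ("2<-1 5<-2 6<90%").
    """
    result = optional_clauses
    spec = spec.strip()
    if "<" in spec:
        # Conditional spec: walk the "bound<value" parts from left to right.
        for part in spec.split():
            upper, value = part.split("<", 1)
            if optional_clauses <= int(upper):
                return result
            result = min_should_match(optional_clauses, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clause count, truncated toward zero.
        calc = optional_clauses * int(spec[:-1]) / 100.0
        result = optional_clauses + int(calc) if calc < 0 else int(calc)
    else:
        calc = int(spec)
        result = optional_clauses + calc if calc < 0 else calc
    # Never require more clauses than exist, and never fewer than zero.
    return max(0, min(optional_clauses, result))


# The parsed query for 'trajecten' has two optional clauses:
#   title_nl:trajecten   and   Synonym(title_nl:traject title_nl:verbind)
clauses = [{"trajecten"}, {"traject", "verbind"}]

# Assumption: the title "een traject" is indexed as these two terms.
doc_terms = {"een", "traject"}

required = min_should_match(len(clauses), "2<-1 5<-2 6<90%")
matched = sum(1 for clause in clauses if clause & doc_terms)

print("required:", required)                  # clauses that must match
print("matched:", matched)                    # clauses this document satisfies
print("document matches:", matched >= required)
```

Under these assumptions the single input term counts as two clauses, both are 
required, and the document satisfies only the synonym clause, so it is 
rejected.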

I know this is a tricky problem, hope to have conveyed it well enough.

Thanks!
Markus
 
-----Original message-----
> From:Steve Rowe <sar...@gmail.com>
> Sent: Thursday 21st December 2017 16:40
> To: solr-user@lucene.apache.org
> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> 
> Markus,
> 
> I’m confused about exactly what operations you’re performing - could you 
> provide your field type?
> 
> In particular, I don’t understand why you can’t just rewrite the synonyms 
> file entry
> 
>   word1 => word2
> 
> to:
> 
>   word1 => word1, word2
> 
> (Clearly I’m missing something about how stemming is involved.)
> 
> --
> Steve
> www.lucidworks.com
> 
> > On Dec 21, 2017, at 9:28 AM, Markus Jelsma <markus.jel...@openindex.io> 
> > wrote:
> > 
> > Hello Steve,
> > 
> > Well, that is an interesting approach to the topic indeed. But i do not 
> > think it is possible to obtain a list of all inflected forms for all words 
> > that also have roots in some synonym file, the stemmers are not reversible. 
> > 
> > Any other ideas?
> > 
> > Thanks,
> > Markus
> > 
> > -----Original message-----
> >> From:Steve Rowe <sar...@gmail.com>
> >> Sent: Thursday 21st December 2017 0:10
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> >> 
> >> Hi Markus,
> >> 
> >> My suggestion: rewrite your synonyms to include the triggering word in the 
> >> expanded synonyms list.  That way you won’t need 
> >> KeywordRepeat/RemoveDuplicates filters, and mm=100% will work as you 
> >> expect.
> >> 
> >> I don’t think this situation is a bug, since mm applies to the built 
> >> query, not to the original query terms.
> >> 
> >> --
> >> Steve
> >> www.lucidworks.com
> >> 
> >>> On Dec 20, 2017, at 5:02 PM, Markus Jelsma <markus.jel...@openindex.io> 
> >>> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> Yes of course, index time synonyms lessens the query time complexity and 
> >>> will solve the mm problem. It also screws IDF and the flexibility of 
> >>> adding synonyms on demand. The first we do not want, the second is 
> >>> impossible for us (very large main search index).
> >>> 
> >>> We are looking for a solution with mm that takes KeywordRepeat, stemming 
> >>> and synonym expansion into consideration. To me the current working of mm 
> >>> in this case is a bug, i input one term so treat it as one term in mm, 
> >>> regardless of expanded query terms.
> >>> 
> >>> Any query time ideas to share? I am not well versed with the actual code 
> >>> dealing with this specific subject, the code doesn't like me. I am fine 
> >>> if someone points me to the code that tells mm about the number of 
> >>> original input terms, and what to do. If someone does, please also 
> >>> explain why the change i want to make is a bad one, what to be aware of 
> >>> or what to beware of, or what to take into account.
> >>> 
> >>> Also, am i the only one who regards this behaviour as a bug, or more 
> >>> subtle, a weird unexpected behaviour?
> >>> 
> >>> Many many thanks!
> >>> Markus
> >>> 
> >>> -----Original message-----
> >>>> From:Shawn Heisey <apa...@elyograg.org>
> >>>> Sent: Wednesday 20th December 2017 22:39
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> >>>> 
> >>>> On 12/19/2017 4:38 AM, Markus Jelsma wrote:
> >>>>> I have an interesting issue with mm and SynonymQuery and 
> >>>>> KeywordRepeatFilter. We do query time synonym expansion and use 
> >>>>> KeywordRepeat for not only finding stemmed tokens. Our synonyms are 
> >>>>> already preprocessed and contain only stemmed tokens. Synonym file 
> >>>>> contains: traject,verbind
> >>>>> 
> >>>>> So, any non-root stem that ends up in a synonym is actually a search 
> >>>>> for three terms: +DisjunctionMaxQuery(((title_nl:trajecten 
> >>>>> Synonym(title_nl:traject title_nl:verbind))))
> >>>>> 
> >>>>> But, our default mm requires that two terms must match if the input 
> >>>>> query consists of two terms: 2<-1 5<-2 6<90%
> >>>>> 
> >>>>> So, a simple query looking for a plural (trajecten) will not match a 
> >>>>> document where the title contains only its singular form: q=trajecten 
> >>>>> will not match document with title_nl:"een traject"
> >>>> 
> >>>> I would think that doing synonym expansion at index time would remove
> >>>> any possible confusion about the number of terms at query time.  Queries
> >>>> that involve synonyms will be slightly less complex, but the index would
> >>>> be larger, so it's difficult to say whether those kinds of queries would
> >>>> be any faster or not.
> >>>> 
> >>>> There is one clear disadvantage to index-time synonym expansion: If you
> >>>> change your synonyms, you have to reindex.
> >>>> 
> >>>> Thanks,
> >>>> Shawn
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 
