Hello Steve,

This is an example of a query-time analyzer that has the problem:

<charFilter class="solr.MappingCharFilterFactory" mapping="lang/mapping_nl.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemmer_nl.txt" ignoreCase="false"/>
<filter class="solr.SnowballPorterFilterFactory" protected="lang/protwords_nl.txt" language="Kp"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_nl.txt" ignoreCase="false" expand="true"/>

The synonym file contains stemmed terms:

traject,verbind

A search for the plural term 'trajecten' becomes:

+DisjunctionMaxQuery(((title_nl:trajecten Synonym(title_nl:traject title_nl:verbind))))

With mm=2 this means that a search for 'trajecten' only matches documents that contain the plural form; singulars are not matched, due to mm.

I know this is a tricky problem; I hope I have conveyed it well enough.

Thanks!
Markus

-----Original message-----
> From:Steve Rowe <sar...@gmail.com>
> Sent: Thursday 21st December 2017 16:40
> To: solr-user@lucene.apache.org
> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
>
> Markus,
>
> I’m confused about exactly what operations you’re performing - could you
> provide your field type?
>
> In particular, I don’t understand why you can’t just rewrite the synonyms
> file entry
>
> word1 => word2
>
> to:
>
> word1 => word1, word2
>
> (Clearly I’m missing something about how stemming is involved.)
>
> --
> Steve
> www.lucidworks.com
>
> > On Dec 21, 2017, at 9:28 AM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > Hello Steve,
> >
> > Well, that is an interesting approach to the topic indeed. But i do not
> > think it is possible to obtain a list of all inflected forms for all words
> > that also have roots in some synonym file, the stemmers are not reversible.
> >
> > Any other ideas?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> >> From:Steve Rowe <sar...@gmail.com>
> >> Sent: Thursday 21st December 2017 0:10
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> >>
> >> Hi Markus,
> >>
> >> My suggestion: rewrite your synonyms to include the triggering word in the
> >> expanded synonyms list. That way you won’t need
> >> KeywordRepeat/RemoveDuplicates filters, and mm=100% will work as you
> >> expect.
> >>
> >> I don’t think this situation is a bug, since mm applies to the built
> >> query, not to the original query terms.
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>> On Dec 20, 2017, at 5:02 PM, Markus Jelsma <markus.jel...@openindex.io>
> >>> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Yes of course, index time synonyms lessens the query time complexity and
> >>> will solve the mm problem. It also screws IDF and the flexibility of
> >>> adding synonyms on demand. The first we do not want, the second is
> >>> impossible for us (very large main search index).
> >>>
> >>> We are looking for a solution with mm that takes KeywordRepeat, stemming
> >>> and synonym expansion into consideration. To me the current working of mm
> >>> in this case is a bug, i input one term so treat it as one term in mm,
> >>> regardless of expanded query terms.
> >>>
> >>> Any query time ideas to share? I am not well versed with the actual code
> >>> dealing with this specific subject, the code doesn't like me.
> >>> I am fine if someone points me to the code that tells mm about the number
> >>> of original input terms, and what to do. If someone does, please also
> >>> explain why the change i want to make is a bad one, what to be aware of
> >>> or what to beware of, or what to take into account.
> >>>
> >>> Also, am i the only one who regards this behaviour as a bug, or more
> >>> subtle, a weird unexpected behaviour?
> >>>
> >>> Many many thanks!
> >>> Markus
> >>>
> >>> -----Original message-----
> >>>> From:Shawn Heisey <apa...@elyograg.org>
> >>>> Sent: Wednesday 20th December 2017 22:39
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> >>>>
> >>>> On 12/19/2017 4:38 AM, Markus Jelsma wrote:
> >>>>> I have an interesting issue with mm and SynonymQuery and
> >>>>> KeywordRepeatFilter. We do query time synonym expansion and use
> >>>>> KeywordRepeat for not only finding stemmed tokens. Our synonyms are
> >>>>> already preprocessed and contain only stemmed tokens. Synonym file
> >>>>> contains: traject,verbind
> >>>>>
> >>>>> So, any non-root stem that ends up in a synonym is actually a search
> >>>>> for three terms: +DisjunctionMaxQuery(((title_nl:trajecten
> >>>>> Synonym(title_nl:traject title_nl:verbind))))
> >>>>>
> >>>>> But, our default mm requires that two terms must match if the input
> >>>>> query consists of two terms: 2<-1 5<-2 6<90%
> >>>>>
> >>>>> So, a simple query looking for a plural (trajecten) will not match a
> >>>>> document where the title contains only its singular form: q=trajecten
> >>>>> will not match document with title_nl:"een traject"
> >>>>
> >>>> I would think that doing synonym expansion at index time would remove
> >>>> any possible confusion about the number of terms at query time.
> >>>> Queries that involve synonyms will be slightly less complex, but the index
> >>>> would be larger, so it's difficult to say whether those kinds of queries
> >>>> would be any faster or not.
> >>>>
> >>>> There is one clear disadvantage to index-time synonym expansion: If you
> >>>> change your synonyms, you have to reindex.
> >>>>
> >>>> Thanks,
> >>>> Shawn
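
For readers following the mm arithmetic in this thread, here is a minimal Python sketch of how a conditional mm spec such as "2<-1 5<-2 6<90%" could be evaluated against the number of optional clauses in the built query. It is modelled on the documented semantics of the mm parameter and is not Solr's actual code; the function name min_should_match is hypothetical, and the clause count of 2 for q=trajecten is taken from the parsed query shown in the thread (one clause for title_nl:trajecten, one for the Synonym clause).

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many of `clause_count` optional clauses must match.

    Illustrative sketch only; `min_should_match` is a hypothetical helper,
    not a Solr API.
    """
    spec = spec.strip()
    result = clause_count
    if "<" in spec:
        # Conditional spec: whitespace-separated "bound<value" parts.
        for part in spec.split():
            bound_text, value = part.split("<", 1)
            if clause_count <= int(bound_text):
                # At or below this bound: keep the result of the previous
                # part (or "all clauses" if this is the first part).
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clauses, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # Negative values mean "all but this many".
    result = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, result))


spec = "2<-1 5<-2 6<90%"
# One input term would need 1 match, but the built query for q=trajecten
# has two optional clauses, so both must match.
print(min_should_match(1, spec))  # 1
print(min_should_match(2, spec))  # 2
```

Under these assumptions the single input term 'trajecten' is counted as two clauses, so a title containing only "een traject" satisfies one of the two required clauses and is rejected.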