Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter

Walter Underwood Thu, 21 Dec 2017 08:13:51 -0800

You can find all the inflected forms that are in your index. Search for the 
root form, use highlighting to pull out matches, and collect them. It is a 
bother, but not that hard for a program to do.
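
A minimal sketch of the collection step, assuming the response has the usual
"highlighting" section (document id -> field -> list of snippets) and the
default <em>...</em> markers. The sample data and the function name are made up
for illustration; a real program would fetch the response from Solr.

```python
import re
from collections import Counter

# Assumption: default highlight markers. Adjust the pattern if the
# highlighter is configured with different pre/post tags.
EM = re.compile(r"<em>(.*?)</em>", re.IGNORECASE | re.DOTALL)

def collect_surface_forms(highlighting):
    """Count the highlighted tokens found in a 'highlighting' section.

    `highlighting` maps document id -> field name -> list of snippets.
    """
    forms = Counter()
    for fields in highlighting.values():
        for snippets in fields.values():
            for snippet in snippets:
                for match in EM.findall(snippet):
                    forms[match.strip().lower()] += 1
    return forms

# Hypothetical response fragment for a search on the root form "traject".
sample = {
    "doc1": {"title_nl": ["een <em>traject</em> naar huis"]},
    "doc2": {"title_nl": ["nieuwe <em>trajecten</em> en een <em>Traject</em>"]},
}

forms = collect_surface_forms(sample)
print(sorted(forms))  # ['traject', 'trajecten']
```

Paging through all matching documents and merging the counters gives the set
of inflected forms that actually occur in the index for that root.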


In the synonym file, you don’t need to list an inflected form of the synonym, 
because it will be stemmed. So:

traject => verbind
trajecten => verbind
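
Once the forms are collected, writing those lines is mechanical. A small
sketch; the function name and the keep_original switch are my own additions.
With keep_original=True the triggering word is included in the expansion, which
is what Steve suggested further down in the thread.

```python
def explicit_mappings(forms, target, keep_original=False):
    """Build 'form => target' synonym lines, one per collected surface form."""
    lines = []
    for form in sorted(set(forms)):
        if form == target:
            continue  # no point mapping the target onto itself
        if keep_original:
            lines.append(f"{form} => {form}, {target}")
        else:
            lines.append(f"{form} => {target}")
    return lines

print("\n".join(explicit_mappings(["trajecten", "traject"], "verbind")))
```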

If you want an algorithmic solution, look for a “morphological generator”. That 
is the inverse of a morphological analyzer. In the olden days, query time 
generation was an alternative to stemming (analysis) at index time. But that 
makes the query much larger and much slower.
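
To illustrate why the query grows, here is a toy version of query time
generation. The FORMS table is a hand-written stand-in for a real morphological
generator, and its contents are only illustrative.

```python
# Hypothetical root -> surface forms table; a real generator would compute this.
FORMS = {
    "traject": ["traject", "trajecten"],
    "verbind": ["verbind", "verbinden", "verbinding"],
}

def expand_query(field, terms, forms):
    """Replace each query term with an OR group of all its known forms."""
    clauses = []
    for term in terms:
        variants = forms.get(term, [term])  # unknown terms pass through as-is
        clauses.append("(" + " OR ".join(f"{field}:{v}" for v in variants) + ")")
    return " AND ".join(clauses)

print(expand_query("title_nl", ["traject", "verbind"], FORMS))
```

Two input terms already become five term clauses here, and every extra
inflected form adds another one.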

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)


> On Dec 21, 2017, at 6:28 AM, Markus Jelsma <[email protected]> wrote:
> 
> Hello Steve,
> 
> Well, that is an interesting approach to the topic indeed. But I do not think
> it is possible to obtain a list of all inflected forms for all words that
> also have roots in some synonym file, because the stemmers are not
> reversible.
> 
> Any other ideas?
> 
> Thanks,
> Markus
> 
> -----Original message-----
>> From:Steve Rowe <[email protected]>
>> Sent: Thursday 21st December 2017 0:10
>> To: [email protected]
>> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
>> 
>> Hi Markus,
>> 
>> My suggestion: rewrite your synonyms to include the triggering word in the 
>> expanded synonyms list.  That way you won’t need 
>> KeywordRepeat/RemoveDuplicates filters, and mm=100% will work as you expect.
>> 
>> I don’t think this situation is a bug, since mm applies to the built query, 
>> not to the original query terms.
>> 
>> --
>> Steve
>> www.lucidworks.com
>> 
>>> On Dec 20, 2017, at 5:02 PM, Markus Jelsma <[email protected]> 
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> Yes of course, index time synonyms lessen the query time complexity and
>>> will solve the mm problem. They also screw up IDF and the flexibility of
>>> adding synonyms on demand. The first we do not want, the second is
>>> impossible for us (very large main search index).
>>> 
>>> We are looking for a solution with mm that takes KeywordRepeat, stemming
>>> and synonym expansion into consideration. To me the current behaviour of mm
>>> in this case is a bug: I input one term, so mm should treat it as one term,
>>> regardless of how many terms the query expands to.
>>> 
>>> Any query time ideas to share? I am not well versed in the actual code
>>> dealing with this specific subject; the code doesn't like me. I am fine if
>>> someone points me to the code that tells mm about the number of original
>>> input terms, and what to do. If someone does, please also explain why the
>>> change I want to make is a bad one, what to be aware of or what to beware
>>> of, or what to take into account.
>>> 
>>> Also, am I the only one who regards this behaviour as a bug, or, more
>>> subtly, as weird unexpected behaviour?
>>> 
>>> Many many thanks!
>>> Markus
>>> 
>>> -----Original message-----
>>>> From:Shawn Heisey <[email protected]>
>>>> Sent: Wednesday 20th December 2017 22:39
>>>> To: [email protected]
>>>> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
>>>> 
>>>> On 12/19/2017 4:38 AM, Markus Jelsma wrote:
>>>>> I have an interesting issue with mm and SynonymQuery and
>>>>> KeywordRepeatFilter. We do query time synonym expansion and use
>>>>> KeywordRepeat so that we match not only stemmed tokens. Our synonyms are
>>>>> already preprocessed and contain only stemmed tokens. The synonym file
>>>>> contains: traject,verbind
>>>>> 
>>>>> So, any non-root form whose stem appears in a synonym is actually a search
>>>>> for three terms: +DisjunctionMaxQuery(((title_nl:trajecten
>>>>> Synonym(title_nl:traject title_nl:verbind))))
>>>>> 
>>>>> But our default mm requires that both terms match if the input query
>>>>> consists of two terms: 2<-1 5<-2 6<90%
>>>>> 
>>>>> So, a simple query looking for a plural (trajecten) will not match a
>>>>> document whose title contains only its singular form: q=trajecten
>>>>> will not match a document with title_nl:"een traject"
>>>> 
>>>> I would think that doing synonym expansion at index time would remove
>>>> any possible confusion about the number of terms at query time.  Queries
>>>> that involve synonyms will be slightly less complex, but the index would
>>>> be larger, so it's difficult to say whether those kinds of queries would
>>>> be any faster or not.
>>>> 
>>>> There is one clear disadvantage to index-time synonym expansion: If you
>>>> change your synonyms, you have to reindex.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> 
>> 
>>
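
For anyone following the arithmetic in the original question, here is a rough
sketch of how an mm spec such as 2<-1 5<-2 6<90% turns a clause count into a
required number of matches. It is a simplified re-implementation written from
the documented mm semantics, not Solr's own code, and the function names are
mine. The clause count of 2 for q=trajecten is an assumption taken from the
thread: the keyword form and the Synonym(...) group are counted separately.

```python
def _apply(count, value):
    """Apply one mm value ('3', '-1', '90%', '-25%') to a clause count."""
    if value.endswith("%"):
        calc = count * int(value[:-1]) / 100
        # Negative percentages mean "all but this share", rounded toward zero.
        result = count + int(calc) if calc < 0 else int(calc)
    else:
        n = int(value)
        result = count + n if n < 0 else n
    return max(result, 0)

def min_should_match(count, spec):
    """Number of optional clauses that must match for a given mm spec."""
    spec = spec.strip()
    if "<" not in spec:
        result = _apply(count, spec)
    else:
        result = count  # at or below the first bound, every clause is required
        for part in spec.split():
            bound, value = part.split("<", 1)
            if count <= int(bound):
                break
            result = _apply(count, value)
    return min(count, result)

SPEC = "2<-1 5<-2 6<90%"
print(min_should_match(1, SPEC))  # 1
print(min_should_match(2, SPEC))  # 2
```

With one clause a single match is enough, but once the single input term is
counted as two clauses both must match, and a title containing only "traject"
satisfies just one of them.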
