RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma Sun, 18 Nov 2018 14:21:17 -0800

Hello,

Apologies for bothering you all again, but I really need some help with this 
matter. How can we resolve this issue? Are we dealing with a bug here (in 
which case I'll open a ticket), or am I doing something wrong?


Is there anyone who has had the same issue or understands the problem?

Many thanks,
Markus 

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Tuesday 13th November 2018 9:52
> To: solr-user <solr-user@lucene.apache.org>
> Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should 
> match (edismax)
> 
> Hello, apologies for this long-winded e-mail.
> 
> Our fields have KeywordRepeat and language-specific filters such as a 
> stemmer; the final filter at query time is SynonymGraph. For those of you 
> wondering about the parsed queries below: we do not use 
> RemoveDuplicatesFilter, for the reasons described in [1]. 
> 
> We use a custom QParser extending edismax and also extend 
> ExtendedSolrQueryParser, so we are able to override newFieldQuery in case we 
> have to. The problem also directly applies to Solr's vanilla edismax. The 
> file synonyms.txt contains the stemmed versions of the original terms.
> 
> Consider this example synonym set [bier,brouw] where bier means beer and 
> brouw is the stemmed version of brouwsel (brewage, concoction), and consider 
> these parameters on /select:
> qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25
> 
> The queries q=bier and q=brouw both parse to the following query and give the 
> desired results (notice the missing RemoveDuplicates here):
> +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> content_nl:brouw))~2))
> 
> However, for q=brouwsel something (partially) unexpected happens:
> +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> 
> This results in a BooleanQuery where, due to mm=2, both clauses need to 
> match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> course fixes the problem, but that is not what we want.
> 
> What is also unexpected, and may be related to the problem, is that when 
> checking the analyzer output via the GUI, we see the position incrementing 
> when KeywordRepeat and SynonymGraph are combined. When these filters are not 
> combined, the positions are always 1, as expected. When combined we get this 
> for 'brouw':
> term: bier  brouw  bier  brouw
> pos:  1     1      2     2
> 
> or for 'brouwsel':
> term: brouwsel  bier  brouw
> pos:  1         2     2
> 
> ExtendedSolrQueryParser, and everything underneath it, is a complicated 
> piece of code. In the end it extends Lucene's QueryBuilder, but it does not 
> always seem to rely on its results. Edismax, for example, 'resets' 
> minShouldMatch in SolrPluginUtils.setMinShouldMatch(). This is a complicated 
> web of code, I am a bit too deep in this unfamiliar area, and I need help 
> here.
> 
> So, my question is: how do I solve this problem, or how should I approach 
> it? What is the actual problem? How can I get the same stable results for 
> both queries? Does the odd position increment have anything to do with it 
> (it seems Lucene's QueryBuilder does something with it)? What do I need to do?
> 
> Many thanks,
> Markus
> 
> ps. this is on Solr 7.2.1 and 7.5.0.
> 
> [1] 
> http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
>
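
To make the two parsed queries in the quoted message easier to compare, here
is a minimal Python sketch of how the token positions reported by the analysis
GUI could lead to the clause counts seen, and why the mm spec then requires
both clauses. This is a simplified model, not Lucene's QueryBuilder or Solr's
SolrPluginUtils code. The function names clauses_from_tokens and
min_should_match are hypothetical, the grouping rule (one clause per token
position, terms sharing a position forming one synonym clause) is an
assumption inferred from the parsed queries, and the mm evaluation follows the
usual reading of conditional mm specs. The mm value is written URL-decoded
(6<90% rather than 6<90%25), and whitespace around "<" is not handled.

```python
def clauses_from_tokens(tokens):
    """Group (term, position) pairs into one clause per position.

    Assumption: terms that share a position end up in a single
    synonym clause; each distinct position yields its own clause.
    """
    by_pos = {}
    for term, pos in tokens:
        by_pos.setdefault(pos, []).append(term)
    return [tuple(by_pos[pos]) for pos in sorted(by_pos)]


def min_should_match(clause_count, spec):
    """Evaluate an mm spec such as '2<-1 5<-2 6<90%' for clause_count."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional spec: walk the 'N<value' parts from left to right.
        for part in spec.split():
            upper, _, inner = part.partition("<")
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, inner)
        return result
    if spec.endswith("%"):
        # Percentage, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # Negative values mean "all but this many clauses".
    result = clause_count + calc if calc < 0 else calc
    return max(result, 0)


MM_SPEC = "2<-1 5<-2 6<90%"

# Token positions as reported in the quoted message.
brouw = [("bier", 1), ("brouw", 1), ("bier", 2), ("brouw", 2)]
brouwsel = [("brouwsel", 1), ("bier", 2), ("brouw", 2)]

for name, tokens in (("brouw", brouw), ("brouwsel", brouwsel)):
    clauses = clauses_from_tokens(tokens)
    mm = min_should_match(len(clauses), MM_SPEC)
    print(name, clauses, "mm =", mm)
```

Under these assumptions both queries yield two clauses and mm=2. For 'brouw'
the two clauses are the same synonym set, so any document containing bier or
brouw satisfies both. For 'brouwsel' the first clause is the unstemmed term on
its own, so a document must contain brouwsel and also bier or brouw, which
would explain the very small result set.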
