[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

Christoph Goller (Jira) Thu, 09 Apr 2020 04:36:22 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079225#comment-17079225
 ]


Christoph Goller commented on LUCENE-8943:
------------------------------------------

In the meantime I have worked a little on the problem of the 
MultiWordsSynonymQuery.

I think queries that are allowed as synonym clauses within the 
MultiWordsSynonymQuery need two additional properties.
 # Their weights should deliver pseudo-statistics so that we can compute a 
combined IDF value for all the synonyms.
I have already implemented a prototype for SpanQueries. I did this by adding an 
interface PseudoStatistics which may be implemented by Weights.
 # They have to provide frequencies so that we can add up all frequencies just 
as SynonymQuery currently does for terms. We could either add another 
interface for that, to be implemented by some Scorers, or add a 
new ScoreMode COMPLETE_FREQUENCIES for which Scorers deliver frequencies 
instead of scores.
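
To make the two properties concrete, here is a minimal sketch in plain Java 
without any Lucene dependency. Only the name PseudoStatistics comes from the 
prototype mentioned above; the method pseudoDocFreq(), the class name and both 
helper methods are hypothetical, and taking the maximum docFreq is an 
assumption borrowed from how SynonymQuery treats its terms.

```java
import java.util.List;

public class SynonymStatsSketch {

    /** Hypothetical shape of the interface; only its name is from the prototype. */
    public interface PseudoStatistics {
        /** Assumed: an estimate of how many documents match this clause. */
        long pseudoDocFreq();
    }

    /** Property 1: one docFreq for all synonyms, here the maximum over clauses. */
    public static long combinedDocFreq(List<PseudoStatistics> clauses) {
        long max = 0;
        for (PseudoStatistics clause : clauses) {
            max = Math.max(max, clause.pseudoDocFreq());
        }
        return max;
    }

    /** Property 2: per-document frequencies of all clauses are added up. */
    public static float combinedFreq(float[] clauseFreqs) {
        float sum = 0f;
        for (float freq : clauseFreqs) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up numbers: a single-term synonym and a phrase synonym.
        List<PseudoStatistics> clauses = List.of(() -> 5000L, () -> 120L);
        System.out.println("combined docFreq = " + combinedDocFreq(clauses));
        System.out.println("combined freq    = " + combinedFreq(new float[] {2f, 1f}));
    }
}
```

The combined docFreq would feed a single IDF computation, and the summed 
frequency would be scored once against that IDF.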

Instead of implementing MultiWordsSynonymQuery as a subclass of BooleanQuery I 
would rather implement it as a more generalized version of SynonymQuery with 
lots of code copied from there.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery, where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities, and since IDFs are logarithmic we add them 
> up. This seems to be a reasonable approximation. However, this method is also 
> used to add up the IDFs of all terms in a MultiPhraseQuery, as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation, because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.
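
The difference between the two ways of combining IDFs described in the issue 
can be illustrated with a small standalone sketch. It assumes the BM25-style 
formula idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the class 
name, method names and all numbers are made up for illustration and are not 
Lucene code.

```java
public class IdfCombinationSketch {

    /** BM25-style IDF for a single term (assumed formula, see lead-in). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Adds up the IDFs of all terms; reasonable for the terms of a phrase. */
    public static double summedIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }

    /** IDF of the largest docFreq; this equals the minimum of the single IDFs. */
    public static double maxDocFreqIdf(long[] docFreqs, long docCount) {
        long max = 0;
        for (long docFreq : docFreqs) {
            max = Math.max(max, docFreq);
        }
        return idf(max, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;
        // Two alternatives at one position: a frequent term and a rare one.
        long[] alternatives = {50_000L, 200L};
        System.out.println("summed IDF     = " + summedIdf(alternatives, docCount));
        System.out.println("max-docFreq IDF = " + maxDocFreqIdf(alternatives, docCount));
    }
}
```

With summing, adding a rare alternative makes the position look rarer, 
although an extra alternative can only make the position match more documents. 
Using the largest docFreq avoids that effect.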



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
