[ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357258#comment-17357258
 ] 

Alessandro Benedetti commented on LUCENE-8216:
----------------------------------------------

Hi [~jim.ferenczi], I am investigating BM25F in Lucene and Solr and I ended up 
here: org/apache/lucene/sandbox/search/CombinedFieldQuery.java:289 
In BM25F the IDF is calculated across fields, so as far as I can tell from my 
investigation so far, the document frequency for a term T 
should be:

The number of documents in the corpus that contain the term T (in any field).

So effectively it would be the cardinality of the union of all 
the posting lists for that term across the various fields.

From a quick look at your code, the document frequency is just calculated as 
the max document frequency across all the fields involved (which is 
actually a lower bound of the real blended document frequency).
Was this approximation chosen for simplicity, or is there another reason?
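
To make the difference concrete, here is a minimal sketch in plain Java. It 
does not use the Lucene API: the class name BlendedDocFreq, both method names 
and the toy posting lists are hypothetical, made up only for illustration. It 
compares the max per-field document frequency with the cardinality of the 
union of the posting lists for the same term.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class BlendedDocFreq {

    // Max of the per-field document frequencies for term T.
    // Each int[] is a toy posting list: the doc IDs containing T in that field.
    static int maxDocFreq(Map<String, int[]> postingsByField) {
        int max = 0;
        for (int[] docs : postingsByField.values()) {
            max = Math.max(max, docs.length);
        }
        return max;
    }

    // Number of distinct documents containing T in at least one field,
    // i.e. the cardinality of the union of the per-field posting lists.
    static int unionDocFreq(Map<String, int[]> postingsByField) {
        Set<Integer> union = new HashSet<>();
        for (int[] docs : postingsByField.values()) {
            for (int doc : docs) {
                union.add(doc);
            }
        }
        return union.size();
    }

    public static void main(String[] args) {
        // Hypothetical posting lists for one term in two fields.
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("title", new int[] {1, 2, 5});
        postings.put("body", new int[] {2, 3, 5, 8});

        System.out.println("max df   = " + maxDocFreq(postings));   // 4
        System.out.println("union df = " + unionDocFreq(postings)); // 5
    }
}
```

In this toy case the union (5) is larger than the max (4). The two values are 
equal only when the largest posting list already contains every document that 
matches in the other fields, and the union can never exceed the sum of the 
per-field document frequencies.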

> Better cross-field scoring
> --------------------------
>
>                 Key: LUCENE-8216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8216
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Jim Ferenczi
>            Priority: Major
>             Fix For: 8.0
>
>         Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery which tries to help there but it probably 
> tries to do too much on some aspects (handling cross-field term queries AND 
> synonyms) and too little on other ones (it tries to merge index-level 
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25 like the fact that the impact of the 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
