[ https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357258#comment-17357258 ]
Alessandro Benedetti commented on LUCENE-8216:
----------------------------------------------

Hi [~jim.ferenczi],

I am investigating BM25F in Lucene and Solr and I ended up here:
org/apache/lucene/sandbox/search/CombinedFieldQuery.java:289

When calculating the IDF in BM25F we do that across fields, so, as far as I have explored the matter so far, the document frequency for a term T should be the number of documents in the corpus that contain the term T (in any field). Effectively, it would be the cardinality of the union of all the posting lists for that term across the various fields.

From a quick look at your code, the document frequency is just calculated as the max document frequency across all the fields involved (which is actually a lower bound of the real blended document frequency).

Was it done with this approximation for simplicity, or is there another reason?

> Better cross-field scoring
> --------------------------
>
>                 Key: LUCENE-8216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8216
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Jim Ferenczi
>            Priority: Major
>             Fix For: 8.0
>
>         Attachments: LUCENE-8216.patch, LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields.
> Today we have BlendedTermQuery which tries to help there but it probably
> tries to do too much on some aspects (handling cross-field term queries AND
> synonyms) and too little on other ones (it tries to merge index-level
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple
> fields would retain the benefits of BM25 like the fact that the impact of the
> term frequency saturates quickly, which is not the case with BlendedTermQuery
> if you have occurrences across many fields.
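To make the document-frequency question in the comment above concrete, here is a minimal sketch comparing the two estimates on toy data. It is not Lucene code: the class `DocFreqSketch`, its methods, and the example posting lists are hypothetical, and each posting list is assumed to be an array of unique document IDs for one field.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
public class DocFreqSketch {

    // Exact blended document frequency: the cardinality of the union
    // of the per-field posting lists for one term.
    public static int unionDocFreq(List<int[]> postingsPerField) {
        BitSet seen = new BitSet();
        for (int[] postings : postingsPerField) {
            for (int docId : postings) {
                seen.set(docId);
            }
        }
        return seen.cardinality();
    }

    // Approximation described in the comment: the largest per-field
    // document frequency. Assumes doc IDs are unique within each list,
    // so the array length equals that field's document frequency.
    public static int maxDocFreq(List<int[]> postingsPerField) {
        int max = 0;
        for (int[] postings : postingsPerField) {
            max = Math.max(max, postings.length);
        }
        return max;
    }

    public static void main(String[] args) {
        List<int[]> postings = List.of(
            new int[] {0, 2, 5},    // term occurrences in a "title" field
            new int[] {2, 3, 5, 7}  // term occurrences in a "body" field
        );
        System.out.println("union df = " + unionDocFreq(postings));
        System.out.println("max df   = " + maxDocFreq(postings));
    }
}
```

On this toy input the union contains documents {0, 2, 3, 5, 7}, so the exact value is 5 while the max-based estimate is 4. The max can never exceed the union, which is why it acts as a lower bound; the two agree only when one field's posting list already covers every document matched in the other fields.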