[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347403#comment-17347403
 ] 

Adrien Grand commented on LUCENE-9335:
--------------------------------------

bq. I also notice that BMM BulkScorer collects roughly 10X the amount of docs 
compared with BMM scorer, which in turn also collects > 10X the amount of docs 
compared with BMW. I feel this may also explain the unexpected slow down? In 
general I would assume these scorers to all collect the same amount of top docs.

Actually this matches my expectation. BMM and BMW differ in that BMM decides 
which scorers lead iteration only once per block, while BMW has to make that 
decision on every document. So BMM collects more documents than BMW, but BMW 
runs the risk that trying to be too smart ends up slower than a simpler 
approach.
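
To make the trade-off concrete, here is a toy sketch of the two decision 
granularities. It is not Lucene's code: the class name, the constant per-term 
scores, the synthetic postings, the block size and the fixed minimum 
competitive score are all assumptions made up for illustration. The 
block-level strategy picks the "essential" terms once per block and collects 
every document they match; the per-document strategy checks an upper bound 
for each candidate document.

```java
import java.util.Arrays;
import java.util.Comparator;

public class PruningGranularityDemo {

    // Hypothetical setup: each term adds a constant score to every doc it matches.
    static final double[] TERM_SCORE = {0.5, 1.0, 3.0};
    static final int NUM_DOCS = 84;
    static final int BLOCK_SIZE = 4;
    // Assumed fixed for simplicity; in a real top-k search this threshold rises over time.
    static final double MIN_COMPETITIVE = 1.2;

    // Synthetic postings: term 0 is frequent and low-scoring, term 2 rare and high-scoring.
    static boolean matches(int term, int doc) {
        switch (term) {
            case 0: return doc % 2 == 0;
            case 1: return doc % 3 == 0;
            default: return doc % 7 == 0;
        }
    }

    /** Block-level decisions (BMM-like). Returns {decisions, collectedDocs}. */
    static int[] blockLevel() {
        int decisions = 0, collected = 0;
        for (int start = 0; start < NUM_DOCS; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, NUM_DOCS);
            // Maximum score each term can contribute inside this block.
            double[] blockMax = new double[TERM_SCORE.length];
            Integer[] order = new Integer[TERM_SCORE.length];
            for (int t = 0; t < TERM_SCORE.length; t++) {
                order[t] = t;
                for (int doc = start; doc < end; doc++) {
                    if (matches(t, doc)) blockMax[t] = TERM_SCORE[t];
                }
            }
            // One decision per block: terms whose cumulative block max stays below
            // the threshold are non-essential, the remaining terms lead iteration.
            Arrays.sort(order, Comparator.comparingDouble((Integer t) -> blockMax[t]));
            boolean[] essential = new boolean[TERM_SCORE.length];
            double prefix = 0;
            for (int t : order) {
                prefix += blockMax[t];
                if (prefix >= MIN_COMPETITIVE) essential[t] = true;
            }
            decisions++;
            // Every doc matched by an essential term is collected, competitive or not.
            for (int doc = start; doc < end; doc++) {
                for (int t = 0; t < TERM_SCORE.length; t++) {
                    if (essential[t] && matches(t, doc)) { collected++; break; }
                }
            }
        }
        return new int[] {decisions, collected};
    }

    /** Per-document decisions (BMW-like). Returns {decisions, collectedDocs}. */
    static int[] perDocument() {
        int decisions = 0, collected = 0;
        for (int doc = 0; doc < NUM_DOCS; doc++) {
            double bound = 0;
            boolean candidate = false;
            for (int t = 0; t < TERM_SCORE.length; t++) {
                if (matches(t, doc)) { candidate = true; bound += TERM_SCORE[t]; }
            }
            if (!candidate) continue;
            decisions++;  // one decision for every candidate document
            if (bound >= MIN_COMPETITIVE) collected++;
        }
        return new int[] {decisions, collected};
    }

    public static void main(String[] args) {
        System.out.println("block-level  {decisions, collected} = " + Arrays.toString(blockLevel()));
        System.out.println("per-document {decisions, collected} = " + Arrays.toString(perDocument()));
    }
}
```

In this toy model the block-level strategy makes far fewer decisions but 
collects documents that cannot be competitive, while the per-document 
strategy collects fewer documents at the cost of one decision per candidate. 
Which one is faster in practice depends on how expensive those decisions are.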

bq.  Are these passages data set and queries used available for download 
somewhere

Yes. You can download the "Collection" and "Queries" files from 
https://microsoft.github.io/msmarco/#ranking (make sure to accept the terms at 
the top of the page first so that the download links become active).

> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more. I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
