[
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040
]
Adrien Grand commented on LUCENE-8069:
--------------------------------------
Since I was already working with the MSMarco passages dataset for other
reasons, I wanted to try this change again, this time with the first 1000
queries from the `eval` file. Unlike the wikipedia tasks file, queries in this
dataset have many terms, often 5+ and sometimes even 10+. All of them are
disjunctions.
Lucene defaults:
- avg: 11ms
- median: 6ms
- p90: 28ms
- p99: 80ms
Index sorted by increasing field length:
- avg: 7ms
- median: 2ms
- p90: 6ms
- p99: 17ms
This seems to confirm that this approach could be very valuable.
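
To illustrate why collecting short documents first can help, here is a minimal
sketch for a single term query. It is not Lucene code: the class
`LengthSortedTopK`, its nested types, and the simplified BM25-style `score`
function (with the IDF factor omitted) are assumptions made for illustration,
as is the premise that an upper bound on term frequency (`maxTermFreq`) is
known. Because the score decreases as field length grows, the best score any
remaining document could reach is the score of a document with `maxTermFreq`
and the current field length. Once the k-th best collected score is at least
that bound, collection can stop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration, not part of Lucene.
final class LengthSortedTopK {
    /** One posting of a single term. */
    static final class Posting {
        final int docId, termFreq, fieldLength;
        Posting(int docId, int termFreq, int fieldLength) {
            this.docId = docId; this.termFreq = termFreq; this.fieldLength = fieldLength;
        }
    }

    static final class Hit {
        final int docId; final double score;
        Hit(int docId, double score) { this.docId = docId; this.score = score; }
    }

    /** Top doc ids (best first) and the number of postings actually scored. */
    static final class Result {
        final List<Integer> topDocIds; final int scored;
        Result(List<Integer> topDocIds, int scored) { this.topDocIds = topDocIds; this.scored = scored; }
    }

    static final double K1 = 1.2, B = 0.75;

    // Simplified BM25-style score: increases with tf, decreases with field length.
    static double score(int termFreq, int fieldLength, double avgLength) {
        double norm = K1 * (1 - B + B * fieldLength / avgLength);
        return termFreq * (K1 + 1) / (termFreq + norm);
    }

    // Assumes postings are ordered by increasing field length (the index sort).
    static Result topK(List<Posting> postings, int k, int maxTermFreq,
                       double avgLength, boolean earlyTerminate) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        // Min-heap on score: the head is the weakest of the current top k.
        PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        int scored = 0;
        for (Posting p : postings) {
            if (earlyTerminate && heap.size() == k) {
                // Best score any document at least this long could reach.
                double bound = score(maxTermFreq, p.fieldLength, avgLength);
                if (bound <= heap.peek().score) break;
            }
            double s = score(p.termFreq, p.fieldLength, avgLength);
            scored++;
            if (heap.size() < k) {
                heap.add(new Hit(p.docId, s));
            } else if (s > heap.peek().score) {
                heap.poll();
                heap.add(new Hit(p.docId, s));
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit h : hits) ids.add(h.docId);
        return new Result(ids, scored);
    }

    public static void main(String[] args) {
        // Made-up postings, already sorted by increasing field length.
        List<Posting> postings = Arrays.asList(
            new Posting(0, 2, 5), new Posting(1, 1, 10), new Posting(2, 3, 20),
            new Posting(3, 1, 40), new Posting(4, 2, 80), new Posting(5, 1, 160),
            new Posting(6, 3, 320), new Posting(7, 1, 640));
        Result fast = topK(postings, 2, 3, 159.375, true);
        Result full = topK(postings, 2, 3, 159.375, false);
        System.out.println("early termination: " + fast.topDocIds + ", scored " + fast.scored);
        System.out.println("exhaustive:        " + full.topDocIds + ", scored " + full.scored);
    }
}
```

Both runs return the same top hits, but the early-terminating run scores only a
prefix of the postings. With multi-term disjunctions the bound would have to
account for every term, so this sketch only covers the single-term case.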
> Allow index sorting by field length
> -----------------------------------
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by
> field length would mean we would be likely to collect the best matches first.
> Depending on the similarity implementation, this might even allow early
> termination of top-document collection on term queries.