[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040
 ] 

Adrien Grand commented on LUCENE-8069:
--------------------------------------

Since I was already working with the MSMarco passages dataset for other 
reasons, I gave this change another try using the first 1000 queries from the 
`eval` file. Unlike the wikipedia tasks file, queries in this dataset have many 
terms, often 5+, sometimes even 10+. All of them are disjunctions.

Lucene defaults:
 - avg: 11ms
 - median: 6ms
 - p90: 28ms
 - p99: 80ms

Index sorted by increasing field length:
 - avg: 7ms
 - median: 2ms
 - p90: 6ms
 - p99: 17ms

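Every reported latency figure improves, and the tail improves most (p90 from 
28ms to 6ms, p99 from 80ms to 17ms). This seems to confirm that this approach 
could be very valuable.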
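
For reference, here is a minimal sketch of how summary figures like the ones 
above can be derived from per-query latencies. The comment does not say how the 
percentiles were computed; the nearest-rank method is assumed here, and the 
class name, method names and sample latencies are made up for illustration.

```java
import java.util.Arrays;

public class LatencySummary {
    // Nearest-rank percentile over an ascending-sorted array; percent is in 1..100.
    static long percentile(long[] sorted, int percent) {
        // Integer ceiling of percent * n / 100, clamped to at least rank 1.
        int rank = Math.max(1, (percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    static double average(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds, NOT the measurements reported above.
        long[] latenciesMs = {3, 1, 2, 40, 5, 2, 7, 2, 1, 9};
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);

        System.out.println("avg:    " + average(sorted) + "ms");
        System.out.println("median: " + percentile(sorted, 50) + "ms");
        System.out.println("p90:    " + percentile(sorted, 90) + "ms");
        System.out.println("p99:    " + percentile(sorted, 99) + "ms");
    }
}
```

A few slow queries pull the average well above the median, so an average that 
sits above the p90 is possible with a skewed distribution.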

> Allow index sorting by field length
> -----------------------------------
>
>                 Key: LUCENE-8069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8069
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect the best matches first. 
> Depending on the similarity implementation, this might even allow early 
> termination of top-document collection on term queries.
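
To illustrate the early-termination idea from the description, here is a 
minimal sketch. It is not Lucene code. It assumes a BM25-style score with the 
idf factor omitted (constant for a single term), postings already in index 
order (increasing field length), and a known maximum term frequency `maxTf`. 
All class names, method names and numbers are hypothetical.

```java
import java.util.PriorityQueue;

public class LengthSortedEarlyTermination {
    static final double K1 = 1.2, B = 0.75;

    // BM25-style term-frequency component: grows with tf, shrinks with field length.
    static double score(int tf, int length, double avgLength) {
        return tf * (K1 + 1) / (tf + K1 * (1 - B + B * length / avgLength));
    }

    // postings[i] = {fieldLength, termFreq}, ordered by increasing fieldLength.
    // Fills topK (a min-heap of scores) and returns how many postings were visited.
    static int collectTopK(int[][] postings, int k, int maxTf, double avgLength,
                           PriorityQueue<Double> topK) {
        int visited = 0;
        for (int[] p : postings) {
            // Best score any document of this length or longer could reach.
            double bound = score(maxTf, p[0], avgLength);
            if (topK.size() == k && bound <= topK.peek()) {
                break; // no remaining document can beat the current k-th best
            }
            visited++;
            double s = score(p[1], p[0], avgLength);
            if (topK.size() < k) {
                topK.add(s);
            } else if (s > topK.peek()) {
                topK.poll();
                topK.add(s);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up postings for one term: {fieldLength, termFreq}.
        int[][] postings = {{5, 2}, {6, 3}, {8, 1}, {20, 1}, {40, 2}, {100, 3}};
        PriorityQueue<Double> topK = new PriorityQueue<>();
        int visited = collectTopK(postings, 2, 3, 30.0, topK);
        System.out.println("visited " + visited + " of " + postings.length
            + " postings, top scores: " + topK);
    }
}
```

With these made-up postings, collection stops after 3 of the 6 postings. 
Because the score bound only decreases as field length grows, the first posting 
whose bound cannot beat the current k-th best score ends the scan.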



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

