[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333040#comment-17333040 ]
Adrien Grand commented on LUCENE-8069:
--------------------------------------

Since I was playing with the MSMarco passages dataset for other reasons, I wanted to give this change another try with the first 1000 queries from the `eval` file. Unlike the Wikipedia tasks file, queries in this dataset have many terms, often 5+, sometimes even 10+. All of them are disjunctions.

Lucene defaults:
 - avg: 11ms
 - median: 6ms
 - p90: 28ms
 - p99: 80ms

Index sorted by increasing field length:
 - avg: 7ms
 - median: 2ms
 - p90: 6ms
 - p99: 17ms

This seems to confirm that this approach could be very valuable.

> Allow index sorting by field length
> -----------------------------------
>
>                 Key: LUCENE-8069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8069
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by
> field length would mean we would be likely to collect the best matches first.
> Depending on the similarity implementation, this might even allow early
> termination of the collection of top documents on term queries.
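
To make the reasoning in the issue description concrete, here is a minimal sketch in plain Python. It is not Lucene code and not the patch under discussion. It assumes the classic BM25 term-frequency factor with `k1=1.2` and `b=0.75`, a single-term query, and a known upper bound `max_tf` on the term frequency. The names `bm25_tf` and `top_k_with_early_exit` are hypothetical. Because the factor decreases as field length grows, visiting documents in increasing length order gives a score bound for every remaining document, and collection can stop once that bound cannot beat the current k-th best hit.

```python
import heapq


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Classic BM25 term-frequency factor (IDF omitted, it is constant per term)."""
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


def top_k_with_early_exit(docs, k, avg_len, max_tf):
    """Collect the top-k hits for a single term.

    docs: iterable of (doc_id, doc_len, tf), sorted by increasing doc_len.
    max_tf: assumed upper bound on tf across all documents.
    Returns (hits sorted by descending score, number of documents scored).
    """
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest kept hit
    scored = 0
    for doc_id, doc_len, tf in docs:
        # Every remaining doc is at least this long and has tf <= max_tf,
        # so none of them can score above this bound.
        bound = bm25_tf(max_tf, doc_len, avg_len)
        if len(heap) == k and bound <= heap[0][0]:
            break  # no remaining document can enter the top k
        scored += 1
        score = bm25_tf(tf, doc_len, avg_len)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scored


if __name__ == "__main__":
    # 100 synthetic documents with lengths 10, 20, ..., 1000 and tf = 1.
    docs = [(i, 10 * (i + 1), 1) for i in range(100)]
    avg_len = sum(length for _, length, _ in docs) / len(docs)
    hits, scored = top_k_with_early_exit(docs, k=3, avg_len=avg_len, max_tf=1)
    print([doc_id for _, doc_id in hits], "docs scored:", scored)
```

The sketch only covers a single term. For the multi-term disjunctions benchmarked above the bound would have to combine per-term maxima, which this example does not attempt.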