[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558103#comment-17558103
 ] 

Ignacio Vera commented on LUCENE-10396:
---------------------------------------

I have been thinking on the ability if visiting a single documents per field 
value as explained for Adrien above and I think we can implement it in a not 
intrusive way for SortedDocValues that have low cardinality. The idea is to add 
the following method to the SortedDocValues api with a default implementation:

{code}
  /**
   * Advances the iterator the the next document whose ordinal is distinct to 
the current ordinal
   * and returns the document number itself. Exhausts the iterator and returns 
{@link
   * #NO_MORE_DOCS} if there is no more ordinals distinct to the current one.
   *
   * <p>The behaviour of this method is <b>undefined</b> when called with 
<code> target &le; current
   * </code>, or after the iterator has exhausted. Both cases may result in 
unpredicted behaviour.
   *  
   * The default implementation just iterates over the documents and manually 
checks if the ordinal has changed but
   *  but some implementations are  considerably more efficient than that.
   *
   */
  public int advanceOrd() throws IOException {
    int doc = docID();
    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
      return doc;
    }
    final long ord = ordValue();
    do  {
      doc = nextDoc();
    } while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
    assert doc == docID();
    return doc;
  }
{code}

When consuming the doc values, if the field is the primary sort of he index and 
the cardinality is low (average of documents per field >64?), then we will use 
a DirectMonotonicWriter to write the offset for each ord which should not use 
too much disk space. When producing the doc value, we will override this method 
with a faster DirectMonotonicReader implementation.




> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
>                 Key: LUCENE-10396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to