[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields

Adrien Grand (Jira) Mon, 31 Jan 2022 23:32:18 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484830#comment-17484830
 ]


Adrien Grand edited comment on LUCENE-10396 at 2/1/22, 7:31 AM:
----------------------------------------------------------------

One interesting question about this will be the read API. What I have in mind 
for now looks something like this:

{code:java}
class SparseIndexProducer {

  /**
   * Get an iterator over ranges of doc IDs that may contain values that fall 
in the given range of values.
   */
  DocIdRangeIterator getNumeric(FieldInfo field, long rangeMin, long rangeMax);

  DocIdRangeIterator getSorted(FieldInfo field, long rangeMinOrd, long 
rangeMaxOrd);

  DocIdRangeIterator getBinary(FieldInfo field, BytesRef rangeMin, BytesRef 
rangeMax);
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return the next range of doc IDs so that range.min is greater than or 
equal to the target.
   */
  DocIdRange advance(int target);

}
{code}


was (Author: jpountz):
One interesting question about this will be the read API. What I have in mind 
for now looks something like this:

{code:java}
class SparseIndexProducer {
  DocIdRangeIterator getIterator()
}

class DocIdRange {
  int first, last;
}

class DocIdRangeIterator {

  /**
   * Return a range of doc IDs that may contain {@code value} as the value of 
the sortPos-th sort field.
   * All documents between target included and {@code range.first} excluded are 
guaranteed to not
   * contain {@code value}. {@code range.first} equal to NO_MORE_DOCS is used 
to signal that no more
   * documents may contain {@code value}.
   */
  DocIdRange nextRangeNumeric(int target, int sortPos, long value);

  /**
   * Likewise for SORTED doc values.
   */
  DocIdRange nextRangeSorted(int target, int sortPos, long ord);

  /**
   * Likewise for BINARY doc values.
   */
  DocIdRange nextRangeBinary(int target, int sortPos, BytesRef binaryValue);

}
{code}

> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
>                 Key: LUCENE-10396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I had created that example for 
> ApacheCon last year that demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index is storing the same data, it just gets 
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-10396) Automatically create sparse indexes for sort fields

Reply via email to