[
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558103#comment-17558103
]
Ignacio Vera commented on LUCENE-10396:
---------------------------------------
I have been thinking on the ability if visiting a single documents per field
value as explained for Adrien above and I think we can implement it in a not
intrusive way for SortedDocValues that have low cardinality. The idea is to add
the following method to the SortedDocValues api with a default implementation:
{code}
/**
* Advances the iterator the the next document whose ordinal is distinct to
the current ordinal
* and returns the document number itself. Exhausts the iterator and returns
{@link
* #NO_MORE_DOCS} if there is no more ordinals distinct to the current one.
*
* <p>The behaviour of this method is <b>undefined</b> when called with
<code> target ≤ current
* </code>, or after the iterator has exhausted. Both cases may result in
unpredicted behaviour.
*
* The default implementation just iterates over the documents and manually
checks if the ordinal has changed but
* but some implementations are considerably more efficient than that.
*
*/
public int advanceOrd() throws IOException {
int doc = docID();
if (doc == DocIdSetIterator.NO_MORE_DOCS) {
return doc;
}
final long ord = ordValue();
do {
doc = nextDoc();
} while (doc != DocIdSetIterator.NO_MORE_DOCS && ordValue() == ord);
assert doc == docID();
return doc;
}
{code}
When consuming the doc values, if the field is the primary sort of he index and
the cardinality is low (average of documents per field >64?), then we will use
a DirectMonotonicWriter to write the offset for each ord which should not use
too much disk space. When producing the doc value, we will override this method
with a faster DirectMonotonicReader implementation.
> Automatically create sparse indexes for sort fields
> ---------------------------------------------------
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to
> be able to early terminate sorted queries, but as a way to cluster doc IDs
> that share similar properties so that queries can take advantage of it. For
> instance imagine you're maintaining a catalog of cars for sale, by sorting by
> car type, then fuel type then price. Then all cars with the same type, fuel
> type and similar prices will be stored in a contiguous range of doc IDs.
> Without index sorting, conjunctions across these 3 fields would be almost a
> worst-case scenario as every clause might match lots of documents while their
> intersection might not. With index sorting enabled however, there's only a
> very small number of calls to advance() that would lead to doc IDs that do
> not match, because these advance() calls that do not lead to a match would
> always jump over a large number of doc IDs. I had created that example for
> ApacheCon last year that demonstrates the benefits of index sorting on
> conjunctions. In both cases, the index is storing the same data, it just gets
> different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there
> is a lot more we can do by taking advantage of the index sort explicitly. For
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range
> queries on fields that are primary sort fields by performing a binary search
> to identify the first and last documents that match the range query. I would
> like to introduce [sparse
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields
> that are used for index sorting, with the goal of improving the runtime of
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive
> and making it easier and more efficient to leverage index sorting to filter
> on subsequent sort fields. A simple form of a sparse index could consist of
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we
> wouldn't need to make it configurable and could enable it automatically as
> soon as index sorting is enabled. And it would get its own file format
> abstraction.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]