iverase opened a new pull request, #13449:
URL: https://github.com/apache/lucene/pull/13449
Speaking to Adrien about how a sparse index would look like in lucene, he
suggested that the sparse indexing does not need to be a new format bit an
additional responsibility if `DocValuesFormat`.
The idea is to add an option to add a [skip
list](https://en.wikipedia.org/wiki/Skip_list) on top of doc values and to
expose it via the `DocValuesSkipper` abstraction, which has an API that is
similar to what we're doing for impacts today. This provides a very simple
index which can be very efficient when the index is sorted and the field
belongs to the index sorting.
In order to implement it, we added a new flag in `FieldType.java` that
configures whether to create a "skip index" for doc values. This flag is only
allowed to be set on doc values of type NUMERIC, SORTED_NUMERIC, SORTED and
SORTED_SET. Attempting to index other type of doc values with the flag set
results on an exception.
This flag needs to be persisted on the `FieldInfosFormat`. This does not
require a format change as we have some unused bit flags in
`Lucene94FieldInfosFormat` that we can use.
We have changed the `DocValuesFormat` to generate the "skip index" whenever
the flag is set. For this first implementation we went to the most basic
implementation which consist in a skip list with just one level. In this level
we collect the documents statistics every 4096 documents and we write them into
the index. This basic structure already provides interesting numbers. I
discussed with Adrien that as a follow up we should introduce more levels to
the skip list and optimise the index for low cardinality fields.
In order to index a field with a skip list, we added static methods to the
doc values field, for example NumericDocValuesField#indexedField(String name,
long value) which will generated the right FieldType. In order to query it, you
can use the existing NumericDocValuesField#newSlowRangeQuery(String field,
long lowerValue, long upperValue). The generated query will use the skip index
if exists by using the `DocValuesRangeIterator`.
Finally, here are some number I got using the geonames data set from lucene
util.
The first test index the field called `modified` and adds the field as the
primary sort of the index.
```
Index LongField query LongField#newRangeQuery
INDEX TIME: 42.604 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 12.402745246887207 MB
QUERY TIME: 1157.7 ms
QUERY DOCS: 6243379080 documents
Index LongField query IndexSortSortedNumericDocValuesRangeQuery
INDEX TIME: 42.562 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 12.402745246887207 MB
QUERY TIME: 662.6 ms
QUERY DOCS: 6243379080 documents
Index Doc values skipping query SortedNumericDocValuesField#newSlowRangeQuery
INDEX TIME: 38.927 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 11.800291061401367 MB
QUERY TIME: 1072.5 ms
QUERY DOCS: 6243379080 documents
```
This basic implementation is already faster that querying using the bkd
tree. The IndexSortSortedNumericDocValuesRangeQuery is faster as it contains
many optimisations but my expectation is that we can make this index as fast if
not faster than this implementation.
The second test, we are indexing two fields and sorting the index using
them; the countryCode as primary sort and the modified field as secondary sort.
Then we execute the range queries on the modified field:
```
Index KeywordField, LongField query LongField#newRangeQuery
INDEX TIME: 50.486 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 24.378992080688477 MB
QUERY TIME: 1273.0 ms
QUERY DOCS: 6243379080 documents
Index KeywordField, LongField query
SortedNumericDocValuesField#newSlowRangeQuery
INDEX TIME: 50.486 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 24.378992080688477 MB
QUERY TIME: 13392.6 ms
QUERY DOCS: 6243379080 documents
Index Doc values skipping for both, query
SortedNumericDocValuesField#newSlowRangeQuery
INDEX TIME: 44.127 sec
INDEX DOCS: 12347608 documents
INDEX SIZE: 16.09447193145752 MB
QUERY TIME: 2975.0 ms
QUERY DOCS: 6243379080 documents
```
In this case the query is slower than the BKD tree but still much faster
than the brute approach. The advantage of the new query is that it does not
need to build the big bitset that we might need to build with the BKD tree.
relates https://github.com/apache/lucene/issues/11432
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]