iverase opened a new pull request, #13449: URL: https://github.com/apache/lucene/pull/13449
Speaking to Adrien about how a sparse index would look like in lucene, he suggested that the sparse indexing does not need to be a new format bit an additional responsibility if `DocValuesFormat`. The idea is to add an option to add a [skip list](https://en.wikipedia.org/wiki/Skip_list) on top of doc values and to expose it via the `DocValuesSkipper` abstraction, which has an API that is similar to what we're doing for impacts today. This provides a very simple index which can be very efficient when the index is sorted and the field belongs to the index sorting. In order to implement it, we added a new flag in `FieldType.java` that configures whether to create a "skip index" for doc values. This flag is only allowed to be set on doc values of type NUMERIC, SORTED_NUMERIC, SORTED and SORTED_SET. Attempting to index other type of doc values with the flag set results on an exception. This flag needs to be persisted on the `FieldInfosFormat`. This does not require a format change as we have some unused bit flags in `Lucene94FieldInfosFormat` that we can use. We have changed the `DocValuesFormat` to generate the "skip index" whenever the flag is set. For this first implementation we went to the most basic implementation which consist in a skip list with just one level. In this level we collect the documents statistics every 4096 documents and we write them into the index. This basic structure already provides interesting numbers. I discussed with Adrien that as a follow up we should introduce more levels to the skip list and optimise the index for low cardinality fields. In order to index a field with a skip list, we added static methods to the doc values field, for example NumericDocValuesField#indexedField(String name, long value) which will generated the right FieldType. In order to query it, you can use the existing NumericDocValuesField#newSlowRangeQuery(String field, long lowerValue, long upperValue). The generated query will use the skip index if exists by using the `DocValuesRangeIterator`. Finally, here are some number I got using the geonames data set from lucene util. The first test index the field called `modified` and adds the field as the primary sort of the index. ``` Index LongField query LongField#newRangeQuery INDEX TIME: 42.604 sec INDEX DOCS: 12347608 documents INDEX SIZE: 12.402745246887207 MB QUERY TIME: 1157.7 ms QUERY DOCS: 6243379080 documents Index LongField query IndexSortSortedNumericDocValuesRangeQuery INDEX TIME: 42.562 sec INDEX DOCS: 12347608 documents INDEX SIZE: 12.402745246887207 MB QUERY TIME: 662.6 ms QUERY DOCS: 6243379080 documents Index Doc values skipping query SortedNumericDocValuesField#newSlowRangeQuery INDEX TIME: 38.927 sec INDEX DOCS: 12347608 documents INDEX SIZE: 11.800291061401367 MB QUERY TIME: 1072.5 ms QUERY DOCS: 6243379080 documents ``` This basic implementation is already faster that querying using the bkd tree. The IndexSortSortedNumericDocValuesRangeQuery is faster as it contains many optimisations but my expectation is that we can make this index as fast if not faster than this implementation. The second test, we are indexing two fields and sorting the index using them; the countryCode as primary sort and the modified field as secondary sort. Then we execute the range queries on the modified field: ``` Index KeywordField, LongField query LongField#newRangeQuery INDEX TIME: 50.486 sec INDEX DOCS: 12347608 documents INDEX SIZE: 24.378992080688477 MB QUERY TIME: 1273.0 ms QUERY DOCS: 6243379080 documents Index KeywordField, LongField query SortedNumericDocValuesField#newSlowRangeQuery INDEX TIME: 50.486 sec INDEX DOCS: 12347608 documents INDEX SIZE: 24.378992080688477 MB QUERY TIME: 13392.6 ms QUERY DOCS: 6243379080 documents Index Doc values skipping for both, query SortedNumericDocValuesField#newSlowRangeQuery INDEX TIME: 44.127 sec INDEX DOCS: 12347608 documents INDEX SIZE: 16.09447193145752 MB QUERY TIME: 2975.0 ms QUERY DOCS: 6243379080 documents ``` In this case the query is slower than the BKD tree but still much faster than the brute approach. The advantage of the new query is that it does not need to build the big bitset that we might need to build with the BKD tree. relates https://github.com/apache/lucene/issues/11432 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org