[PR] Sparse index: optional skip list on top of doc values [lucene]

via GitHub Tue, 04 Jun 2024 03:21:47 -0700


iverase opened a new pull request, #13449:
URL: https://github.com/apache/lucene/pull/13449


    Speaking to Adrien about how a sparse index would look like in lucene, he 
suggested that the sparse indexing does not need to be a new format bit an 
additional responsibility if `DocValuesFormat`. 
   
   The idea is to add an option to add a [skip 
list](https://en.wikipedia.org/wiki/Skip_list) on top of doc values and to 
expose it via the `DocValuesSkipper` abstraction, which has an API that is 
similar to what we're doing for impacts today. This provides a very simple 
index which can be very efficient when the index is sorted and the field 
belongs to the index sorting.
   
   In order to implement it, we added a new flag in `FieldType.java` that 
configures whether to create a "skip index" for doc values. This flag is only 
allowed to be set on doc values of type NUMERIC, SORTED_NUMERIC, SORTED and 
SORTED_SET. Attempting to index other type of doc values with the flag set 
results on an exception.
   
   This flag needs to be persisted on the `FieldInfosFormat`. This does not 
require a format change as we have some unused bit flags  in 
`Lucene94FieldInfosFormat` that we can use.
   
   We have changed the `DocValuesFormat` to generate the "skip index" whenever 
the flag is set. For this first implementation we went to the most basic 
implementation which consist in a skip list with just one level. In this level 
we collect the documents statistics every 4096 documents and we write them into 
the index. This basic structure already provides interesting numbers. I 
discussed with Adrien that as a follow up we should introduce more levels to 
the skip list and optimise the index for low cardinality fields.
   
   In order to index a field with a skip list, we added static methods to the 
doc values field, for example NumericDocValuesField#indexedField(String name, 
long value) which will generated the right FieldType. In order to query it, you 
can use the existing  NumericDocValuesField#newSlowRangeQuery(String field, 
long lowerValue, long upperValue). The generated query will use the skip index 
if exists by using the `DocValuesRangeIterator`. 
   
   Finally, here are some number I got using the geonames data set from lucene 
util.  
   
   The first test index the field called `modified` and adds the field as the 
primary sort of the index. 
   
   ```
    Index LongField query LongField#newRangeQuery
   
     INDEX TIME: 42.604 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 12.402745246887207 MB
     QUERY TIME: 1157.7 ms
     QUERY DOCS: 6243379080 documents
   
   Index LongField query IndexSortSortedNumericDocValuesRangeQuery
   
     INDEX TIME: 42.562 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 12.402745246887207 MB
     QUERY TIME: 662.6 ms
     QUERY DOCS: 6243379080 documents
   
   Index Doc values skipping query SortedNumericDocValuesField#newSlowRangeQuery
   
     INDEX TIME: 38.927 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 11.800291061401367 MB
     QUERY TIME: 1072.5 ms
     QUERY DOCS: 6243379080 documents
   ```
   
   This basic implementation is already faster that querying using the bkd 
tree. The IndexSortSortedNumericDocValuesRangeQuery is faster as it contains 
many optimisations but my expectation is that we can make this index as fast if 
not faster than this implementation.
   
   The second test, we are indexing two fields and sorting the index using 
them; the countryCode as primary sort and the modified field as secondary sort. 
Then we execute the range queries on the modified field:
   
   ```
   Index KeywordField, LongField query LongField#newRangeQuery
   
     INDEX TIME: 50.486 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 24.378992080688477 MB
     QUERY TIME: 1273.0 ms
     QUERY DOCS: 6243379080 documents
   
   Index KeywordField, LongField query 
SortedNumericDocValuesField#newSlowRangeQuery
   
     INDEX TIME: 50.486 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 24.378992080688477 MB
     QUERY TIME: 13392.6 ms
     QUERY DOCS: 6243379080 documents
   
    Index Doc values skipping for both, query  
SortedNumericDocValuesField#newSlowRangeQuery
     INDEX TIME: 44.127 sec
     INDEX DOCS: 12347608 documents
     INDEX SIZE: 16.09447193145752 MB
     QUERY TIME: 2975.0 ms
     QUERY DOCS: 6243379080 documents
   ```
   In this case the query is slower than the BKD tree but still much faster 
than the brute approach. The advantage of the new query is that it does not 
need to build the big bitset that we might need to build with the BKD tree.
   
   relates https://github.com/apache/lucene/issues/11432


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] Sparse index: optional skip list on top of doc values [lucene]

Reply via email to