benwtrent commented on issue #11830: URL: https://github.com/apache/lucene/issues/11830#issuecomment-1319099313
I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within `vex`, the storage improvements are much better than PackedInts.

Table with some numbers around the size improvements for different data sets & parameters:

| packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params | dataset | percent_reduction |
|--------------------|-------------|-------------------------|------------------|------------------------------------|---------------------|-------------------|
| 79.9 | 161.6 | 767 | 784 | `{'M': 16, 'efConstruction': 100}` | glove-100-angular | 50.55693069 |
| 108.4 | 464.1 | 1138 | 1225 | `{'M': 48, 'efConstruction': 100}` | glove-100-angular | 76.64296488 |
| 2.3 | 8.2 | 36 | 36 | `{'M': 16, 'efConstruction': 100}` | mnist-784-euclidean | 71.95121951 |
| 2.4 | 23.5 | 36 | 36 | `{'M': 48, 'efConstruction': 100}` | mnist-784-euclidean | 89.78723404 |
| 66.1 | 392.2 | 501 | 572 | `{'M': 48, 'efConstruction': 100}` | sift-128-euclidean | 83.1463539 |
| 59.7 | 136.6 | 449 | 516 | `{'M': 16, 'efConstruction': 100}` | sift-128-euclidean | 56.29575403 |

For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:

# Glove

[image: QPS plot, packed vs baseline]

# MNIST

[image: QPS plot, packed vs baseline]

# SIFT

[image: QPS plot, packed vs baseline]
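For readers unfamiliar with the "delta encoding & vint" approach mentioned in the comment, here is a minimal Python sketch of the general technique applied to a sorted list of neighbor node ids. It is not the code from the PR: the function names (`write_vint`, `encode_neighbors`, and so on) and the layout (a count followed by gaps) are assumptions made purely for illustration, and the on-disk format actually used in Lucene may well differ.

```python
from typing import Iterable, List, Tuple


def write_vint(value: int, out: bytearray) -> None:
    """Append a non-negative int as a variable-length integer (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear


def read_vint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one vint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if (b & 0x80) == 0:
            return value, pos
        shift += 7


def encode_neighbors(neighbors: Iterable[int]) -> bytes:
    """Hypothetical layout: vint count, then vint gaps between sorted ids."""
    ids = sorted(neighbors)
    out = bytearray()
    write_vint(len(ids), out)
    prev = 0
    for node in ids:
        write_vint(node - prev, out)  # small gaps fit in a single byte
        prev = node
    return bytes(out)


def decode_neighbors(buf: bytes) -> List[int]:
    """Invert encode_neighbors by summing the gaps back into absolute ids."""
    count, pos = read_vint(buf, 0)
    ids = []
    prev = 0
    for _ in range(count):
        delta, pos = read_vint(buf, pos)
        prev += delta
        ids.append(prev)
    return ids
```

With this sketch, the ids `[1000, 1003, 1010, 1500]` take 7 bytes (one for the count, two for the first id, and one or two for each gap), compared with 16 bytes if each id were stored as a fixed 4-byte integer. The saving depends on how close neighboring ids are to each other, so it will vary by data set.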